Hyper-Converged Database 2.0 lands in my DataStax docs agent — plus 500+ refreshed pages

DataStax shipped Hyper-Converged Database 2.0. The same afternoon, my DataStax docs agent could answer questions about it — 501 new HCD 2.0 pages, pulled in by a recheck of the whole corpus.

This post is the corpus side of that agent: how I notice when DataStax changes the docs, and how the new pages get into the vector store the agent searches. The agent itself — the bit that decides when to search — is its own story.

Try it: DataStax Docs Agent

What you can now ask

The agent reads a mirror of the DataStax docs stored in Astra DB. That mirror was 6,608 pages. After this refresh it’s 7,228 — most of the growth is HCD 2.0.

Product	New pages	Changed
Hyper-Converged Database (incl. 2.0)	533	885
DataStax Enterprise	52	52
Astra DB Serverless	18	118
Astra DB Classic	16	12
Mission Control	4	13
CQL	0	30

HCD 2.0 is the headline, and it’s a real jump, not a version bump. The big new surface is the Astra Data API on the self-managed product — the JSON document/collection API (and the new Tables model, with Python/TypeScript/Java/C# clients) that used to be an Astra-only idea. There’s also lexical search in public preview alongside vector search, and first-class zero-downtime migration paths from DSE 5.1 and 6.8 to HCD 2.0.

So questions that returned nothing useful last week now land. I asked the live agent:

Does Hyper-Converged Database 2.0 support lexical search?

[Screenshot: the DataStax docs agent answering a question about HCD 2.0 lexical search, with a code example and source links to the HCD 2.0 documentation]

Three hybrid searches, then a grounded answer: yes, it’s public preview, here’s the lexical config object, here are the analyzers, and here’s the AWS us-east-2 caveat — every claim cited back to a hyper-converged-database/2.0/... page that didn’t exist in the store a few hours earlier.

How I know what changed

I don’t get a webhook when DataStax publishes. I pull each product’s sitemap and diff it. The corpus is 26 DataStax products, each a folder that tracks its own docs.datastax.com sitemap, and the refresh is one differential pass per product:

dx snapshot <product> --force   # re-read the sitemap, refetch, diff
dx extract  <product>            # HTML -> markdown, only changed pages
dx rag      <product>            # markdown -> RAG pages, only changed
dx astra-sync <product>          # changed pages -> Astra DB
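
As a rough illustration of what the snapshot stage does, here is a minimal Python sketch. It is not the dx implementation: it assumes a standard sitemaps.org XML file and a stored snapshot that maps each URL to the hash of its last-fetched HTML. The names parse_sitemap, html_hash and diff_snapshot are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def html_hash(html: str) -> str:
    """Hash the raw HTML exactly as fetched (boilerplate included)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def diff_snapshot(old: dict[str, str], fetched: dict[str, str]) -> dict[str, list[str]]:
    """Classify URLs by comparing stored HTML hashes with freshly fetched HTML.

    old:     url -> HTML hash from the previous snapshot (assumed layout)
    fetched: url -> HTML text fetched during this run
    """
    result: dict[str, list[str]] = {
        "unchanged": [], "changed": [], "added": [], "removed": [],
    }
    for url, html in fetched.items():
        if url not in old:
            result["added"].append(url)
        elif old[url] == html_hash(html):
            result["unchanged"].append(url)
        else:
            result["changed"].append(url)
    result["removed"] = [url for url in old if url not in fetched]
    return result
```

A diff like this only sees raw HTML, so any byte that differs between fetches counts as a change.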

The interesting part is what “changed” means at each stage, because the first stage lies to you.

Run the snapshot over a product and almost every page comes back changed. The docs themselves didn’t change; the rendered HTML carries a footer, nav, and build markers that differ from one fetch to the next. If I trusted that, every refresh would re-embed the entire corpus.

So the hash that matters isn’t the HTML hash. After the HTML is converted to clean markdown, the boilerplate is gone, and most pages hash identically to last time. Here’s a product where the docs genuinely didn’t change:

astra-cli: snapshot updated — unchanged 0, changed 122, added 0, removed 0
web-docling: DONE (122 pages — extracted 122, skipped 0)
rag: DONE (122 pages — written 0, skipped 122, removed 0)
astra-sync (datastax_docs_hybrid): new=0 changed=0 unchanged=122 ... inserted_chunks=0

122 “changed” at the top, zero chunks embedded at the bottom. There are three hash layers (HTML, markdown, then the per-page checksum the sync compares), and noise falls out at each one. Across the whole refresh the sitemaps flagged thousands of pages as “changed” (DataStax Enterprise alone reported 3,054), but only about 1,125 had real content edits. Most of those were in HCD, which went from 888 pages to 1,418.
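
The gap between the HTML hash and the content hash can be sketched in a few lines of Python. This is an assumption-laden stand-in, not the real extractor: it uses the standard library's html.parser instead of an HTML-to-markdown converter, and it assumes the boilerplate lives in nav, footer, script, and style elements. The names ContentText, raw_hash and content_hash are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Assumed boilerplate containers; a real extractor would be more selective.
SKIP_TAGS = {"nav", "footer", "script", "style"}


class ContentText(HTMLParser):
    """Collect text that sits outside the boilerplate containers."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def raw_hash(html: str) -> str:
    """Hash of the page exactly as fetched."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def content_hash(html: str) -> str:
    """Hash of the whitespace-normalised text left after boilerplate is dropped."""
    parser = ContentText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Two fetches of a page that differ only inside the footer get different raw_hash values but the same content_hash, which is the behaviour the log above shows at product scale.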

How it lands in the store

The agent searches one Astra DB collection, datastax_docs_hybrid — hybrid vector + lexical (BM25), with NVIDIA’s hosted embedding (nv-embedqa-e5-v5) doing the vectorising server-side. The sync is differential against a checksum manifest: new and changed pages get their old chunks deleted and re-inserted by a deterministic id, removed pages get deleted, unchanged pages are skipped. It’s idempotent — if it dies halfway through HCD’s 1,418 pages, I run it again and it picks up where it left off.
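
The shape of that differential sync can be sketched in Python. This is a simplified model under stated assumptions, not the actual sync code: a plain dict stands in for the Astra DB collection, the manifest is a dict of URL to content checksum, and chunk_id, plan_sync, apply_sync and the chunker argument are hypothetical names.

```python
import hashlib
import uuid
from typing import Callable


def checksum(text: str) -> str:
    """Content checksum stored in the manifest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_id(page_url: str, index: int) -> str:
    """Deterministic id: the same page and chunk position always give the same id."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{page_url}#chunk-{index}"))


def plan_sync(manifest: dict[str, str], pages: dict[str, str]) -> dict[str, list[str]]:
    """Compare current page text against the manifest and classify every URL."""
    plan: dict[str, list[str]] = {"new": [], "changed": [], "unchanged": [], "removed": []}
    for url, text in pages.items():
        if url not in manifest:
            plan["new"].append(url)
        elif manifest[url] == checksum(text):
            plan["unchanged"].append(url)
        else:
            plan["changed"].append(url)
    plan["removed"] = [url for url in manifest if url not in pages]
    return plan


def apply_sync(
    manifest: dict[str, str],
    pages: dict[str, str],
    store: dict[str, tuple[str, str]],   # chunk id -> (page url, chunk text); stand-in for the collection
    chunker: Callable[[str], list[str]],
) -> dict[str, list[str]]:
    plan = plan_sync(manifest, pages)
    # Drop any existing chunks for pages that will be rewritten or removed.
    stale = set(plan["new"]) | set(plan["changed"]) | set(plan["removed"])
    for cid in [c for c, (owner, _) in store.items() if owner in stale]:
        del store[cid]
    for url in plan["new"] + plan["changed"]:
        for i, chunk in enumerate(chunker(pages[url])):
            store[chunk_id(url, i)] = (url, chunk)
        # Record the checksum only after the page's chunks are written,
        # so an interrupted run retries this page next time.
        manifest[url] = checksum(pages[url])
    for url in plan["removed"]:
        del manifest[url]
    return plan
```

Running apply_sync twice over the same pages does no work the second time, and a page that keeps its URL but changes its text lands in the changed bucket because the comparison is on content, not on path.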

The real cost is embedding, not storage. This refresh pushed ~141,000 chunks through the embedder. Most of that was HCD’s dense API reference, and a surprising share was Mission Control’s auto-generated CRD pages, which are enormous. I ran it in waves (a small product first as a canary, then HCD on its own, then the rest) and watched the counts. Every product’s live result matched the dry run exactly.

This is the part the old version got wrong. Back then the ingest state was keyed by file path, so a page whose content changed but kept its path was skipped — the corpus could silently rot. Keying on a content checksum instead is the whole fix.

What’s still rough

It’s a beta, and the agent earns that label.

Ask it the lexical-search question and it’s great. Ask it something broader — “can I create and query CQL-style tables through the HCD 2.0 Data API?” — and I watched it run 12 searches and stop on “max iterations” with no answer. The retrieval was finding the right pages; the agent just kept refining its query and ran out the clock. A fixed RAG pipeline would have answered (badly, maybe); the agent’s freedom to re-search is also its freedom to spin. That’s the trade.
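
To make the failure mode concrete, here is a minimal Python sketch of an iteration-capped search loop. It is not the agent's real code: run_agent, search and decide are hypothetical names, and decide stands in for the model call that either answers or asks for another search.

```python
from typing import Callable, Optional


def run_agent(
    question: str,
    search: Callable[[str], list[str]],
    decide: Callable[[str, list[str]], tuple[str, str]],
    max_iterations: int = 12,
) -> dict[str, Optional[object]]:
    """Search, let the model decide, and stop on an answer or at the cap."""
    query = question
    notes: list[str] = []
    for step in range(1, max_iterations + 1):
        notes.extend(search(query))
        action, value = decide(question, notes)
        if action == "answer":
            return {"answer": value, "searches": step, "stopped": "answered"}
        # Anything else is treated as a request to search again with a refined query.
        query = value
    # The cap was reached without the model ever choosing to answer.
    return {"answer": None, "searches": max_iterations, "stopped": "max iterations"}
```

If decide keeps asking for refinements, the loop ends with no answer even though notes may already hold the right pages, which is the spin described above.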

Other honest notes: the refresh is manual. I run it when I notice a release; there’s no cron. The agent loop adds latency; you wait while it searches. And the live document count Astra reports is an estimate that runs high right after a big write, until compaction catches up, so the checksum manifest is the number I trust.

None of that changes the headline: HCD 2.0 is in the store, and the agent cites it.

Try it: DataStax Docs Agent · earlier versions: v1 · v2