Re-ingesting a RAG doc corpus when upstream docs change


The Astra Docs Chat corpus was loaded once. DataStax ships doc updates regularly; v1 does not auto-refresh. This post outlines a repeatable re-ingest strategy when the markdown export changes: what to detect, how to use existing resume tooling, and when full rebuild beats incremental upsert.

Series: Building Astra Docs Chat · Batch ingest script · Chunking and embedding

Try the chat: Astra Docs Chat


For a personal reference tool, running ingest after major doc releases (or when answers feel stale) may be sufficient. No cron, no diff pipeline, just:

  1. Re-run your crawl/extract step to refresh the local pages/ markdown export
  2. Re-run the Langflow batch ingest script with appropriate flags
  3. Spot-check five questions on Astra Docs Chat

That matches how v1 shipped: one batch load, manual refresh when I notice drift.

Automation is optional until stale answers become painful. Docs-only guardrails reduce harm from stale vectors but do not replace fresh content.


I track crawl and ingest in separate local state files:

| File | Tracks | Phase |
| --- | --- | --- |
| page_state.json | Crawl/extract per URL (hash, filename) | Markdown export |
| ingest_state.json | Langflow ingest per file path | Vector load |

Crawl state (page_state.json) tracks upstream doc URLs and content hashes. Ingest state (ingest_state.json) tracks which local files have already been uploaded and embedded.

Crawl state uses SHA-256 hashes:

import hashlib

def compute_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

During extract, unchanged pages can skip re-writing markdown. After a crawl, compare what changed before you spend embedding API credits.
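One way to make that comparison is to diff the previous and current page_state.json snapshots. The sketch below assumes each entry maps a URL to an object with a "hash" field. The exact schema, the helper names, and the idea of keeping a copy of the previous file are assumptions, not what the v1 tooling does.

```python
import json
from pathlib import Path


def load_state(path: Path) -> dict:
    """Load a JSON state file; a missing file is treated as empty state."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def diff_page_state(old: dict, new: dict) -> dict:
    """Compare two crawl snapshots keyed by URL (assumed shape: {url: {"hash": ...}})."""
    added = sorted(url for url in new if url not in old)
    removed = sorted(url for url in old if url not in new)
    changed = sorted(
        url for url in new
        if url in old and new[url].get("hash") != old[url].get("hash")
    )
    return {"added": added, "changed": changed, "removed": removed}
```

Copy page_state.json aside before each crawl (for example to page_state.prev.json, a hypothetical name), then call diff_page_state(load_state(prev), load_state(current)). The length of the "changed" list is a reasonable signal for whether a re-ingest is worth the embedding spend.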

Ingest state is path-based only in v1:

{
  "pages/administration_audit-log.md": {
    "status": "ingested",
    "uploaded_path": "7b90824f-.../administration_audit-log.md"
  }
}

Gap: if markdown content changes but the path is unchanged, the batch ingest script skips the file because status is already ingested. Re-ingest requires deliberate state edits or a script enhancement.


After a fresh crawl/export:

  • New files in pages/: the batch script picks them up (not in ingest_state.json)
  • Changed files: remove their keys from ingest_state.json, or extend the ingest script to compare compute_hash() against a stored hash and re-run on mismatch
  • Deleted upstream pages: vectors for removed topics may linger until full collection rebuild or explicit deletion by metadata
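Before touching state, it helps to see which files fall into the first and third cases. This is a minimal sketch, assuming state keys are relative paths such as pages/administration_audit-log.md (as in the example above) and that root is the directory containing pages/. classify_files is a hypothetical helper. It cannot detect changed content, because v1 state stores no hash.

```python
from pathlib import Path


def classify_files(root: Path, state: dict, subdir: str = "pages") -> dict:
    """Compare the markdown export on disk with the keys in ingest state."""
    on_disk = {
        p.relative_to(root).as_posix()
        for p in (root / subdir).rglob("*.md")
    }
    tracked = set(state)
    return {
        "new": sorted(on_disk - tracked),       # exported but never ingested
        "orphaned": sorted(tracked - on_disk),  # ingested, file no longer exported
    }
```

Entries under "orphaned" are the candidates for lingering vectors: the page is gone from the export, but its chunks are presumably still in the collection.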

Recommended incremental workflow today:

# 1. Refresh markdown export (your crawl/extract tooling)
# ...

# 2. Clear ingest state for files you know changed (or all keys for a major release)
# edit ingest_state.json manually or script it

# 3. Re-ingest via Langflow API (see batch ingest post for script flags)
python ingest_langflow.py --retry-failed

Check ingest_failed.log after every run.
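Step 2 can be scripted rather than edited by hand. This is a minimal sketch, assuming ingest_state.json is a flat JSON object keyed by file path, as in the example earlier. clear_ingest_keys is a hypothetical helper, not part of the batch ingest script, and the .bak backup is my own precaution.

```python
import json
import shutil
from pathlib import Path


def clear_ingest_keys(state_path: Path, changed_paths: list[str]) -> list[str]:
    """Remove the given keys from the ingest state file; return the keys actually removed."""
    # Keep a backup so a wrong list of paths is easy to undo.
    shutil.copy2(state_path, Path(str(state_path) + ".bak"))
    state = json.loads(state_path.read_text(encoding="utf-8"))
    removed = [key for key in changed_paths if state.pop(key, None) is not None]
    state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return removed
```

For example, clear_ingest_keys(Path("ingest_state.json"), ["pages/administration_audit-log.md"]) drops that one entry, so the next ingest run treats the file as not yet ingested.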


| Strategy | Pros | Cons |
| --- | --- | --- |
| Append / upsert per file | Fast, resumable | Orphan vectors if pages removed; duplicate chunks if not deduped |
| Truncate collection + full ingest | Clean slate | 2-4 hours embedding time (batch post timing) |
| Collection per version | Easy rollback | More ops complexity |

For 271 pages, full rebuild is painful but simple: truncate datastax_astra_docs, delete ingest_state.json, run a full batch ingest pass (see the batch post).

Incremental with hash-aware ingest state is the sweet spot for repeat runs. A minimal enhancement:

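# sketch: run for each file just before the "already ingested" skip check
entry = state.get(key)  # None for files that were never ingested
if entry is not None and entry.get("hash") != compute_hash(file.read_text(encoding="utf-8")):
    del state[key]  # hash missing or changed: force re-ingest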

Store hash alongside ingested status when saving state after success.
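Put together, the check and the save step might look like the sketch below. needs_ingest and mark_ingested are hypothetical names, and the "hash" field is the proposed addition, not something v1 state contains. Entries without a stored hash are treated as stale, so the first run after adopting this would re-ingest every file.

```python
import hashlib
from pathlib import Path


def compute_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def needs_ingest(state: dict, key: str, path: Path) -> bool:
    """True if the file is new, not marked ingested, or its content hash changed."""
    entry = state.get(key)
    if entry is None or entry.get("status") != "ingested":
        return True
    return entry.get("hash") != compute_hash(path.read_text(encoding="utf-8"))


def mark_ingested(state: dict, key: str, path: Path, uploaded_path: str) -> None:
    """Record success together with the hash of the content on disk."""
    state[key] = {
        "status": "ingested",
        "uploaded_path": uploaded_path,
        "hash": compute_hash(path.read_text(encoding="utf-8")),
    }
```

Because v1 appends rather than upserts, a forced re-ingest of a changed file still leaves the old chunks in the collection. The hash check saves embedding calls; it does not remove stale vectors.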

Langflow/Astra upsert behaviour depends on component settings (deletion_field, document ids). v1 appends; duplicates can inflate retrieval noise until you rebuild.


The batch ingest script (described in the previous post) already supports resume flags, for example:

# After fresh export, retry failures only
python ingest_langflow.py --retry-failed

# Force single file (delete its ingest_state.json entry first)
python ingest_langflow.py --limit 1

# Smoke test
python ingest_langflow.py --limit 3

Each file: upload → run ingest endpoint datastax-astra-ingest → atomic state save. Ctrl+C safe.
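The atomic save is what makes Ctrl+C safe: a half-written state file would lose the resume point. The script's own implementation is not shown in this post. A common way to get that property, sketched here as an assumption, is to write to a temporary file in the same directory and swap it into place with os.replace. save_state_atomic is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state_atomic(state: dict, path: Path) -> None:
    """Write state to a temp file beside the target, then replace the target in one step."""
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        # Same directory means same filesystem, so the rename is atomic on POSIX.
        os.replace(tmp_name, path)
    except BaseException:
        # Covers KeyboardInterrupt too: remove the partial temp file, keep the old state.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process is interrupted mid-write, ingest_state.json still holds the last complete save, and the next run resumes from there.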

See the Langflow ingest flow post for graph details, and the chunking post if you change split settings during a refresh (that usually warrants a full rebuild).


If manual runs become tedious:

  • Scheduled job on your machine or CI: export → ingest on a cron you control
  • Alert when a crawl discovers N new/changed pages (diff page_state.json hashes)
  • Optional webhook from docs pipeline (unlikely for third-party docs you do not control)

Keep Langflow and Astra credentials in CI secrets, handled the same way as your local LANGFLOW_API_KEY. Ingest runs from CI or your laptop, not from public visitors.

Do not expose /api/v2/files/ to the open internet without network restrictions (see the self-hosting post).


Post-refresh validation

Same as initial load:

  1. Five spot-check questions in Langflow Playground
  2. Same five on Astra Docs Chat
  3. Pay attention to renamed API fields: stale vectors plus confident LLM answers are the worst combination

Good test questions from the live UI starters:

  • Collection creation steps
  • PCU groups definition
  • Hybrid search behaviour

If answers reference removed features, prefer full rebuild over incremental patch.


Re-ingest spend is mostly OpenAI embedding API calls, not Astra storage (see the Astra vector store post). Budget for a full 271-file run before you schedule weekly refreshes.
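A rough budget can be estimated from the export size before committing to a schedule. The sketch below is only arithmetic under loud assumptions: about four characters per token (a rule of thumb for English text, not a tokenizer count), no chunk overlap, and a price per million tokens that you supply from the current OpenAI pricing page. None of these figures come from the v1 build, and estimate_embedding_cost is a hypothetical helper.

```python
from pathlib import Path


def estimate_embedding_cost(
    pages_dir: Path,
    usd_per_million_tokens: float,
    chars_per_token: float = 4.0,  # assumption: rough heuristic, not a real tokenizer
) -> dict:
    """Estimate the embedding cost of one full pass over the markdown export."""
    files = list(pages_dir.rglob("*.md"))
    total_chars = sum(len(p.read_text(encoding="utf-8")) for p in files)
    est_tokens = total_chars / chars_per_token
    return {
        "files": len(files),
        "est_tokens": round(est_tokens),
        "est_usd": est_tokens / 1_000_000 * usd_per_million_tokens,
    }
```

Multiply est_usd by the number of full runs per year to see what a weekly schedule would cost. Chunk overlap, if your split settings use it, pushes the real figure higher.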


Process documented; automation not built in v1. The batch ingest pattern and state files are what I use for manual refresh; hash-aware incremental ingest is the obvious next improvement.


Series index: Building Astra Docs Chat

Open Astra Docs Chat after your next re-ingest and compare answers to the live docs site.
