Re-ingesting a RAG doc corpus when upstream docs change

Update (June 2026): the corpus has since moved to an agentic hybrid setup on Astra DB, with a differential recheck and content-checksum sync that fixes the path-keyed gap described below. This post documents the older v1 (Langflow + OpenAI) pipeline.

The Astra Docs Chat corpus was loaded once. DataStax ships doc updates regularly; v1 does not auto-refresh. This post outlines a repeatable re-ingest strategy when the markdown export changes: what to detect, how to use existing resume tooling, and when full rebuild beats incremental upsert.

Series: Building Astra Docs Chat · Batch ingest script · Chunking and embedding

Try the chat: Astra Docs Chat

When manual re-ingest is enough

For a personal reference tool, running ingest after major doc releases (or when answers feel stale) may be sufficient. No cron, no diff pipeline, just:

Re-run your crawl/extract step to refresh the local pages/ markdown export
Re-run the Langflow batch ingest script with appropriate flags
Spot-check five questions on Astra Docs Chat

That matches how v1 shipped: one batch load, manual refresh when I notice drift.

Automation is optional until stale answers become painful. Docs-only guardrails reduce harm from stale vectors but do not replace fresh content.

Two state files, two jobs

I track crawl and ingest in separate local state files:

File	Tracks	Phase
`page_state.json`	Crawl/extract per URL (hash, filename)	Markdown export
`ingest_state.json`	Langflow ingest per file path	Vector load

[Screenshot: page_state.json, listing extracted DataStax Astra DB Serverless documentation URLs with SHA-256 content hashes and local markdown filenames]

[Screenshot: ingest_state.json, listing Langflow batch ingest status and uploaded file path per markdown page]

Crawl state (page_state.json) tracks upstream doc URLs and content hashes. Ingest state (ingest_state.json) tracks which local files have already been uploaded and embedded.

Crawl state uses SHA-256 hashes:

import hashlib

def compute_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

During extract, unchanged pages can skip re-writing markdown. After a crawl, compare what changed before you spend embedding API credits.
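
The comparison step can be sketched as a diff between two crawl snapshots. This is a minimal sketch, not the crawler's actual code: it assumes page_state.json maps each URL to an object with "hash" and "filename" fields (inferred from the table above), and that you keep a copy of the previous state before crawling. The function names and the example URLs are hypothetical.

```python
import json
from pathlib import Path


def load_page_state(path):
    """Load a crawl state file; a missing file is treated as an empty crawl."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}


def diff_page_state(old, new):
    """Compare two crawl snapshots keyed by URL and bucket the differences."""
    added = sorted(url for url in new if url not in old)
    removed = sorted(url for url in old if url not in new)
    changed = sorted(
        url for url in new
        if url in old and new[url].get("hash") != old[url].get("hash")
    )
    return {"added": added, "changed": changed, "removed": removed}


# Assumed layout: URL -> {"hash": ..., "filename": ...}
old = {
    "https://example.com/docs/a": {"hash": "aaa", "filename": "a.md"},
    "https://example.com/docs/b": {"hash": "bbb", "filename": "b.md"},
}
new = {
    "https://example.com/docs/a": {"hash": "aaa", "filename": "a.md"},
    "https://example.com/docs/b": {"hash": "b2b", "filename": "b.md"},
    "https://example.com/docs/c": {"hash": "ccc", "filename": "c.md"},
}
print(diff_page_state(old, new))
```

If the "added" and "changed" buckets are both empty, there is nothing to embed and the ingest step can be skipped for that crawl.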

Ingest state is path-based only in v1:

{
  "pages/administration_audit-log.md": {
    "status": "ingested",
    "uploaded_path": "7b90824f-.../administration_audit-log.md"
  }
}

Gap: if markdown content changes but the path is unchanged, the batch ingest script skips the file because status is already ingested. Re-ingest requires deliberate state edits or a script enhancement.

Detecting what changed

After a fresh crawl/export:

New files in pages/: the batch script picks them up (not in ingest_state.json)
Changed files: remove their keys from ingest_state.json, or extend the ingest script to compare compute_hash() against a stored hash and re-run on mismatch
Deleted upstream pages: vectors for removed topics may linger until full collection rebuild or explicit deletion by metadata
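
The three cases above can be sketched as one classification pass over the local export. This is an illustration under assumptions: the export is a flat directory named pages/, state keys look like "pages/<filename>.md" as in the example above, and entries carry an optional "hash" field that v1 does not store. classify_pages is a hypothetical helper, not part of the ingest script.

```python
import hashlib
from pathlib import Path


def compute_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def classify_pages(pages_dir, state):
    """Bucket markdown files as new, changed, or orphaned relative to ingest state."""
    pages_dir = Path(pages_dir)
    on_disk = {}
    for path in sorted(pages_dir.glob("*.md")):
        # Assumed key format, e.g. "pages/administration_audit-log.md"
        key = f"{pages_dir.name}/{path.name}"
        on_disk[key] = compute_hash(path.read_text(encoding="utf-8"))

    # New: exported locally but never recorded in ingest state.
    new = [k for k in on_disk if k not in state]
    # Changed: only detectable when the entry stored a hash (hypothetical field).
    changed = [
        k for k, digest in on_disk.items()
        if k in state and "hash" in state[k] and state[k]["hash"] != digest
    ]
    # Orphaned: recorded as ingested, but the file is gone from the export.
    orphaned = sorted(k for k in state if k not in on_disk)
    return {"new": new, "changed": changed, "orphaned": orphaned}
```

Entries written by v1 have no hash, so they never show up as changed here; they still need manual key removal. The orphaned bucket only tells you which vectors may be lingering, it does not delete anything from the collection.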

Recommended incremental workflow today:

# 1. Refresh markdown export (your crawl/extract tooling)
# ...

# 2. Clear ingest state for files you know changed (or all keys for a major release)
# edit ingest_state.json manually or script it

# 3. Re-ingest via Langflow API (see batch ingest post for script flags)
python ingest_langflow.py --retry-failed

Check ingest_failed.log after every run.
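
Step 2 can be scripted rather than edited by hand. The following is a minimal sketch that assumes ingest_state.json is a flat JSON object keyed by file path, as in the example earlier; clear_ingest_keys is a hypothetical helper and not a flag or function of the ingest script.

```python
import json
from pathlib import Path


def clear_ingest_keys(state_path, keys):
    """Drop the given keys from the ingest state file; return the keys actually removed."""
    state_path = Path(state_path)
    state = json.loads(state_path.read_text(encoding="utf-8"))
    # pop() with a default ignores keys that were never ingested.
    removed = [k for k in keys if state.pop(k, None) is not None]
    state_path.write_text(json.dumps(state, indent=2) + "\n", encoding="utf-8")
    return removed


# Example call (path and key taken from the state example above):
# clear_ingest_keys("ingest_state.json", ["pages/administration_audit-log.md"])
```

Copying ingest_state.json somewhere safe before running this is cheap insurance, since a cleared key means the file will be uploaded and embedded again on the next run.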

Incremental vs full rebuild

Strategy	Pros	Cons
Append / upsert per file	Fast, resumable	Orphan vectors if pages removed; duplicate chunks if not deduped
Truncate collection + full ingest	Clean slate	2-4 hours embedding time (see batch post timing)
Collection per version	Easy rollback	More ops complexity

For 271 pages, full rebuild is painful but simple: truncate datastax_astra_docs, delete ingest_state.json, run a full batch ingest pass (see the batch post).

Incremental with hash-aware ingest state is the sweet spot for repeat runs. A minimal enhancement:

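# sketch: run for each file, before the skip check
entry = state.get(key)  # None for files that were never ingested
if entry and entry.get("hash") != compute_hash(file.read_text(encoding="utf-8")):
    del state[key]  # content changed (or no hash stored yet): force re-ingest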

Store hash alongside ingested status when saving state after success.
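
One way to wire the hash into both sides of the state (the skip check and the save after success) is sketched below. The "hash" field and the two helper names are assumptions for illustration; v1 stores only status and uploaded_path.

```python
import hashlib


def compute_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def needs_ingest(state, key, content):
    """Return True when the file is new, previously failed, or its content changed."""
    entry = state.get(key)
    if entry is None or entry.get("status") != "ingested":
        return True
    # Entries saved by v1 have no hash, so they compare unequal and re-ingest once.
    return entry.get("hash") != compute_hash(content)


def mark_ingested(state, key, content, uploaded_path):
    """Record a successful ingest together with the hash of what was embedded."""
    state[key] = {
        "status": "ingested",
        "uploaded_path": uploaded_path,
        "hash": compute_hash(content),  # hypothetical field, not present in v1
    }


state = {}
mark_ingested(state, "pages/example.md", "# Title\nbody", "uploads/example.md")
print(needs_ingest(state, "pages/example.md", "# Title\nbody"))
print(needs_ingest(state, "pages/example.md", "# Title\nnew body"))
```

Hashing the same text that was sent for embedding matters: if the hash is taken from a different representation (for example before normalising line endings), unchanged files can look changed on every run.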

Langflow/Astra upsert behaviour depends on component settings (deletion_field, document ids). v1 appends; duplicates can inflate retrieval noise until you rebuild.

Using existing resume tooling

The batch ingest script (described in the previous post) already supports resume flags, for example:

# After fresh export, retry failures only
python ingest_langflow.py --retry-failed

# Force single file (delete its ingest_state.json entry first)
python ingest_langflow.py --limit 1

# Smoke test
python ingest_langflow.py --limit 3

Each file: upload → run ingest endpoint datastax-astra-ingest → atomic state save. Ctrl+C safe.
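
For readers writing their own state handling, a common way to get an interrupt-safe save is to write to a temporary file and rename it over the real one. This sketch shows that general pattern; it is not necessarily how the batch script implements its atomic save, and save_state_atomic is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state_atomic(state, state_path):
    """Write state to a temp file in the same directory, then swap it into place."""
    state_path = Path(state_path)
    fd, tmp_name = tempfile.mkstemp(
        dir=str(state_path.parent), prefix=state_path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure bytes hit disk before the rename
        # Rename within one directory, so readers see the old or new file, never a partial one.
        os.replace(tmp_name, state_path)
    except BaseException:
        # Includes KeyboardInterrupt: remove the partial temp file, keep the old state.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Creating the temp file next to the target keeps both on the same filesystem, which is what lets the final rename behave atomically. Saving after every file, rather than once at the end, is what makes a resumed run pick up where the interrupted one stopped.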

See the Langflow ingest flow post for graph details, and the chunking post if you change split settings during a refresh (that usually warrants a full rebuild).

Automating later

If manual runs become tedious:

Scheduled job on your machine or CI: export → ingest on a cron you control
Alert when a crawl discovers N new/changed pages (diff page_state.json hashes)
Optional webhook from docs pipeline (unlikely for third-party docs you do not control)

Keep Langflow and Astra credentials in CI secrets, handled the same way as your local LANGFLOW_API_KEY. Ingest runs from CI or your laptop, not from public visitors.

Do not expose /api/v2/files/ to the open internet without network restrictions (see the self-hosting post).

Post-refresh validation

Same as initial load:

Five spot-check questions in Langflow Playground
Same five on Astra Docs Chat
Pay attention to renamed API fields: stale vectors plus confident LLM answers are the worst combination

Good test questions from the live UI starters:

Collection creation steps
PCU groups definition
Hybrid search behaviour

If answers reference removed features, prefer full rebuild over incremental patch.

Chunking and embedding technical docs for RAG: when refresh is a good time to revisit split settings
Docs-only guardrails: stale vectors plus confident answers are the worst combination

Series index: Building Astra Docs Chat

Open Astra Docs Chat after your next re-ingest and compare answers to the live docs site.