Rebuilding the document analyzer on Cloudflare’s full stack

Earlier this year I wrote about building a document analyzer using Ollama and Llama 2 running on my NAS at home. It worked. But I was already running the rest of the project on Cloudflare Workers, and having the AI piece live on a home server felt increasingly out of place. If the NAS was slow, the tool was slow. If it was off, the tool was off.

The obvious move was to go full Cloudflare. This post is about what that looks like now and, specifically, what happens behind the scenes when you press Analyze.

You can try it here: Document Analyzer (jamieede.com/analyzer)

Screenshot: the document analyzer interface showing a completed analysis. A collapsed “Show Thinking” spoiler sits above the result area, which contains a multi-sentence summary of the uploaded document.

What changed at the top level

The original setup was:

Ollama running on my NAS, exposed via Cloudflare Tunnel
A single Cloudflare Worker handling requests
Three prompts: summarize, key points, sentiment

The new version:

Workers AI running @cf/google/gemma-4-26b-a4b-it - no home server involved
Cloudflare KV for caching results
D1 for storing analysis history
Analytics Engine for metrics
AI Gateway in front of every model call
17 prompts, validated server-side
Full TypeScript throughout

Same idea, different foundation.

What happens when you press Analyze

The front end sends a POST to /api/analyze with two fields: the document text and the chosen prompt. Everything interesting happens in the Worker from there.

Step 1: Prompt validation

The prompt is checked against a server-side allow-list. The prompts are stored as a JSON string in a Worker environment variable and parsed at request time. If the prompt string does not match one of the known prompts exactly, the request is rejected with a 400. This is a simple guard against someone crafting a POST with an arbitrary instruction.
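A minimal sketch of that allow-list check. It assumes the environment variable holds a JSON object mapping a label to the exact prompt text; the real shape may differ, and the names validatePrompt and PROMPTS_JSON and the two example prompts are hypothetical.

```typescript
// Assumed shape of the env var: a JSON object mapping a label to the exact prompt text.
type PromptAllowList = Record<string, string>;

type ValidationResult =
  | { ok: true; label: string }
  | { ok: false; status: 400; error: string };

function validatePrompt(rawAllowList: string, prompt: unknown): ValidationResult {
  // Parsed on every request, as described above.
  const allowList = JSON.parse(rawAllowList) as PromptAllowList;
  if (typeof prompt !== "string") {
    return { ok: false, status: 400, error: "prompt must be a string" };
  }
  // Exact string comparison: no trimming, no case folding, no partial matches.
  for (const [label, text] of Object.entries(allowList)) {
    if (text === prompt) return { ok: true, label };
  }
  return { ok: false, status: 400, error: "unknown prompt" };
}

// Usage with a hypothetical two-entry allow-list.
const PROMPTS_JSON = JSON.stringify({
  Summary: "Summarize this document.",
  Sentiment: "Describe the overall sentiment of this document.",
});
console.log(validatePrompt(PROMPTS_JSON, "Summarize this document."));
console.log(validatePrompt(PROMPTS_JSON, "Ignore all previous instructions."));
```

Returning the matched label alongside the verdict is a convenience for later steps that only need the short name, such as metrics.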

Step 2: Cache lookup

A SHA-256 hash is computed over the document text combined with the prompt, and that hash becomes the KV cache key. If the key is already in KV, the cached result is returned immediately as a Server-Sent Events stream. The model is never called, so the response takes about as long as a KV read.
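The hit path might look roughly like this. KvReader is a cut-down stand-in for the KV binding, and sending the whole cached answer as a single { chunk } event is an assumption; the function names are hypothetical.

```typescript
// Minimal stand-in for the part of the KV binding used here (get only).
interface KvReader {
  get(key: string): Promise<string | null>;
}

// Frame one JSON payload as a Server-Sent Events message.
function sseMessage(payload: Record<string, string>): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Returns an SSE body on a hit, or null on a miss so the caller can go on to the model.
async function cachedSseBody(kv: KvReader, cacheKey: string): Promise<string | null> {
  const cached = await kv.get(cacheKey);
  if (cached === null) return null;
  // Assumption: the whole cached answer is replayed as one { chunk } event.
  return sseMessage({ chunk: cached });
}

// In-memory fake so the sketch runs outside a Worker.
const store = new Map<string, string>([["abc123", "A short summary."]]);
const fakeKv: KvReader = { get: async (key) => store.get(key) ?? null };

cachedSseBody(fakeKv, "abc123").then((body) => console.log(JSON.stringify(body)));
cachedSseBody(fakeKv, "missing").then((body) => console.log(body)); // null
```

In a Worker the returned string would be wrapped in a Response with a text/event-stream content type, so the front end can treat hits and misses the same way.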

Step 3: AI call

On a cache miss, the Worker calls Workers AI via AI Gateway. The model is @cf/google/gemma-4-26b-a4b-it, Google’s Gemma 4 with a 128k token context window. That headroom matters for longer documents — the smaller Llama 3.1 8B models top out at 32k tokens, which is not enough once you factor in the system preamble, the document, and the expected output. The request includes a system preamble that constrains the model to the document content only - no external assumptions, no padding, shortest answer that addresses the prompt.
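A sketch of the model call under a few assumptions: AiBinding is a cut-down stand-in for the Workers AI binding, the preamble wording is invented to match the constraints described above, and the gateway ID and the way the prompt and document are joined are placeholders.

```typescript
// Minimal stand-in for the Workers AI binding; the real binding's types are richer.
interface AiBinding {
  run(model: string, inputs: unknown, options?: unknown): Promise<unknown>;
}

const MODEL = "@cf/google/gemma-4-26b-a4b-it";

// Hypothetical wording; only the constraints themselves come from the description above.
const SYSTEM_PREAMBLE =
  "Answer using only the document provided. Make no external assumptions, " +
  "add no padding, and give the shortest answer that addresses the prompt.";

function buildMessages(documentText: string, prompt: string) {
  return [
    { role: "system", content: SYSTEM_PREAMBLE },
    // Assumed layout: prompt first, then the document after a separator.
    { role: "user", content: `${prompt}\n\n---\n\n${documentText}` },
  ];
}

// gatewayId is a placeholder for whatever the AI Gateway is called in the account.
function callModel(ai: AiBinding, gatewayId: string, documentText: string, prompt: string) {
  return ai.run(
    MODEL,
    { messages: buildMessages(documentText, prompt), stream: true },
    { gateway: { id: gatewayId } }, // third argument routes the call through AI Gateway
  );
}
```

With stream set to true the binding hands back a readable stream of SSE data rather than a finished response, which is what the later streaming step relies on.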

Gemma 4 is also a reasoning model, so the streamed API response carries two distinct kinds of tokens: delta.reasoning for the model’s internal chain-of-thought, and delta.content for the actual answer. The Worker forwards both to the browser as separate SSE event types ({ reasoning } and { chunk }), which the front end handles independently.
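A sketch of how one upstream line could be mapped to those two event types. It assumes OpenAI-style chunks where the delta sits at choices[0].delta and the stream ends with a [DONE] marker; the function name is hypothetical.

```typescript
type BrowserEvent = { reasoning: string } | { chunk: string };

// Assumed upstream shape: `data: {"choices":[{"delta":{...}}]}`.
interface UpstreamChunk {
  choices?: Array<{ delta?: { reasoning?: string; content?: string } }>;
}

function toBrowserEvents(line: string): BrowserEvent[] {
  if (!line.startsWith("data:")) return []; // blank lines, comments, other SSE fields
  const payload = line.slice("data:".length).trim();
  if (payload === "" || payload === "[DONE]") return [];

  let parsed: UpstreamChunk;
  try {
    parsed = JSON.parse(payload) as UpstreamChunk;
  } catch {
    return []; // skip a malformed line rather than break the stream
  }

  const delta = parsed.choices?.[0]?.delta;
  const events: BrowserEvent[] = [];
  if (delta?.reasoning) events.push({ reasoning: delta.reasoning });
  if (delta?.content) events.push({ chunk: delta.content });
  return events;
}

console.log(toBrowserEvents('data: {"choices":[{"delta":{"reasoning":"Count the sections."}}]}'));
console.log(toBrowserEvents('data: {"choices":[{"delta":{"content":"The report"}}]}'));
```

Returning an array keeps the function correct if a single delta ever carries both fields at once.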

Step 4: True streaming via TransformStream

The Worker pipes the AI stream through a TransformStream that parses each SSE line as it arrives and forwards the appropriate event to the browser immediately. The response is never buffered in full, so the first token reaches the browser before the model has finished generating the last one.
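A simplified sketch of such a transform. The lineToText hook stands in for the per-line parsing, onComplete stands in for whatever runs when the stream ends, and both names are hypothetical. The only state carried between chunks is the tail of a line that was split across two network reads.

```typescript
// Maps one upstream line to answer text, or null to skip it (hypothetical hook).
type LineToText = (line: string) => string | null;

function createStreamingTransform(
  lineToText: LineToText,
  onComplete: (fullResult: string) => void,
): TransformStream<Uint8Array, Uint8Array> {
  const decoder = new TextDecoder();
  const encoder = new TextEncoder();
  let partialLine = ""; // tail of a line split across two network chunks
  let fullResult = ""; // accumulated answer, handed to onComplete at the end

  // Convert one complete line into bytes for the browser, or null to emit nothing.
  const encodeLine = (line: string): Uint8Array | null => {
    const text = lineToText(line);
    if (text === null) return null;
    fullResult += text;
    return encoder.encode(`data: ${JSON.stringify({ chunk: text })}\n\n`);
  };

  return new TransformStream<Uint8Array, Uint8Array>({
    transform(chunk, controller) {
      partialLine += decoder.decode(chunk, { stream: true });
      const lines = partialLine.split("\n");
      partialLine = lines.pop() ?? "";
      for (const line of lines) {
        const out = encodeLine(line);
        if (out) controller.enqueue(out); // forwarded immediately, not held back
      }
    },
    flush(controller) {
      partialLine += decoder.decode();
      if (partialLine !== "") {
        const out = encodeLine(partialLine);
        if (out) controller.enqueue(out);
      }
      onComplete(fullResult); // fires once, after the last upstream chunk
    },
  });
}
```

In a Worker the upstream body would go through pipeThrough with this transform and the readable side would become the Response body, so nothing waits on the full answer.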

The transform’s flush() method fires when the stream ends. That is when fullResult is resolved and the side effects kick off via ctx.waitUntil:

The full result is written to KV with a 24-hour TTL
A row is inserted into D1 with the prompt, a hash of the document, the first 200 characters of the document, and the first 500 characters of the result
A data point is written to Analytics Engine

The browser is already displaying the response by the time any of these writes happen.
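The KV and D1 writes could be sketched like this. KvWriter and D1Like are cut-down stand-ins for the real bindings, the table and column names other than doc_hash are hypothetical, and using one hash for both the cache key and doc_hash is an assumption.

```typescript
// Minimal stand-ins for the KV and D1 bindings, limited to the calls used here.
interface KvWriter {
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}
interface D1Like {
  prepare(sql: string): { bind(...values: unknown[]): { run(): Promise<unknown> } };
}

const CACHE_TTL_SECONDS = 24 * 60 * 60; // the 24-hour TTL, in seconds

// Table and column names other than doc_hash are hypothetical.
const INSERT_SQL =
  "INSERT INTO analyses (prompt, doc_hash, doc_snippet, result_snippet) VALUES (?, ?, ?, ?)";

async function persistResult(
  kv: KvWriter,
  db: D1Like,
  hash: string,
  prompt: string,
  documentText: string,
  fullResult: string,
): Promise<void> {
  await Promise.all([
    // Full result goes to the cache and expires on its own.
    kv.put(hash, fullResult, { expirationTtl: CACHE_TTL_SECONDS }),
    // History row keeps only truncated snippets: 200 and 500 characters.
    db
      .prepare(INSERT_SQL)
      .bind(prompt, hash, documentText.slice(0, 200), fullResult.slice(0, 500))
      .run(),
  ]);
}
```

Passing the returned promise to ctx.waitUntil lets the Worker finish these writes after the response stream has closed, which is why they add no latency for the reader.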

Showing the model’s thinking

Reasoning models think before they answer. With Gemma 4 that reasoning is already in the stream, but the tool was not surfacing it anywhere. Showing it felt like a straightforward improvement, as long as it did not get in the way of the actual result.

The implementation has two parts. On the backend, the TransformStream already handles delta.reasoning and delta.content separately. Both are forwarded to the browser, just tagged differently. On the frontend, a <details> element sits above the result area. When the first { reasoning } event arrives, the element becomes visible and the text streams into it. Clicking it expands the element to show the full chain-of-thought. When the answer finishes, the label changes from “Thinking…” to “Show Thinking” so it is obvious that it is now a static record rather than a live feed.
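The front-end behavior can be sketched as a pure state update rather than DOM code, which keeps it runnable anywhere. The ThinkingPanel shape and the { end } event are hypothetical stand-ins for the <details> element and for the stream closing.

```typescript
interface ThinkingPanel {
  visible: boolean; // whether the <details> element is shown at all
  label: string; // text of its <summary>
  reasoning: string; // streamed chain-of-thought
  answer: string; // streamed result text
}

// { end } is a hypothetical signal for the stream closing.
type StreamEvent = { reasoning: string } | { chunk: string } | { end: true };

const initialPanel: ThinkingPanel = {
  visible: false,
  label: "Thinking…",
  reasoning: "",
  answer: "",
};

function applyEvent(state: ThinkingPanel, event: StreamEvent): ThinkingPanel {
  if ("reasoning" in event) {
    // The first reasoning event reveals the panel; later ones just append.
    return { ...state, visible: true, reasoning: state.reasoning + event.reasoning };
  }
  if ("chunk" in event) {
    return { ...state, answer: state.answer + event.chunk };
  }
  // Stream finished: the panel is now a static record.
  return { ...state, label: "Show Thinking" };
}
```

Because a cache hit sends no reasoning events, the panel never becomes visible in that case and the answer appears on its own.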

Screenshot: the document analyzer result area with a collapsed “Show Thinking” disclosure element above it. The label reads “Show Thinking” in teal, with a right-pointing arrow and a faint “click to see model reasoning” hint on the right.

Expand it and you can watch the model work through the document — checking word counts, noting structure, deciding what is worth including. It is more interesting than a blinking cursor.

Screenshot: the “Show Thinking” disclosure element expanded, revealing the model’s internal reasoning as monospace text. The chain-of-thought shows the model working through the document structure before producing its answer.

The reasoning tokens are not cached, so they only appear on a fresh request. A cache hit returns just the final answer.

Why hash document and prompt together

The cache key is sha256(documentText + prompt). Not just the document. Not just the prompt.

The same document analyzed with “Summary” and “Sentiment” should produce different cached results. Hashing both together means each unique combination gets its own cache entry. The 24-hour TTL is a reasonable balance - documents do not change, but I did not want stale entries accumulating indefinitely.
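A minimal sketch of the key derivation using the Web Crypto API, which is available in Workers. Hex encoding of the digest is an assumption, and the short prompt strings in the usage lines are placeholders.

```typescript
async function cacheKey(documentText: string, prompt: string): Promise<string> {
  // Same formula as above: sha256(documentText + prompt).
  const bytes = new TextEncoder().encode(documentText + prompt);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  // Hex-encode the 32-byte digest so it can be used as a KV key.
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, "0")).join("");
}

// Same document, two prompts: two different cache entries.
const doc = "Quarterly revenue rose 4%.";
Promise.all([cacheKey(doc, "Summary"), cacheKey(doc, "Sentiment")]).then(([a, b]) => {
  console.log(a);
  console.log(b);
  console.log(a !== b); // true
});
```

Plain concatenation means the boundary between document and prompt is not part of the hash. With a fixed allow-list of prompts that is unlikely to matter in practice, though a separator between the two parts would remove the ambiguity entirely.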

The other benefit: the hash is what goes into D1 as doc_hash. The full document is never stored in the database, only the hash and a short snippet for reference.

What Analytics Engine actually tells you

Each request writes a data point with:

Blob: the prompt label (e.g. “Summary”, “Key points”)
Doubles: cache hit as 0 or 1, response time in milliseconds, document length in characters

This lets me query things like: which prompts are used most, what percentage of requests are cache hits, and whether certain prompts are slower than others. Nothing complex, but enough to understand how the tool is actually being used without guessing.
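Writing that data point might look like the following. AnalyticsDataset is a cut-down stand-in for the Analytics Engine binding, and the order of the doubles simply follows the list above, which is an assumption about the real layout.

```typescript
// Minimal stand-in for the Analytics Engine dataset binding.
interface AnalyticsDataset {
  writeDataPoint(point: { blobs?: string[]; doubles?: number[]; indexes?: string[] }): void;
}

function recordAnalysis(
  dataset: AnalyticsDataset,
  promptLabel: string,
  cacheHit: boolean,
  responseMs: number,
  documentLength: number,
): void {
  dataset.writeDataPoint({
    blobs: [promptLabel],
    // Position matters: these are queried later as double1, double2, double3.
    doubles: [cacheHit ? 1 : 0, responseMs, documentLength],
  });
}

// Usage with a fake dataset that just logs what it receives.
const loggingDataset: AnalyticsDataset = {
  writeDataPoint: (point) => console.log(JSON.stringify(point)),
};
recordAnalysis(loggingDataset, "Summary", true, 42, 1800);
// prints {"blobs":["Summary"],"doubles":[1,42,1800]}
```

Storing the cache hit as 0 or 1 means the hit rate is just the average of that column, with no extra bookkeeping.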

Same proxy pattern on Astra Docs Chat

When I built Astra Docs Chat, I applied the same rule: the browser never calls the upstream AI service directly. A Cloudflare Pages Function at /api/astra-chat proxies to a private Langflow instance instead. See Proxying Langflow from Cloudflare Pages Functions for that write-up.

Try it

Open jamieede.com/analyzer, paste in any document, pick a prompt, and see what comes back.

The version described here is live. If you are curious about a specific part of the implementation, or have a prompt type you think is missing, I would like to hear about it. Reach out on LinkedIn.