A torque figure with a missing digit is not a typo. It is a mechanic over-tightening a brake bolt. When you turn a scanned service manual into a RAG chatbot, OCR fidelity stops being a quality nicety and becomes a safety property.
The chatbot in question reads a 1994 Yamaha XV250 Virago service manual: 291 pages, scanned to an image-only PDF, no text layer at all. It is live here: virago.edestudio.us. Ask it about valve clearances or jet sizes and it answers from the OCR’d corpus, reading the retrieved chunk text verbatim. Whatever the OCR got wrong is what the rider is told.
Why the numbers are the product
A spec manual is mostly numbers: torque settings, valve clearances, fluid capacities, jet sizes, tyre pressures. An OCR error in prose is annoying. An OCR error in a figure is dangerous, and it poisons RAG twice over:
- Retrieval. Garbled text embeds into the wrong neighbourhood, so the right chunk never gets retrieved.
- Generation. The model reads the retrieved chunk verbatim. There is no second source to cross-check.
So “good enough to read” is not the bar. The bar is “faithful enough to act on.” That is the thing I wanted to measure, not guess at.
That page is the whole problem in one image. It is a picture. Every digit on it has to be recovered by an OCR engine before the chatbot can read it back to you, and the engine choice is where the fidelity is won or lost.
Four local pipelines
The manual arrives as a pure image scan, so something has to read the pixels. These are local pipelines on purpose: no document leaves the machine, no hosted API, no per-page cost. I had four ways to turn the scan into text, and they are easy to confuse, so here they are plainly:
| Pipeline | What it produces |
|---|---|
| Docling + EasyOCR | Markdown. Docling runs its own EasyOCR plus table-structure. This was the corpus first indexed and live on the site. |
| Tesseract PDF | A searchable PDF: the page image with an invisible Tesseract text layer overlaid (--oem 1 --psm 1). |
| Docling + Tesseract layer | Markdown, but with Docling’s OCR turned off, reusing the Tesseract text layer from the PDF instead of EasyOCR. |
| Docling + RapidOCR | Markdown. Docling’s layout and table model with RapidOCR reading the pixels. This is what ships now. |
Two of these are about the same idea from different angles. “Docling + Tesseract layer” came out of asking: what if Docling does the layout and table work but reads text Tesseract already put in the PDF, instead of re-running OCR with EasyOCR? That recovers a lot, but it depends on a Tesseract pass having been run first. “Docling + RapidOCR” asks the more direct question: keep Docling doing genuine OCR, but swap the weak EasyOCR engine for a stronger one. No pre-baked text layer required — which matters, because the next manuals in the queue (a Honda VFR750F, a Honda Monkey) are the same kind of image-only scan, and a pipeline that depends on a separate Tesseract step is a pipeline with an extra thing to forget.
The OCR-engine swap is one option object:
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions, RapidOcrOptions
# live (worst): Docling runs its own EasyOCR over the page images
ocr = EasyOcrOptions(force_full_page_ocr=True)
po = PdfPipelineOptions(do_ocr=True, ocr_options=ocr, do_table_structure=True)
# shipped: Docling runs RapidOCR over the same page images
ocr = RapidOcrOptions(backend="torch")
po = PdfPipelineOptions(do_ocr=True, ocr_options=ocr, do_table_structure=True)
# the no-OCR variant: reuse the Tesseract layer already in the PDF
po = PdfPipelineOptions(do_ocr=False, do_table_structure=True)
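For completeness, here is a minimal sketch of how one of those option objects might be wired into a full conversion. The function name `convert_scan` and the engine labels are my own for illustration, not part of the harness. The Docling imports are done inside the function so the file still loads on a machine without Docling installed.

```python
def convert_scan(pdf_path: str, engine: str = "rapidocr") -> str:
    """Convert a scanned PDF to Markdown with the chosen OCR route (sketch)."""
    # Validate first, so a typo in the engine label fails fast.
    if engine not in ("easyocr", "rapidocr", "text-layer"):
        raise ValueError(f"unknown engine: {engine!r}")

    # Local imports: Docling is only needed when a conversion actually runs.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        RapidOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    if engine == "easyocr":
        po = PdfPipelineOptions(
            do_ocr=True,
            ocr_options=EasyOcrOptions(force_full_page_ocr=True),
            do_table_structure=True,
        )
    elif engine == "rapidocr":
        po = PdfPipelineOptions(
            do_ocr=True,
            ocr_options=RapidOcrOptions(backend="torch"),
            do_table_structure=True,
        )
    else:
        # "text-layer": no OCR, reuse whatever text layer the PDF carries.
        po = PdfPipelineOptions(do_ocr=False, do_table_structure=True)

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=po)}
    )
    return converter.convert(pdf_path).document.export_to_markdown()
```

Everything except the options object stays the same between routes, which is what makes the engine swap cheap to try.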
How I tested it
No vibes. I built a ground-truth set by reading the scanned spec, maintenance and torque pages by eye and transcribing 100 known values across 14 categories: model codes, dimensions, weights, capacities, engine specs, valve and cam internals, clearances, ratios, pressures, carburetor jets, suspension, brakes, electrical bulbs and torque settings.
The metric is deliberately strict: exact match. For each value, does its correct rendering, digits and unit together, appear in that corpus? Pass or fail. 249 cm³ is not “95 percent right” when it comes out as 249 cm'. It is wrong. Edit-distance scoring would have flattered every engine and hidden exactly the damage that matters.
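To make the point concrete, here is a small sketch comparing a similarity score with the strict test on the `249 cm³` example. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for edit-distance-style scoring; that choice is mine for illustration, not something the harness uses.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for edit-distance-style scoring."""
    return SequenceMatcher(None, a, b).ratio()


true_value = "249 cm³"
ocr_value = "249 cm'"

# Six of seven characters agree, so the similarity score looks healthy...
print(round(similarity(true_value, ocr_value), 2))  # 0.86

# ...while the strict test reports what a mechanic cares about: it is wrong.
print(true_value == ocr_value)  # False
```

A soft score rewards the engine for the six characters it got right and buries the one that changes the unit.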
Matching normalises whitespace only. It never changes a digit or a unit glyph, so a value that is wrong in a corpus stays wrong in the score.
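A matcher with that property can be very small. The sketch below is one way to write it, assuming plain substring search over the whitespace-normalised corpus; the function names are hypothetical and the real harness may differ.

```python
import re


def normalise_ws(text: str) -> str:
    """Collapse each run of whitespace to a single space; change nothing else."""
    return re.sub(r"\s+", " ", text).strip()


def value_present(value: str, corpus: str) -> bool:
    """Exact match: the value, digits and unit together, appears in the corpus."""
    return normalise_ws(value) in normalise_ws(corpus)


corpus = "Displacement:\n   249   cm³\nDry weight: 302 Ib"
print(value_present("249 cm³", corpus))  # True: only the spacing differed
print(value_present("302 lb", corpus))   # False: capital I is not lowercase l
```

Plain substring search can over-match (a short value can hide inside a longer number), so a production matcher would probably want boundary checks as well.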
Here is the whole experiment in one frame: 100 ground-truth values down the rows, grouped into their 14 categories, and the four pipelines across the columns. Green is an exact match, red is a value that came out wrong or never appeared. Read a column top to bottom and you are reading that pipeline’s entire report card.
The eye does the aggregation before any number is quoted: the EasyOCR column is visibly redder than its neighbours, and the bottom band (torque) is red across all four. The rest of this post is just putting numbers on what that picture already shows.
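The grid behind that picture is a table of booleans: one row per ground-truth value, one column per pipeline. Here is a rough sketch of how it could be built. The corpus strings are tiny made-up stand-ins, not the real corpora, and the helper names are hypothetical.

```python
def squash(text: str) -> str:
    """Whitespace-only normalisation, as in the strict metric."""
    return " ".join(text.split())


def report_card(ground_truth, corpora):
    """Return {pipeline: {value: True/False}} for every value and pipeline."""
    grid = {}
    for name, text in corpora.items():
        flat = squash(text)
        grid[name] = {value: squash(value) in flat for value in ground_truth}
    return grid


def fidelity_percent(column) -> float:
    """Share of exact matches in one pipeline's column, in percent."""
    return 100.0 * sum(column.values()) / len(column)


# Toy inputs for illustration only.
ground_truth = ["XV250U", "302 lb", "0.6 ~ 0.7 mm"]
corpora = {
    "easyocr": "Model XV2SOU  weight 302 Ib  clearance 0.6 0.7 mm",
    "rapidocr": "Model XV250U  weight 302 Ib  clearance 0.6 ~ 0.7 mm",
}

grid = report_card(ground_truth, corpora)
for name, column in grid.items():
    print(name, column, round(fidelity_percent(column), 1))
```

Each column is a report card, and the percentage is just the share of `True` cells in it.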
The result
| Pipeline | Exact-match fidelity |
|---|---|
| Docling + EasyOCR (live) | 61% |
| Tesseract PDF | 83% |
| Docling + Tesseract layer | 82% |
| Docling + RapidOCR | 85% |
The corpus that was first serving the chatbot is the worst of the four. Both fixes recover most of the gap, but Docling + RapidOCR is the most faithful overall at 85 percent, and it gets there while doing real OCR on the pixels rather than leaning on a separate Tesseract pass. That is the one that shipped.
The lift is not uniform. This is the per-category difference between RapidOCR and the live EasyOCR corpus, in percentage points. Every category recovers or holds; none go backwards.
The biggest wins are exactly where EasyOCR fell down: model codes (+80), and the dense maintenance tables — capacities and clearances (+60 each), pressures (+50), engine internal (+26). Even torque, the hardest category, moves up 20 points. Nothing regresses.
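The per-category lift is plain subtraction of two percentages. This sketch reproduces two of the figures from pass counts. The engine_internal counts (10 and 15 of 19) follow from the 52.6 and 78.9 percent figures in this post; the EasyOCR torque count of 3 of 10 is inferred from RapidOCR's 5 of 10 and the 20-point lift, so treat it as an assumption.

```python
def fidelity(passes: int, total: int) -> float:
    """Exact-match fidelity for one category, in percent."""
    return 100.0 * passes / total


def delta_points(baseline: dict, candidate: dict) -> dict:
    """Candidate minus baseline, per category, in rounded percentage points."""
    return {
        cat: round(fidelity(*candidate[cat]) - fidelity(*baseline[cat]))
        for cat in baseline
    }


# (passes, total) per category; see the lead-in for where these come from.
easyocr = {"engine_internal": (10, 19), "torque": (3, 10)}
rapidocr = {"engine_internal": (15, 19), "torque": (5, 10)}

deltas = delta_points(easyocr, rapidocr)
print(deltas)  # {'engine_internal': 26, 'torque': 20}

# "Nothing regresses" means no category has a negative delta.
regressions = [cat for cat, d in deltas.items() if d < 0]
print(regressions)  # []
```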
Where the damage hides
The overall number hides the interesting part. Broken down by category, the failures are not spread evenly. They cluster exactly where EasyOCR struggles: model codes, and dense multi-row maintenance tables.
The fastest way to see the structure is a heatmap. Rows are the 14 categories (with their value counts), columns are the four pipelines, and the colour is exact-match fidelity from red (0 percent) to green (100). The EasyOCR column is where the red and amber live; RapidOCR is the greenest column overall.
Two patterns jump out of the colour alone. The vertical amber-and-red stripe down the EasyOCR column is the EasyOCR tax. The horizontal red band across the torque row, present in every column, is a different problem entirely, one no OCR engine fully fixes. Hold that second pattern; it is the subject of the honesty section below. Here is the same data as a grouped bar chart, if you prefer reading heights to colours:
The 19-value engine_internal category is a clear signal: valve, cam, cylinder and rocker dimensions are 100 percent in the Tesseract routes, 52.6 percent in the live EasyOCR corpus, and 78.9 percent in RapidOCR. Those are tightly packed maintenance tables, and EasyOCR plus Docling cram and mangle them. This is the kind of page those 19 values come off, a valve and valve-guide specification table with IN and EX columns, limits, and four-decimal millimetre figures packed two and three to a row:
Every figure on that page is four significant digits sitting next to another four-digit figure, with the only thing telling 6.975 ~ 6.990 apart from the EX column beside it being its horizontal position. That is exactly the layout EasyOCR-into-Docling collapses, and exactly why engine_internal halves in the live corpus.
Here is what “EasyOCR damage” actually looks like at the character level, true value against what the live corpus produced:
| True value | EasyOCR (live) reads | Failure |
|---|---|---|
| XV250U | XV2SOU | digit to letter: 5 to S, 0 to O |
| 249 cm³ | 249 cm' | superscript dropped |
| 11 kg/cm² | 11 kg/cm? | superscript to question mark |
| 302 lb | 302 Ib | l to capital I |
| 65W/60W | 65W/6oW | 0 to lowercase o |
| 1st 2nd 3rd 4th | Ist 2nd 3rd Ath | ordinals mangled |
| 0.6 ~ 0.7 mm | 0.6 0.7 mm | range separator dropped |
The colour coding in that figure is the whole diagnostic. The blue failures are character-level: the engine read the right cell but mapped a glyph wrong, and a better OCR engine fixes them — which is precisely what RapidOCR does, recovering the model codes, the dropped tildes and the glyph swaps. The red one is layout-level: the digits are all correct but Docling’s table model scattered them, and a better OCR engine does much less for it.
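One way to triage a wrong value is to ask whether it becomes right once look-alike glyphs are folded together. The sketch below does that for the pairs listed above. The confusable map is a small hand-picked assumption, and the folding is for diagnosis only: it must never be used for scoring, where a wrong glyph stays wrong.

```python
# Look-alike glyphs folded onto one canonical character (illustrative list).
CONFUSABLE = str.maketrans(
    {"S": "5", "O": "0", "o": "0", "I": "1", "l": "1", "A": "4"}
)


def fold(text: str) -> str:
    """Map each confusable glyph to its canonical look-alike."""
    return text.translate(CONFUSABLE)


def classify(true_value: str, ocr_value: str) -> str:
    """Label a reading as exact, a one-for-one glyph swap, or something else."""
    if true_value == ocr_value:
        return "exact"
    if fold(true_value) == fold(ocr_value):
        return "glyph swap"
    return "other"  # dropped or replaced symbol, or a layout problem


pairs = [
    ("XV250U", "XV2SOU"),
    ("302 lb", "302 Ib"),
    ("65W/60W", "65W/6oW"),
    ("1st 2nd 3rd 4th", "Ist 2nd 3rd Ath"),
    ("0.6 ~ 0.7 mm", "0.6 0.7 mm"),
    ("249 cm³", "249 cm'"),
]
for true_value, ocr_value in pairs:
    print(f"{true_value!r:20} {ocr_value!r:20} {classify(true_value, ocr_value)}")
```

The first four pairs come out as glyph swaps; the dropped tilde and the lost superscript land in "other", because a character went missing or was replaced by an unrelated symbol rather than a look-alike.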
Honesty: RapidOCR is the best one, not a safe one
This is the part the headline number would let you skip. RapidOCR is clearly the most faithful of the four, but 85 percent is not 100. Two stubborn problems remain.
- A handful of categories sit at 50 percent across every engine. Weights, for instance: 302 lb and 304 lb come out as 302 Ib / 304 Ib no matter which engine reads them, because the failure is a lowercase-L-to-capital-I confusion that every pipeline shares. Superscript units (cm³, kg/cm²) are similar: fragile everywhere.
- Torque tables are only half-recovered. The usable unit of a torque spec is the triple (58 Nm, 5.8 m-kg, 42 ft-lb) tied to its bolt. Docling’s table model crams those columns, so the triple survives in only some rows. RapidOCR gets torque to 50 percent (5 of 10), well above the Tesseract routes’ near-zero, but the other half is still scattered, because the cramming is the table-structure model’s doing, not the OCR engine’s.
This is the page the torque values come from, and you can see why a table model struggles with it. Sixty-odd rows, each a part name and a thread size followed by the same value printed three ways (Nm, then m-kg, then ft-lb), with a “Remarks” column that is empty for most rows and the page itself split into two stacked sub-tables:
The information a mechanic needs (this bolt, this torque) lives in the horizontal adjacency of four cells. When the table model mis-maps a single column boundary, the triple that should read 58 / 5.8 / 42 next to “Front wheel axle” gets split across rows, and the value the chatbot retrieves is no longer tied to its bolt. That is a layout failure, not a reading failure, which is why even the best OCR engine only partly fixes it.
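That adjacency can be tested directly: a torque spec counts as intact only if one table row holds both the part name and all three values. The sketch below assumes Markdown pipe tables with one row per line. The rows are simplified stand-ins (thread-size and remarks columns left out), and the function name is hypothetical.

```python
def triple_survives(markdown: str, part: str, triple: tuple) -> bool:
    """True if a single table row carries the part name and all three values."""
    for line in markdown.splitlines():
        # Split the row into cells and compare whole cells, not substrings.
        cells = [cell.strip() for cell in line.split("|")]
        if part in cells and all(value in cells for value in triple):
            return True
    return False


triple = ("58", "5.8", "42")  # Nm, m-kg, ft-lb

intact = "| Front wheel axle | 58 | 5.8 | 42 |"
scattered = "| Front wheel axle | 58 |\n|  | 5.8 | 42 |"

print(triple_survives(intact, "Front wheel axle", triple))     # True
print(triple_survives(scattered, "Front wheel axle", triple))  # False
```

In the scattered case every digit is present and correct, yet the check fails, which is the difference between a reading failure and a layout failure.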
So the win from RapidOCR is specific and worth naming precisely: it fixes the character-level damage (model codes, dropped separators, glyph swaps) by reading the pixels better, and it lifts torque from the floor without needing a separate Tesseract pass. It does not fully fix the layout-level damage, because that is Docling’s table model, not the OCR engine. Knowing which failure belongs to which stage is the whole point of testing instead of assuming.
What this means for the live corpus
The takeaway is small and practical: the OCR engine you skip matters more than the one you run. The chatbot was serving the weakest corpus, and a 24-point lift was sitting one option object away. Re-OCRing with RapidOCR (then section-splitting the markdown and re-indexing the Cloudflare AI Search instance) lifts the corpus from 61 to 85 percent fidelity on the values that matter, and does it with a fully local pipeline that the next two manuals can reuse unchanged.
The torque tables still need a separate fix, probably a layout-aware pass or hand-correction of the tightening-torque pages, because no OCR swap will fully un-cram them. And until the superscript and torque problems are solved, the chatbot’s system prompt earns its keep: it is told to flag OCR-ambiguous, safety-critical values and advise checking against the PDF, which sits in a slide-over panel next to every answer.
The test itself is in the document-extract harness: a ground-truth JSON, a strict matcher, and a script that scores every corpus and draws these charts. When the corpus changes, the number moves, and I can see it rather than hope.
Go break it yourself: virago.edestudio.us. Ask for a torque setting and a model code, then check the answer against the scan. That gap, in one question, is what this whole post is about.