A vendor sends you their MSA as a PDF. You drop it into your comparison tool. The tool returns "we couldn't extract text from this document". You look at the file. It opens normally in your PDF reader. The text is right there. What gives?
The two kinds of PDF most procurement teams don't know exist
Every PDF you'll ever receive belongs to one of two species:
- Native-text PDFs. Generated digitally — Word "Save as PDF", Google Docs export, Pages, LaTeX. The text is selectable, searchable, and machine-readable. About 60–70 % of vendor proposals you receive.
- Image-based PDFs (scans). Generated by printing a document and scanning it — or by image-based exports from some Word-to-PDF tools. Visually identical to native-text, but the text is pixels, not characters. Selecting "text" actually copies an image. About 30–40 % of vendor proposals.
The classic test: open the PDF, click in the middle of a paragraph, and try to select a sentence. If the selection follows words and punctuation, it's native text. If the selection draws a rectangle over the page image, it's a scan.
Why scanned PDFs break most AI comparison tools
The standard text-extraction libraries (PDF.js, pdfminer, PyPDF2) read the text layer of a PDF. If there is no text layer, because every page is just an image, they return an empty string. Most AI vendor-comparison tools take that empty string at face value, feed it to the LLM, and produce a useless analysis ("Document 2 contains no extractable content"). The user is stuck.
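One way to avoid that failure mode is to stop treating "no text" as "empty document". The sketch below is a minimal illustration, not any particular tool's code: the `Extraction` type and `classifyExtraction` function are hypothetical names, and the input is assumed to be one string per page as returned by whatever extraction library is in use.

```typescript
// Hypothetical result type: either usable text, or an explicit
// "this PDF has no text layer" signal that the caller must handle.
type Extraction =
  | { kind: "text"; text: string }
  | { kind: "no_text_layer"; pageCount: number };

// Assumption: pageTexts holds one extracted string per page.
function classifyExtraction(pageTexts: string[]): Extraction {
  const text = pageTexts
    .map((page) => page.trim())
    .filter((page) => page.length > 0)
    .join("\n\n");

  // An empty result is reported as a distinct case instead of being
  // passed downstream as if it were the document's real content.
  return text.length > 0
    ? { kind: "text", text }
    : { kind: "no_text_layer", pageCount: pageTexts.length };
}
```

With a distinct `no_text_layer` case, the caller has to decide what to do (run OCR, or tell the user) before anything reaches the LLM.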
The fix is Optical Character Recognition (OCR): run each page through a vision model that reads the pixels and outputs the text it sees. OCR is a 60-year-old technology and modern open-source engines (Tesseract, EasyOCR, PaddleOCR) hit 95 %+ accuracy on cleanly-scanned business documents in English.
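An accuracy figure like "95 %+" is easier to reason about with a concrete metric. The sketch below assumes the figure means character-level accuracy, which is one common way to score OCR but is an assumption here. It computes accuracy as one minus the edit distance between a reference transcript and the OCR output, divided by the reference length. The function names are illustrative.

```typescript
// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions, and substitutions that turn a into b.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy relative to a trusted reference transcript.
function charAccuracy(reference: string, ocrOutput: string): number {
  if (reference.length === 0) return ocrOutput.length === 0 ? 1 : 0;
  const errors = editDistance(reference, ocrOutput);
  return Math.max(0, 1 - errors / reference.length);
}

// A single misread character (zero read as capital O) in an 11-character string.
const sampleAccuracy = charAccuracy("Net 30 days", "Net 3O days");
```

By this measure, 95 % accuracy on a 3,000-character page still allows roughly 150 wrong characters, so numbers such as prices and notice periods are worth spot-checking against the scan.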
How POCsheet handles it
When you upload a PDF, POCsheet now:
- Tries native text extraction first (cheapest, instant).
- If the extracted text averages fewer than ~30 characters per page and no single page exceeds 80 characters, treats the PDF as a scan.
- Falls back to Tesseract.js, running OCR over every page client-side, in your browser. PDFs never leave your machine.
- Shows a live progress indicator: "Scanning SLA_v2.pdf via OCR — page 6 / 14…".
- Feeds the OCR'd text into the standard AI pipeline. The resulting report works exactly like any other comparison, including source citations.
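The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not POCsheet's actual code. The two thresholds come from the list above; every name is hypothetical. The native extractor and the OCR step are passed in as functions, so the sketch does not depend on any specific library (the OCR function could, for example, wrap Tesseract.js).

```typescript
// Thresholds from the description above; constant names are hypothetical.
const AVG_CHARS_PER_PAGE_THRESHOLD = 30;
const MAX_CHARS_ON_ANY_PAGE_THRESHOLD = 80;

// Returns true when the natively extracted text is sparse enough
// to treat the PDF as a scan.
function looksLikeScan(pageTexts: string[]): boolean {
  // Assumption: if nothing was extracted at all, OCR is needed.
  if (pageTexts.length === 0) return true;

  const lengths = pageTexts.map((text) => text.trim().length);
  const total = lengths.reduce((sum, n) => sum + n, 0);
  const average = total / lengths.length;
  const longest = Math.max(...lengths);

  return (
    average < AVG_CHARS_PER_PAGE_THRESHOLD &&
    longest <= MAX_CHARS_ON_ANY_PAGE_THRESHOLD
  );
}

// Hypothetical stand-ins for the real extraction and OCR functions.
type ProgressFn = (page: number, totalPages: number) => void;
type NativeExtractor = (pdf: Uint8Array) => Promise<string[]>;
type PageOcr = (pdf: Uint8Array, onProgress: ProgressFn) => Promise<string[]>;

// Native extraction first; OCR only when the heuristic says "scan".
async function extractPages(
  pdf: Uint8Array,
  extractNative: NativeExtractor,
  ocrPages: PageOcr,
  onProgress: ProgressFn,
): Promise<{ pages: string[]; usedOcr: boolean }> {
  const nativePages = await extractNative(pdf);
  if (!looksLikeScan(nativePages)) {
    return { pages: nativePages, usedOcr: false };
  }
  const ocrResult = await ocrPages(pdf, onProgress);
  return { pages: ocrResult, usedOcr: true };
}
```

The second condition matters for mixed documents: a mostly scanned PDF with one page of real text over the 80-character limit is not classified as a scan, even if its per-page average is low.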
OCR is slower than native extraction — about 3–6 seconds per page at the 1.5× scale we use. A 12-page scanned MSA takes ~45 seconds. The UI tells the user this is happening, with progress, so the wait is honest. For native-text PDFs (the majority), nothing changes: extraction is still instant.
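The arithmetic behind that estimate, as a small sketch. The 3–6 second range comes from the paragraph above; the function and constant names are hypothetical, and real timings will vary with the device, page density, and render scale.

```typescript
// Per-page OCR time range quoted above, in seconds.
const SECONDS_PER_PAGE_MIN = 3;
const SECONDS_PER_PAGE_MAX = 6;

// Rough lower and upper bounds on total OCR time for a scanned PDF.
function estimateOcrSeconds(pageCount: number): { min: number; max: number } {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  return {
    min: pageCount * SECONDS_PER_PAGE_MIN,
    max: pageCount * SECONDS_PER_PAGE_MAX,
  };
}

// The 12-page MSA from the example above.
const msaEstimate = estimateOcrSeconds(12);
```

For 12 pages this gives a window of 36 to 72 seconds. The ~45 seconds quoted above works out to about 3.75 seconds per page, toward the fast end of the range.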
What scanned PDFs typically mean about the deal
Worth saying: a vendor who sends you a scanned proposal in 2026 is doing one of three things:
- Their template lives in a doc-management system that exports pages as images (older Conga / DocuSign workflows).
- Their legal team prints, signs and rescans to create a "final" version with wet-ink signatures.
- They're flattening the document so it can't be edited or commented on.
None of these is a red flag on its own. The third one is worth a quick check — are they trying to discourage redlines? In any case, OCR + AI comparison means you no longer have to manually re-type the contract into a Word doc to negotiate it.
Limits to be honest about
- OCR accuracy drops sharply on poor-quality scans (skewed, low-DPI, faded). For these, manual re-OCR with a desktop tool may be better.
- Non-English documents OCR'd with English-only models will produce nonsense. POCsheet uses English by default; multi-language OCR is on the roadmap.
- Tables in scanned PDFs come back as flat text, not structured rows. The LLM usually reconstructs the structure but it's not perfect.