OCR (Optical Character Recognition) converts a scanned PDF — or any PDF whose content is an image rather than text — into a document where the text can be selected, copied, and searched. A scanned contract you can't copy text from, a photographed receipt, a faxed form: these are the documents OCR is built for.
Digital PDFs vs scanned PDFs
A digital PDF is created directly by software: Word, Excel, Adobe InDesign, a web browser's print function. The text is stored as text in the PDF file — you can click and drag to select it, search it with Ctrl+F, and copy it without any processing.
A scanned PDF is a photograph of a page: each page is a raster image (JPEG or PNG) embedded in a PDF wrapper. There is no text layer. To search or copy the text, OCR must be run to read the image and produce a text representation.
The distinction matters because OCR is only needed for scanned PDFs. Running OCR on a digital PDF is usually harmless but wasteful: many tools detect that a text layer already exists and pass the file through unchanged.
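One way to decide whether a PDF needs OCR is to look at how much text can be extracted from each page. The sketch below assumes the per-page text has already been pulled out with a PDF library of your choice (for example, pypdf's `extract_text()`); the function name and the 20-character threshold are illustrative assumptions, not part of any particular tool.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indexes of pages that appear to have no usable text layer.

    page_texts: one extracted-text string (or None) per page.
    min_chars:  assumed threshold; pages with fewer visible characters
                are treated as image-only.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        # Count non-whitespace characters only, so a page of blank lines
        # is not mistaken for real text.
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        if visible < min_chars:
            flagged.append(index)
    return flagged


# Page 0 is digital, page 1 is an image-only scan.
sample = ["Invoice 2024-001\nTotal due: 1,250.00 EUR", ""]
print(pages_needing_ocr(sample))  # [1]
```

A mixed document, such as a digital report with a scanned appendix, would return only the indexes of the scanned pages, so OCR can be limited to those.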
How OCR accuracy is determined
OCR accuracy depends on input image quality more than anything else. The factors that matter most:
Resolution: scanned pages at 300 DPI produce high-accuracy OCR. Pages scanned at 72 DPI (typical for screen captures) produce unreliable results. If the characters look blurry or pixelated when you zoom in on the original, OCR accuracy will be poor.
Scan angle: pages that were placed at an angle on the scanner (visible as text that runs slightly diagonally) reduce accuracy significantly. Pre-processing that deskews the image before OCR helps.
Font and layout: standard typefaces (Times, Helvetica, Arial) OCR with near-perfect accuracy on clean scans. Handwriting does not OCR well with standard models — specialized handwriting recognition is a different technology.
Background: pages with dark, stained, or patterned backgrounds confuse OCR engines. A clean white background produces the best results.
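To make the resolution factor concrete: the effective DPI of a scanned page is its pixel size divided by its physical size in inches, and PDF page dimensions are expressed in points at 72 points per inch by default. The following is a minimal sketch; the function name is hypothetical, and the pixel and page dimensions are assumed to come from whatever PDF library you use.

```python
POINTS_PER_INCH = 72  # default PDF unit: 1 point = 1/72 inch


def effective_dpi(pixel_width, pixel_height, page_width_pt, page_height_pt):
    """Return the lower of the horizontal and vertical DPI of a page image."""
    dpi_x = pixel_width / (page_width_pt / POINTS_PER_INCH)
    dpi_y = pixel_height / (page_height_pt / POINTS_PER_INCH)
    # The weaker axis limits how legible the characters are.
    return min(dpi_x, dpi_y)


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(effective_dpi(2550, 3300, 612, 792))  # 300.0
print(effective_dpi(612, 792, 612, 792))    # 72.0
```

A page that comes out well below 300 is a candidate for rescanning before OCR rather than after.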
When OCR output needs review
OCR is never 100% accurate. Common errors include: '0' confused with 'O'; '1', 'l', and 'I' confused with one another; punctuation dropped or misread; line breaks inserted mid-word; and tables reformatted as prose. For a legal document, a medical record, or any text where precision matters, review the extracted text against the original before using it.
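Digit and letter confusions are easiest to spot in values that should be numeric. As a rough aid to manual review, the sketch below flags tokens that look like numbers but contain a letter commonly misread as a digit. The set of confusable letters and the function name are assumptions made for illustration; it will miss many errors and is not a substitute for checking against the original.

```python
import re

# Letters commonly confused with digits by OCR engines (O/0, l/1, I/1).
CONFUSABLE_LETTERS = set("OoIl")
NUMBER_PUNCTUATION = ".,/-"


def suspicious_tokens(text):
    """Return number-like tokens that mix digits with confusable letters."""
    flagged = []
    for raw in re.findall(r"[A-Za-z0-9][A-Za-z0-9.,/-]*", text):
        token = raw.rstrip(NUMBER_PUNCTUATION)  # drop sentence punctuation
        has_digit = any(ch.isdigit() for ch in token)
        has_confusable = any(ch in CONFUSABLE_LETTERS for ch in token)
        number_like = all(
            ch.isdigit() or ch in CONFUSABLE_LETTERS or ch in NUMBER_PUNCTUATION
            for ch in token
        )
        if has_digit and has_confusable and number_like:
            flagged.append(token)
    return flagged


print(suspicious_tokens("Total: 1O5.0O EUR, ref l2/2024, paid in full"))
# ['1O5.0O', 'l2/2024']
```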
For long documents, spot-check before relying on the output: pick a few paragraphs whose content you know, compare the OCR text against the original by hand, and use the error rate you find to judge whether the rest can be trusted.
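A sample check like this can be put into numbers with a character error rate: the edit distance between the known text and the OCR output, divided by the length of the known text. The sketch below uses a plain Levenshtein distance; the function names are illustrative, and what counts as an acceptable rate depends on the document.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance relative to the length of the known-correct text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


reference = "Invoice total: 100"
ocr_output = "lnvoice total: 1OO"  # I read as l, both zeros read as O
print(character_error_rate(reference, ocr_output))  # 3 errors in 18 characters
```

Normalizing whitespace in both strings first keeps mid-word line breaks from dominating the score when the goal is to measure character recognition alone.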
Using Filum's OCR PDF tool
Upload the PDF. Filum runs Tesseract OCR (a widely used open-source engine whose development Google sponsored for many years) on each page and returns a searchable PDF with an embedded text layer. The original page images are preserved: the PDF looks identical to the input, but now supports text selection and Ctrl+F search.
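Filum's internal pipeline is not described here, but for readers curious what a Tesseract run looks like, the following is a minimal local sketch. It assumes the `tesseract` command-line program is installed and that the page has already been rendered to an image file; the function names and file names are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the command that writes a searchable PDF to <output_base>.pdf."""
    # The trailing "pdf" selects Tesseract's PDF output, which keeps the
    # page image and adds an invisible text layer on top of it.
    return ["tesseract", str(image_path), str(output_base), "-l", language, "pdf"]


def ocr_page_to_pdf(image_path, output_base, language="eng"):
    """Run Tesseract on one page image and return the output PDF path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, language),
        check=True,  # raise if Tesseract exits with an error
    )
    return f"{output_base}.pdf"


# Example (requires Tesseract and a real image file):
# ocr_page_to_pdf("page-1.png", "page-1-searchable")
```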
The file is processed on a secure server. Your document is sent over an encrypted connection, processed, and immediately deleted — it is never stored permanently.