How To OCR And Merge PDF Documents Using Free Command Line Utilities On Windows
Do you need to make a PDF file searchable? Follow this quick guide to create a searchable version of your PDF file without any paid software.
The procedure described here could also work on Linux and Mac, since the same utilities are available on those platforms (the FORFILES loop in Step 2 is Windows-only and would need a shell equivalent). I have only tested this on Windows 10. All commands are run in the standard Windows 10 command prompt.
Tesseract is the primary utility used for OCR. However, Tesseract does not accept a PDF file as input, so we have to follow a roundabout process: convert the PDF to one PNG per page, run Tesseract on each PNG to produce a searchable PDF of that page, and finally merge the individual PDF pages into a single file.
Download the utilities
- PDFtoPNG - https://dl.xpdfreader.com/xpdf-tools-win-4.02.zip
- Tesseract - https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.1.0.20190314.exe
- QPDF - https://sourceforge.net/projects/qpdf/files/qpdf/10.0.1/qpdf-10.0.1-bin-msvc32.zip/download
Make sure the executables for all of these are in the same folder as the input PDF file. Alternatively, add their folders to your PATH environment variable and then open a new command prompt window so the change takes effect.
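Before starting, it can save time to confirm that the three tools are actually reachable. The following is a minimal sketch, assuming Python is installed and that the executables are named pdftopng, tesseract, and qpdf; the helper name find_missing_tools is made up for this example.

```python
import shutil

# Executable names assumed from the utilities listed above.
REQUIRED_TOOLS = ["pdftopng", "tesseract", "qpdf"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be located on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found: " + ", ".join(missing))
    else:
        print("All tools found.")
```

Run it from the folder that holds the input PDF. Any tool it reports as not found needs to be copied into that folder or added to PATH.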
Step 1: Convert PDF file to PNG
pdftopng input.pdf intermediate-file
Step 2: OCR the PNG image of each page to create a searchable PDF of that page
FORFILES /S /M *.png /C "cmd /c tesseract @fname.png @fname pdf"
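If you prefer a script to FORFILES (for example on Linux or Mac), the same per-page loop can be sketched in Python. This is only an illustration: the function names are hypothetical, and unlike FORFILES /S it looks only in the given folder, not in subfolders. The Tesseract arguments mirror the command above (image, output base name, then pdf).

```python
import subprocess
from pathlib import Path


def tesseract_command(png_path):
    """Build the Tesseract call for one page image.

    The output base is the image name without its extension, so
    page.png produces page.pdf next to it.
    """
    png = Path(png_path)
    return ["tesseract", str(png), str(png.with_suffix("")), "pdf"]


def ocr_pages(folder="."):
    """Run Tesseract on every PNG in the folder, in name order."""
    for png in sorted(Path(folder).glob("*.png")):
        subprocess.run(tesseract_command(png), check=True)
```

Calling ocr_pages() from the working folder should leave one PDF per PNG, ready for the merge step.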
Step 3: Merge PDF files together
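Ensure all the OCR'ed page PDFs from Step 2 are in the same directory and run the following command. The wildcard matches only those page files, so that input.pdf (which sits in the same folder) is not merged in as well.
qpdf --empty --pages intermediate-file-*.pdf -- out.pdf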
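To control exactly which files go into the merge, and in what order, you can build the qpdf file list explicitly instead of relying on a wildcard. The sketch below assumes the page PDFs share the intermediate-file prefix from Step 1 and have zero-padded page numbers, so sorting by name gives page order; the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def find_page_pdfs(folder=".", prefix="intermediate-file"):
    """Collect the per-page PDFs, skipping input.pdf and out.pdf."""
    return sorted(Path(folder).glob(prefix + "-*.pdf"))


def qpdf_merge_command(page_pdfs, output="out.pdf"):
    """Build: qpdf --empty --pages <pages in name order> -- <output>."""
    pages = sorted(str(p) for p in page_pdfs)
    return ["qpdf", "--empty", "--pages", *pages, "--", output]


def merge_pages(folder=".", output="out.pdf"):
    """Merge all page PDFs found in the folder into one file."""
    subprocess.run(qpdf_merge_command(find_page_pdfs(folder), output), check=True)
```

Calling merge_pages() in the working folder should produce out.pdf from the OCR'ed pages only.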