Using Tesseract to produce hOCR PDF

Sunday January 27, 2019

At first I tried a GUI application called gImageReader, but it failed to impress me. Then I tried tesseract, a command-line tool for performing OCR on image files.

To recap, I had a scanned PDF document that I wanted to turn into a searchable PDF, because a PDF that is not text-searchable is a pretty useless PDF, frankly speaking.

On Ubuntu 18.04 LTS, I installed tesseract with the following command:

-- install tesseract from official repo
$ sudo apt install tesseract-ocr
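
To confirm what actually got installed, a quick check like the one below may help. This is a minimal sketch: check_tool is a hypothetical helper name (not part of tesseract or ImageMagick), while --version and --list-langs are standard tesseract options.

```shell
# Hypothetical helper: prints "found" if a command is on PATH, else "missing".
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

echo "tesseract: $(check_tool tesseract)"
echo "convert:   $(check_tool convert)"

# If tesseract is present, show its version and the installed language data.
if [ "$(check_tool tesseract)" = "found" ]; then
    tesseract --version
    tesseract --list-langs
fi
```

If eng does not show up in the language list, the recognition step further down will most likely fail, so it is worth checking before converting anything.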

As of today, the official repository gave me tesseract v4.0.0-beta.1, which is pretty recent I would say. Per the current source code, tesseract is now at v4.0.0 (the full release). Note that tesseract can’t take a PDF file as input. The recommended way to get started is to convert the PDF into a .tiff file with ImageMagick’s convert tool.

-- use ImageMagick's convert tool
$ convert -density 300 input.pdf -depth 8 -strip -background white -alpha off intermediate.tiff

At first, convert gave me an error about a usage restriction: by default, its security policy does not allow it to work with PDF files. To relax this restriction, the policy file at /etc/ImageMagick-6/policy.xml needs to be updated.

-- open the file
$ sudo vim /etc/ImageMagick-6/policy.xml

I searched for PDF, which returned line 76. I changed the original policy from

<policy domain="coder" rights="none" pattern="PDF" />

… to this:

<policy domain="coder" rights="read|write" pattern="PDF" />
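
The same change could also be scripted with sed instead of editing by hand. This is only a sketch: allow_pdf is a hypothetical helper name, and it assumes the policy line looks exactly like the one shown above. The -i.bak option keeps a backup copy next to the edited file.

```shell
# Hypothetical helper: on the line that mentions pattern="PDF",
# change rights="none" to rights="read|write". A .bak backup is kept.
allow_pdf() {
    sed -i.bak '/pattern="PDF"/s/rights="none"/rights="read|write"/' "$1"
}

# Example (the real policy file needs root, so the same sed expression
# would have to be run with sudo against /etc/ImageMagick-6/policy.xml):
#   allow_pdf ./policy-copy.xml
```

Trying it on a copy of the policy file first is a sensible way to confirm that only the PDF line changes.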

Caution: the intermediate file can be huge. I tried this on a 739.4 KB PDF file and got a 134.3 MB .tiff file as output.
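
A rough back-of-the-envelope estimate shows why the file balloons once each page becomes a raster image. The sketch below assumes an uncompressed image; the page size (US Letter) and 3 bytes per pixel are assumptions for illustration, not measurements of my file, and page_bytes is a hypothetical helper name.

```shell
# Rough size of one uncompressed page, in bytes:
#   (width_in * dpi) * (height_in * dpi) * bytes_per_pixel
page_bytes() {
    awk -v w="$1" -v h="$2" -v d="$3" -v b="$4" \
        'BEGIN { printf "%d\n", (w * d) * (h * d) * b }'
}

# Assumed example: US Letter (8.5 x 11 in) at 300 dpi, 8-bit RGB.
page_bytes 8.5 11 300 3   # prints 25245000
```

That is roughly 25 MB per page under these assumptions, so a handful of pages adds up quickly. ImageMagick's -compress option (for example, -compress lzw) may shrink the intermediate file, though that is untested here.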

Okay, next phase: creating a searchable PDF from the intermediate.tiff.

-- run tesseract
$ tesseract intermediate.tiff output -l eng pdf

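Essentially, this command takes the intermediate.tiff file, runs recognition using the eng (English) language data, and writes the result as output.pdf. The second argument, output, is the base name of the output file; the trailing pdf tells tesseract to produce a PDF and supplies the .pdf extension.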
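
The two steps can be wrapped into one small script. This is a sketch rather than a polished tool: make_searchable and ocr_basename are hypothetical names, the "-ocr" suffix is an arbitrary choice, and the convert and tesseract invocations are the same ones used above.

```shell
# Derive the output base name: "scan.pdf" -> "scan-ocr".
ocr_basename() {
    printf '%s-ocr\n' "${1%.pdf}"
}

# Convert a scanned PDF to an intermediate TIFF, run OCR on it,
# then delete the (large) intermediate file if both steps succeeded.
make_searchable() {
    input=$1
    lang=${2:-eng}            # language defaults to eng
    base=$(ocr_basename "$input")
    tiff="$base.tiff"
    convert -density 300 "$input" -depth 8 -strip \
        -background white -alpha off "$tiff" &&
    tesseract "$tiff" "$base" -l "$lang" pdf &&
    rm -f "$tiff"
}

# Usage:
#   make_searchable input.pdf        # writes input-ocr.pdf
```

Writing to a separate "-ocr" name keeps the original scan untouched, which is handy if the recognition quality turns out to be poor.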

Done.