To recap, I had a scanned PDF document that I wanted to turn into a searchable PDF, because a PDF that is not text-searchable is, frankly speaking, a pretty useless PDF.
On Ubuntu 18.04 LTS, I installed tesseract with the following command:
-- install tesseract from official repo
$ sudo apt install tesseract-ocr
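To confirm that the install worked and to see which version the package provides, a small check along these lines can help. This is only a sketch: check_tesseract is a hypothetical helper name, not something that ships with tesseract.

```shell
# Hypothetical helper: report the installed tesseract version, if any.
check_tesseract() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found; install tesseract-ocr first" >&2
        return 1
    fi
    # Merge stderr into stdout in case the build prints the version there,
    # and keep only the first line (the rest lists bundled libraries).
    tesseract --version 2>&1 | head -n 1
}
```

Running check_tesseract right after the install prints the version line, or an error if the binary is not on the PATH.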
As of today, the official repository gave me tesseract v4.0.0-beta.1. Pretty recent, I would say. Per the current source code, tesseract is now at v4.0.0 (full release version). Note that tesseract can't work directly with PDF files. The recommended way to get started is to convert the PDF into a .tiff file with ImageMagick's convert tool:
-- use ImageMagick's convert tool
$ convert -density 300 input.pdf -depth 8 -strip -background white -alpha off intermediate.tiff
-density 300 and -depth 8 control the resolution and quality of the image.
-strip -background white -alpha off removes the alpha channel and turns the background white. Tesseract kinda needs this.
input.pdf is the PDF input and intermediate.tiff is the file that we feed into tesseract.
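The conversion step can be wrapped in a small function so that the flags live in one place. A minimal sketch, assuming ImageMagick 6's convert is on the PATH; pdf_to_tiff is a hypothetical name chosen for illustration.

```shell
# Hypothetical helper: rasterize a PDF into a TIFF that tesseract can read.
# Usage: pdf_to_tiff input.pdf intermediate.tiff
pdf_to_tiff() {
    input_pdf="$1"
    output_tiff="$2"
    if [ ! -f "$input_pdf" ]; then
        echo "input not found: $input_pdf" >&2
        return 1
    fi
    # Same flags as the command above: 300 dpi, 8 bits per channel,
    # no alpha channel, white background.
    convert -density 300 "$input_pdf" -depth 8 -strip \
        -background white -alpha off "$output_tiff"
}
```

Checking for the input file first gives a clearer error than the one convert prints when the path is wrong.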
At first, the convert tool gave me an error message about a usage restriction: by default, convert is not allowed to work with PDF files. To relax this restriction, the policy file at /etc/ImageMagick-6/policy.xml needs to be updated.
-- open the file
$ sudo vim /etc/ImageMagick-6/policy.xml
I searched for the PDF entry and changed this:
<policy domain="coder" rights="none" pattern="PDF" />
… to this:
<policy domain="coder" rights="read|write" pattern="PDF" />
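The same change can be made without opening an editor. The following is a sketch only: enable_pdf_policy is a hypothetical helper, and it assumes the policy line looks exactly like the one shown above.

```shell
# Hypothetical helper: switch the PDF coder policy from "none" to "read|write".
# Usage: enable_pdf_policy /path/to/policy.xml
enable_pdf_policy() {
    policy_file="$1"
    tmp_file="$(mktemp)" || return 1
    # Only rewrite lines that mention pattern="PDF"; other coders stay restricted.
    sed '/pattern="PDF"/ s/rights="none"/rights="read|write"/' \
        "$policy_file" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }
    # Write back through cat so the original file keeps its owner and mode.
    cat "$tmp_file" > "$policy_file"
    rm -f "$tmp_file"
}
```

Because the real file belongs to root, one option is to run the helper on a copy and then put the copy back with sudo cp. Afterwards, convert -list policy should show the PDF entry with the new rights.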
Caution: the intermediate file can get big. I tried this on a 739.4 KB PDF file and got an output .tiff file of 134.3 MB.
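A rough back-of-the-envelope calculation shows why the file balloons. The sketch below estimates the uncompressed size of one rasterized page; the US Letter page size and 8-bit RGB pixels are assumptions for illustration, not properties of the file above, and raster_bytes is a hypothetical helper.

```shell
# Hypothetical helper: uncompressed size in bytes of one rasterized page.
# Arguments: page width and height in tenths of an inch, dots per inch,
# and bytes per pixel (3 for 8-bit RGB, 1 for 8-bit grayscale).
raster_bytes() {
    width_px=$(( $1 * $3 / 10 ))
    height_px=$(( $2 * $3 / 10 ))
    echo $(( width_px * height_px * $4 ))
}

# Assumed example: US Letter (8.5 x 11 in) at 300 dpi, 8-bit RGB
raster_bytes 85 110 300 3   # prints 25245000
```

That is about 25 MB per page before any compression, so a scan with several pages can reach the hundred-megabyte range quickly. The real number depends on the page size, the color mode, and whether the TIFF is compressed.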
Okay, next phase: creating a searchable PDF from the .tiff file.
-- run tesseract
$ tesseract intermediate.tiff output -l eng pdf
Essentially, this command takes the intermediate.tiff file, runs recognition with the specified eng language, then spits it out as output.pdf.
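Both phases can be chained into one function so that the large intermediate file is cleaned up automatically. This is a sketch under the same assumptions as the commands above (ImageMagick 6's convert and tesseract on the PATH); make_searchable_pdf is a hypothetical name.

```shell
# Hypothetical helper: scanned PDF in, searchable PDF out.
# Usage: make_searchable_pdf input.pdf output [language]
# tesseract appends ".pdf" to the output base name itself.
make_searchable_pdf() {
    input_pdf="$1"
    output_base="$2"
    language="${3:-eng}"
    work_dir="$(mktemp -d)" || return 1
    tiff="$work_dir/intermediate.tiff"
    status=0
    # Run tesseract only if convert succeeded; remember the first failure.
    convert -density 300 "$input_pdf" -depth 8 -strip \
            -background white -alpha off "$tiff" \
        && tesseract "$tiff" "$output_base" -l "$language" pdf \
        || status=$?
    # Remove the intermediate TIFF whether or not the run succeeded.
    rm -rf "$work_dir"
    return "$status"
}
```

Calling make_searchable_pdf input.pdf output would then mirror the two manual steps, leaving only output.pdf behind.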