At first I tried using a GUI application called gImageReader. However, it failed to impress me. Then, I tried tesseract, a command line tool for performing OCR on image files.
To recap, I had a scanned PDF document that I wanted to turn into a searchable PDF, because a PDF that is not text-searchable is a pretty useless PDF, frankly speaking.
On Ubuntu 18.04 LTS, I installed tesseract with the following command:
-- install tesseract from official repo
$ sudo apt install tesseract-ocr
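A quick check after installing can confirm that the binary is on the PATH and show which language data came with it. This is only a minimal sketch: have_cmd is a hypothetical helper name, while --version and --list-langs are options of the tesseract command itself.

```shell
# have_cmd is a hypothetical helper: it succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd tesseract; then
    tesseract --version      # prints the installed version
    tesseract --list-langs   # lists the installed language data, e.g. eng
else
    echo "tesseract not found; install it with: sudo apt install tesseract-ocr"
fi
```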
As of today, the official repository gave me tesseract v4.0.0-beta.1. Pretty recent I would say. Per current source code, tesseract is now at v4.0.0 (full release version). Note that tesseract can’t read PDF files directly. The recommended way to get started is to convert the PDF into a .tiff file with ImageMagick’s convert tool.
-- use ImageMagick's convert tool
$ convert -density 300 input.pdf -depth 8 -strip -background white -alpha off intermediate.tiff
Here, -density 300 and -depth 8 control the resolution and quality of the image: 300 DPI at 8 bits per channel. The -strip -background white -alpha off options remove the alpha channel and turn the background white. Tesseract kinda needs this. input.pdf is the PDF input, and intermediate.tiff is the file that we feed into tesseract later on.
At first, the convert tool gave me an error message about a usage restriction: by default, convert is not allowed to work with PDF files. To relax this restriction, the policy file at /etc/ImageMagick-6/policy.xml needs to be updated.
-- open the file
$ sudo vim /etc/ImageMagick-6/policy.xml
I searched for PDF, which returned line 76. I changed the original policy from
<policy domain="coder" rights="none" pattern="PDF" />
… to this:
<policy domain="coder" rights="read|write" pattern="PDF" />
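The same change can be made without opening an editor. The sketch below is one possible approach, not the only one: allow_pdf_policy is a hypothetical helper name, and it assumes the policy line looks exactly like the one shown above. It keeps a .bak copy of the file so the edit can be undone.

```shell
# allow_pdf_policy is a hypothetical helper name.
# It rewrites the PDF coder policy in the given file and keeps a .bak backup.
# Assumption: the original line matches the one quoted in the post exactly.
allow_pdf_policy() {
    sed -i.bak \
        's/rights="none" pattern="PDF"/rights="read|write" pattern="PDF"/' \
        "$1"
}

# Try it on a copy first, then inspect the result:
#   cp /etc/ImageMagick-6/policy.xml /tmp/policy.xml
#   allow_pdf_policy /tmp/policy.xml
#   grep 'pattern="PDF"' /tmp/policy.xml
```

The system-wide file is owned by root, so applying the edit there means running the same sed expression with sudo against /etc/ImageMagick-6/policy.xml.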
Caution: the intermediate file can get large. I tried this on a 739.4 KB PDF file and got a 134.3 MB .tiff file.
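The size jump is less surprising after some rough arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches), 8-bit RGB (3 bytes per pixel), and no compression. None of these are stated above, so treat the result as an estimate rather than an exact account of my file.

```shell
# Assumptions: US Letter page, 8-bit RGB, uncompressed pixels.
dpi=300
width_px=$(( dpi * 85 / 10 ))                   # 8.5 inches wide at 300 DPI
height_px=$(( dpi * 11 ))                       # 11 inches tall at 300 DPI
bytes_per_page=$(( width_px * height_px * 3 ))  # 3 bytes per RGB pixel
echo "$bytes_per_page"                          # prints 25245000
```

At roughly 25 MB per page under these assumptions, it only takes a handful of pages to land in the range of the 134.3 MB file I got.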
Okay, next phase: creating a searchable PDF from intermediate.tiff.
-- run tesseract
$ tesseract intermediate.tiff output -l eng pdf
Essentially, this command takes the intermediate.tiff file, runs recognition with the eng (English) language data, then spits the result out as output.pdf.
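The two steps can be wrapped into one function so the large intermediate file is cleaned up automatically. This is a sketch under assumptions: make_searchable_pdf is a hypothetical name, and it simply chains the same convert and tesseract commands shown above, using a temporary directory for the .tiff file.

```shell
# make_searchable_pdf is a hypothetical wrapper around the two commands above.
# Usage: make_searchable_pdf input.pdf output   (writes output.pdf)
make_searchable_pdf() {
    input_pdf=$1
    output_base=$2
    work_dir=$(mktemp -d) || return 1
    tiff="$work_dir/intermediate.tiff"

    # Step 1: render the PDF to a TIFF. Step 2: OCR it into a searchable PDF.
    if convert -density 300 "$input_pdf" -depth 8 -strip \
            -background white -alpha off "$tiff" &&
       tesseract "$tiff" "$output_base" -l eng pdf
    then
        status=0
    else
        status=1
    fi

    rm -rf "$work_dir"   # the intermediate file can be large, so remove it
    return "$status"
}
```

Calling make_searchable_pdf input.pdf output should then leave output.pdf in the current directory, with a non-zero exit status if either step fails.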
Done.