
Compile Tesseract OCR on Kubuntu

As some of you may know, I am trying to scan a whole book written by my grandfather, as I plan to publish it on the internet. It is a very inspiring book which could change our society.

So, first things first: I need to scan the whole book and run Optical Character Recognition (OCR) software on it, as a first step towards putting it into HTML form.

This is the main documentation wiki for Ubuntu OCR.

I only had a test scan in .pdf format, so I followed the instructions and used GIMP to convert it to a TIFF (.tif) file.

I used Tesseract to OCR the resulting image, but got an empty file as a result.
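As a rough sketch of that step, the snippet below shows one way to run the OCR and detect an empty result. The file names scan.tif and scan.txt and the helper ocr_result_is_empty are my own placeholders, not something from the wiki, and the tesseract call is left commented out because it needs a real image.

```shell
# Hypothetical helper: succeed if the OCR text file is missing or has zero size.
ocr_result_is_empty() {
    [ ! -s "$1" ]
}

# Assumed file names; tesseract appends .txt to the output base name.
# tesseract scan.tif scan -l eng
# ocr_result_is_empty scan.txt && echo "OCR produced no text"
```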

I found this issue, which was fixed in Tesseract 2.04, a newer version than the one packaged for Ubuntu. So I had to compile it from source.

I downloaded the 2.04 source, along with the French and English language files.
I ran ./configure but got:

tesseract-2.04$ ./configure
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking for cl.exe... no
checking for g++... no
checking for C++ compiler default output file name...
configure: error: C++ compiler cannot create executables

Apparently, I was missing the g++ compiler.
I installed the build-essential package and ran ./configure again, which worked this time.
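A quick check like the one below could have saved me the failed ./configure run. It is only a sketch: have_command is a helper name I made up, and the install hint assumes an apt-based system such as Kubuntu.

```shell
# Hypothetical helper: succeed if the named command is on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command g++; then
    echo "g++ is available"
else
    echo "g++ is missing; try: sudo apt-get install build-essential"
fi
```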

I then ran make, which ended with the following error:

g++ -DHAVE_CONFIG_H -I. -I..     -g -O2 -MT svutil.o -MD -MP -MF .deps/svutil.Tpo -c -o svutil.o svutil.cpp
svutil.cpp: In constructor ‘SVNetwork::SVNetwork(const char*, int)’:
svutil.cpp:323: error: ‘snprintf’ was not declared in this scope
make[3]: *** [svutil.o] Error 1

This was a known issue, and the fix had only gone into the upcoming 3.0 version.

The solution: snprintf is declared in <stdio.h>, which this file evidently did not include.
Edit "viewer/svutil.cpp" and add the following line at about line 35:

#include <stdio.h>

Then run make again.
Finally, run:
sudo make install
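If you prefer not to edit the file by hand, the fix could be scripted roughly as follows. This is a sketch under my own assumptions: add_stdio_include is a made-up helper, and it puts the new line just before the first existing #include rather than at line 35 exactly, which should be equivalent for this purpose.

```shell
# Hypothetical helper: add '#include <stdio.h>' to a C++ source file, once.
add_stdio_include() {
    file="$1"
    # Nothing to do if the header is already included.
    if grep -q '^#include <stdio.h>' "$file"; then
        return 0
    fi
    # Print the new include just before the first existing #include line.
    awk '!seen && /^#include/ { print "#include <stdio.h>"; seen = 1 } { print }' \
        "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Usage, from the tesseract-2.04 directory:
# add_stdio_include viewer/svutil.cpp && make && sudo make install
```

Running the helper a second time leaves the file unchanged, so it is safe to repeat.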

Note that the Ubuntu package installs the tesseract binary at /usr/bin/tesseract, while the manually compiled one ends up at /usr/local/bin/tesseract. Use the full path to make sure you call the version you want.
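To see which of the two copies are actually present, something like this works. list_executables is a helper name I invented for the example; the two paths are the ones mentioned above.

```shell
# Hypothetical helper: print each given path that is an executable file.
list_executables() {
    for path in "$@"; do
        if [ -x "$path" ] && [ ! -d "$path" ]; then
            echo "$path"
        fi
    done
}

# The packaged and the locally compiled locations.
list_executables /usr/bin/tesseract /usr/local/bin/tesseract
```

Running command -v tesseract additionally shows which copy the shell picks when no full path is given.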

Unfortunately, compiling Tesseract didn't help me. I need to do a better scan of the book, directly to TIFF format this time.
I still posted this here, hoping it would help someone. If it did, do post a comment below. Spammers may abstain: comments are moderated.

