As some of you may know, I am trying to scan a whole book written by my grandfather, as I plan to publish it on the internet. It is a very inspiring book which could change our society.
So, first things first: I need to scan the whole book and run it through Optical Character Recognition (OCR) software as a first step towards putting it into HTML form.
This is the main documentation wiki for Ubuntu OCR.
I only had a test scan in .pdf format to test with, so I followed the instructions to use GIMP to transform it into a TIFF (.tif) file.
I used Tesseract to OCR the resulting image but got an empty file as a result.
I found this issue, which was fixed in Tesseract 2.04, a newer version than the one packaged in Ubuntu. So I had to compile it from source.
I downloaded the 2.04 source, and the French and English language files.
I tried ./configure but got:
tesseract-2.04$ ./configure
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking for cl.exe... no
checking for g++... no
checking for C++ compiler default output file name...
configure: error: C++ compiler cannot create executables
Apparently, I was missing the g++ compiler.
I installed the build-essential package and ran ./configure again, which worked this time.
I then ran make, which ended with the following error:
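For anyone who wants to check for the compiler before running ./configure, here is a minimal sketch. The helper name have_tool is my own invention for illustration, and the apt-get hint assumes an Ubuntu or Debian system.

```shell
#!/bin/sh
# have_tool is a hypothetical helper (not part of any package):
# it succeeds if the named command can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool g++; then
    echo "g++ found: $(command -v g++)"
else
    # Assumption: Ubuntu/Debian, where build-essential provides g++.
    echo "g++ missing; try: sudo apt-get install build-essential"
fi
```

If it reports that g++ is missing, ./configure will most likely stop with the same "C++ compiler cannot create executables" error shown above.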
g++ -DHAVE_CONFIG_H -I. -I.. -g -O2 -MT svutil.o -MD -MP -MF .deps/svutil.Tpo -c -o svutil.o svutil.cpp
svutil.cpp: In constructor ‘SVNetwork::SVNetwork(const char*, int)’:
svutil.cpp:323: error: ‘snprintf’ was not declared in this scope
make[3]: *** [svutil.o] Error 1
It was a known issue which had only been fixed in the upcoming 3.0 version.
The solution:
Edit "viewer/svutil.cpp"
and add the line below at about line 35:
#include <stdio.h>
Then run make again.
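The manual edit above could also be scripted. This is only a sketch: add_stdio_include is a hypothetical helper name, and it places the new line just before the first existing #include in the file, which is my assumption about where "about line 35" falls. Check the result by eye before running make.

```shell
#!/bin/sh
# add_stdio_include FILE
# Hypothetical helper: inserts "#include <stdio.h>" just before the first
# existing #include line of FILE, unless the file already has that include.
add_stdio_include() {
    file="$1"
    # Already patched? Then leave the file alone.
    if grep -q '^#include <stdio.h>' "$file"; then
        return 0
    fi
    tmp="$(mktemp)"
    awk '!added && /^#include/ { print "#include <stdio.h>"; added = 1 }
         { print }' "$file" > "$tmp"
    # Write back through cat so the original file keeps its permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Only patch when run from the top of the tesseract-2.04 source tree.
if [ -f viewer/svutil.cpp ]; then
    add_stdio_include viewer/svutil.cpp
fi
```

Running it twice is harmless, since the second run finds the include already there and does nothing.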
Lastly, run:
sudo make install
Note that the Ubuntu package installs the tesseract binary at /usr/bin/tesseract, while the manually compiled one is installed at /usr/local/bin/tesseract. Make sure to use the full path to call the version you want.
Unfortunately, compiling Tesseract didn't help me. I need to do a better scan of the book, directly to TIFF format this time.
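To see which copies are actually present, and which one a bare "tesseract" would pick up, something like the following sketch may help. The function name list_on_path is hypothetical; it just walks the PATH in lookup order.

```shell
#!/bin/sh
# list_on_path NAME
# Hypothetical helper: prints every executable file called NAME found in
# the directories of $PATH, in lookup order (the first one printed is the
# one the shell would run). The body runs in a subshell so that the IFS
# and globbing changes do not leak into the caller.
list_on_path() (
    name="$1"
    set -f          # no filename expansion while splitting $PATH
    IFS=:
    for dir in $PATH; do
        dir="${dir:-.}"   # an empty PATH entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            echo "$dir/$name"
        fi
    done
)

list_on_path tesseract
```

If both the packaged and the compiled versions are installed, this should list two paths; it prints nothing if tesseract is not on the PATH at all.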
I still posted this here, hoping it would help someone. If it did, do post a comment below. Spammers may abstain: comments are moderated.
OCR / Scan documentation
See the new OCR and scanner documentation pages:
http://linux.overshoot.tv/wiki/ocr_optical_character_recognition
http://linux.overshoot.tv/wiki/scanners