unicharset_extractor - generate the unicharset data file used to train tesseract, a command line OCR tool
unicharset_extractor is part of the process of training tesseract for a new language. Tesseract needs to know the set of possible characters it can output. To generate the unicharset data file, run unicharset_extractor on the bounding box files of the training pages:

    unicharset_extractor fontfile_1.box fontfile_2.box ...
This manual page briefly documents the unicharset_extractor command. tesseract is a commercial-quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005.

Tesseract needs access to the character properties isalpha, isdigit, isupper, and islower. This data must be encoded in the unicharset data file. Each line of this file corresponds to one character: the character in UTF-8 is followed by a hexadecimal number representing a binary mask that encodes the properties. Each bit corresponds to one property, and a bit set to 1 means that the property is true. The bit ordering, from least significant bit to most significant bit, is: isalpha, islower, isupper, isdigit.
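The property mask described above can be illustrated with a small Python sketch. It is not part of tesseract: the function names are hypothetical, the separator between the character and the mask is assumed to be a single space, and Python's own Unicode predicates stand in for whatever classification tesseract applies, so results may differ for unusual characters.

```python
# Bit values, least significant bit first, following the ordering
# given in this page: isalpha, islower, isupper, isdigit.
ISALPHA = 0x1
ISLOWER = 0x2
ISUPPER = 0x4
ISDIGIT = 0x8


def encode_properties(ch):
    """Build the property mask for one character (illustrative helper)."""
    mask = 0
    if ch.isalpha():
        mask |= ISALPHA
    if ch.islower():
        mask |= ISLOWER
    if ch.isupper():
        mask |= ISUPPER
    if ch.isdigit():
        mask |= ISDIGIT
    return mask


def decode_properties(mask):
    """Turn a property mask back into named boolean flags."""
    return {
        "isalpha": bool(mask & ISALPHA),
        "islower": bool(mask & ISLOWER),
        "isupper": bool(mask & ISUPPER),
        "isdigit": bool(mask & ISDIGIT),
    }


def parse_line(line):
    """Split one 'character hexmask' line (assumed space-separated)."""
    ch, hex_mask = line.rstrip("\n").rsplit(" ", 1)
    return ch, int(hex_mask, 16)


if __name__ == "__main__":
    for ch in "aA7;":
        # Print each character with its mask as a hexadecimal number.
        print(ch, format(encode_properties(ch), "x"))
```

Under these assumptions a lowercase letter sets isalpha and islower (mask 3), an uppercase letter sets isalpha and isupper (mask 5), a digit sets only isdigit (mask 8), and punctuation sets nothing (mask 0).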


feh(1), convert(1), mftraining(1), cntraining(1), tesseract(1), wordlist2dawg(1).


tesseract was written by Ray Smith. This manual page was written by Jeffrey Ratcliffe <Jeffrey.Ratcliffe@gmail.com>, for the Debian project (but may be used by others).

August 21, 2007 TESSERACT(1)

