ocrodjvu - OCR for DjVu files


ocrodjvu {-o | --save-bundled} output-djvu-file [option...] djvu-file
ocrodjvu {-i | --save-indirect} index-djvu-file [option...] djvu-file
ocrodjvu --save-script script-file [option...] djvu-file
ocrodjvu --in-place [option...] djvu-file
ocrodjvu --dry-run [option...] djvu-file
ocrodjvu {--version | --help | -h | --list-engines | --list-languages}
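Each processing form in the synopsis uses exactly one output mode. As a minimal sketch, a script could assemble such a command line as follows. The helper name build_command, its parameters, and the file names are hypothetical and not part of ocrodjvu; only the option names are taken from this page.

```python
def build_command(djvu_file, mode, target=None, extra_options=()):
    """Return an argument list for ocrodjvu with exactly one output mode.

    Hypothetical helper: only the option names come from the manual page.
    """
    modes_with_target = {"save-bundled", "save-indirect", "save-script"}
    modes_without_target = {"in-place", "dry-run"}
    if mode in modes_with_target:
        if target is None:
            raise ValueError(f"--{mode} requires a file name")
        output = [f"--{mode}={target}"]
    elif mode in modes_without_target:
        if target is not None:
            raise ValueError(f"--{mode} takes no file name")
        output = [f"--{mode}"]
    else:
        raise ValueError(f"unknown output mode: {mode!r}")
    # Order follows the synopsis: output option, other options, input file.
    return ["ocrodjvu", *output, *extra_options, djvu_file]


# Example file names and language code are placeholders.
print(build_command("book.djvu", "save-bundled", "book-ocr.djvu",
                    ["--language=deu", "--pages=1-10"]))
```

The resulting list could then be handed to something like subprocess.run, provided ocrodjvu is installed.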


ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files. The following OCR engines are supported:

o OCRopus[1] (internally, ocrodjvu calls ocroscript's recognize (or rec-tess) command, so that ultimately Tesseract acts as the OCR backend);

o Cuneiform for Linux[2].


OCR engine options

    --engine=engine-id
        Use this OCR engine. The default is 'ocropus' (OCRopus).

    --list-engines
        Print list of available OCR engines.

Options controlling output

    It is mandatory to use exactly one of the following options:

    -o, --save-bundled=output-djvu-file
        Save OCR results as a bundled multi-page document into output-djvu-file.

    -i, --save-indirect=index-djvu-file
        Save OCR results as an indirect multi-page document. Use index-djvu-file as the index file name; put the component files into the same directory. The directory must exist and be writable.

    --save-script=script-file
        Save a djvused script with OCR results into script-file.

    --in-place
        Save OCR results in place. (Use this option to retain compatibility with ocrodjvu < 0.2.)

    --dry-run
        Don't change any files; throw OCR results away.

Text segmentation options

    -t lines, --details=lines
        Record the location of every line. Don't record locations of particular words or characters. This is the default for OCRopus 0.2.

    -t words, --details=words
        Record the location of every line and every word. Don't record locations of particular characters. This is the default for OCRopus >= 0.3.1 and for Cuneiform. This option is ineffective with OCRopus 0.2.

    -t chars, --details=chars
        Record the location of every line, every word and every character. This option is ineffective with OCRopus 0.2.

    --word-segmentation=simple
        Consider each non-empty sequence of non-whitespace characters a single word. This is the default, despite being linguistically incorrect.

    --word-segmentation=uax29
        Use the Unicode Text Segmentation[3] algorithm to break lines into words. This option breaks the assumption of some DjVu tools that words are separated by spaces, and is therefore not recommended.

Other options

    --clear-text
        Remove existing hidden text if present in the pages not selected for OCR. (Use this option to retain compatibility with ocrodjvu < 0.2.)

    --ocr-only
        Don't save pages that were not processed.

    --language=language-id
        Set recognition language. language-id is typically an ISO 639-2 three-letter code. For OCRopus, the default is 'eng' (English), unless the tesslanguage environment variable is set. For other OCR engines, the default is always 'eng'.

    --list-languages
        Print list of available languages for the currently selected OCR engine.

    --render=mask
        Render only masks of page images. This is the default.

    --render=foreground
        Render only foreground layers of page images.

    --render=all
        Render all layers of page images. This option is necessary to OCR DjVu files with invalid foreground/background separation.

    -p, --pages=page-range
        Specifies pages to process. page-range is a comma-separated list of sub-ranges. Each sub-range is either a single page (e.g. 17) or a contiguous range of pages (e.g. 37-42). Pages are numbered from 1. The default is to process all pages.

    -j, --jobs=n
        Start up to n OCR processes.

    -D, --debug
        To ease debugging, don't delete intermediate files.

    --version
        Output version information and exit.

    -h, --help
        Display help and exit.
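The page-range syntax accepted by -p/--pages can be illustrated with a short sketch. This is not ocrodjvu's own parser: the function name parse_page_range and the page_count parameter are hypothetical. Rejecting reversed or out-of-range sub-ranges and merging duplicate pages are assumptions made here, since this page does not say how ocrodjvu handles those cases.

```python
def parse_page_range(spec, page_count):
    """Expand a page-range string such as "17,37-42" into sorted page numbers.

    Illustrative only; the error handling is an assumption, not ocrodjvu's
    documented behaviour.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A contiguous range, e.g. "37-42".
            first, last = (int(x) for x in part.split("-", 1))
        else:
            # A single page, e.g. "17".
            first = last = int(part)
        # Pages are numbered from 1.
        if not 1 <= first <= last <= page_count:
            raise ValueError(f"invalid sub-range: {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)


print(parse_page_range("17,37-42", 50))
# prints [17, 37, 38, 39, 40, 41, 42]
```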
The following environment variables affect ocrodjvu:

    tesslanguage
        Recognition language for Tesseract. (Use of this variable is deprecated in favor of the --language option.)

    TMPDIR
        Directory for temporary files. The default is /tmp.
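How the default recognition language depends on the engine and on tesslanguage can be sketched as follows. The function default_language is hypothetical and only restates the rule given for --language. The engine-id 'cuneiform' in the example is an assumption (this page names only 'ocropus'), and passing an empty tesslanguage value through unchanged is likewise an assumption.

```python
import os


def default_language(engine, environ=None):
    """Return the language used when --language is not given.

    Hypothetical helper restating the documented rule; not ocrodjvu code.
    """
    environ = os.environ if environ is None else environ
    if engine == "ocropus":
        # Only OCRopus honours the (deprecated) tesslanguage variable.
        return environ.get("tesslanguage", "eng")
    # Every other engine always defaults to English.
    return "eng"


print(default_language("ocropus", {"tesslanguage": "deu"}))    # prints deu
print(default_language("cuneiform", {"tesslanguage": "deu"}))  # prints eng
```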
djvu(1), ocroscript(1), tesseract(1)


Jakub Wilk <jwilk@jwilk.net> Author.


Copyright (C) 2008, 2009, 2010 Jakub Wilk


1. OCRopus
   //ocropus.googlecode.com/

2. Cuneiform for Linux
   //launchpad.net/cuneiform-linux

3. Unicode Text Segmentation
   //unicode.org/reports/tr29/

OCRODJVU(1)

