NAME
tesseract - command-line OCR engine
SYNOPSIS
tesseract FILE OUTPUTBASE [OPTIONS]... [CONFIGFILE]...
DESCRIPTION
tesseract is a commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed at Google since then.
IN/OUT ARGUMENTS
FILE
The name of the input file. This can either be
an image file or a text file.
Most image file formats (anything readable by Leptonica) are supported.
A text file lists the names of all input images (one image name per line). The
results will be combined in a single file for each output file format (txt,
pdf, hocr, xml).
If FILE is stdin or - then the standard input is used.
OUTPUTBASE
The basename of the output file (to which the
appropriate extension will be appended). By default the output will be a text
file with .txt added to the basename unless there are one or more parameters
set which explicitly specify the desired output.
If OUTPUTBASE is stdout or - then the standard output is used.
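The text-file form of FILE can be sketched in Python as follows. This is a minimal illustration, not part of tesseract: the helper name write_image_list, the page image names, and the output basename "combined" are all hypothetical. The snippet only writes the list file and builds the argument list; it does not run tesseract.

```python
import tempfile
from pathlib import Path


def write_image_list(list_path, image_names):
    """Write one image name per line, the layout described for a text FILE."""
    path = Path(list_path)
    path.write_text("".join(f"{name}\n" for name in image_names), encoding="utf-8")
    return path


# Hypothetical page images; replace these with real paths.
pages = ["page1.png", "page2.png", "page3.png"]
list_file = write_image_list(Path(tempfile.mkdtemp()) / "pages.txt", pages)

# Argument list for OCR of every listed image, with the results combined
# under the output basename "combined" (a .txt file by default).
command = ["tesseract", str(list_file), "combined"]
print(" ".join(command))
```

The resulting list can be passed to subprocess.run on a machine where tesseract and the listed images are available.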
OPTIONS
-c CONFIGVAR=VALUE
Set value for parameter CONFIGVAR to
VALUE. Multiple -c arguments are allowed.
--dpi N
Specify the resolution N in DPI for the
input image(s). A typical value for N is 300. Without this option, the
resolution is read from the metadata included in the image. If an image does
not include that information, Tesseract tries to guess it.
-l LANG, -l SCRIPT
The language or script to use. If none is
specified, eng (English) is assumed. Multiple languages may be specified,
separated by plus characters. Tesseract uses 3-character ISO 639-2 language
codes (see LANGUAGES AND SCRIPTS).
--psm N
Set Tesseract to only run a subset of layout
analysis and assume a certain form of image. The options for N are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
11 = Sparse text. Find as much text as possible in no particular order.
12 = Sparse text with OSD.
13 = Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
--oem N
Specify OCR Engine mode. The options for
N are:
0 = Original Tesseract only.
1 = Neural nets LSTM only.
2 = Tesseract + LSTM.
3 = Default, based on what is available.
--tessdata-dir PATH
Specify the location of tessdata path.
--user-patterns FILE
Specify the location of user patterns
file.
--user-words FILE
Specify the location of user words file.
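The options above can be combined on one command line. The Python sketch below assembles an argument list in the order given in the SYNOPSIS. The function name build_command, the file names, and the sample parameter tessedit_char_whitelist are illustrative assumptions; the range checks simply mirror the values of N listed for --psm and --oem.

```python
def build_command(image, outputbase, langs=("eng",), psm=None, oem=None,
                  dpi=None, params=None, configs=()):
    """Assemble: tesseract FILE OUTPUTBASE [OPTIONS]... [CONFIGFILE]..."""
    cmd = ["tesseract", image, outputbase]
    if langs:
        # Multiple languages are joined with plus characters, e.g. eng+deu.
        cmd += ["-l", "+".join(langs)]
    if psm is not None:
        if not 0 <= psm <= 13:
            raise ValueError("--psm expects a value from 0 to 13")
        cmd += ["--psm", str(psm)]
    if oem is not None:
        if not 0 <= oem <= 3:
            raise ValueError("--oem expects a value from 0 to 3")
        cmd += ["--oem", str(oem)]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    # Each parameter becomes its own -c CONFIGVAR=VALUE argument.
    for name, value in (params or {}).items():
        cmd += ["-c", f"{name}={value}"]
    # Config names come last, after all options.
    cmd += list(configs)
    return cmd


# Hypothetical invocation: a single uniform block of text (psm 6),
# LSTM only (oem 1), English plus German, 300 DPI, PDF output.
example = build_command(
    "scan.png", "out", langs=("eng", "deu"), psm=6, oem=1, dpi=300,
    params={"tessedit_char_whitelist": "0123456789"}, configs=("pdf",),
)
print(" ".join(example))
```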
CONFIGFILE
The name of a config to use. The name can be a
file in tessdata/configs or tessdata/tessconfigs, or an absolute or relative
file path. A config is a plain text file which contains a list of parameters
and their values, one per line, with a space separating parameter from value.
Interesting config files include:
• alto - Output in ALTO format (OUTPUTBASE.xml).
• hocr - Output in hOCR format (OUTPUTBASE.hocr).
• pdf - Output PDF (OUTPUTBASE.pdf).
• tsv - Output TSV (OUTPUTBASE.tsv).
• txt - Output plain text (OUTPUTBASE.txt).
• get.images - Write processed input images to file (OUTPUTBASE.processedPAGENUMBER.tif).
• logfile - Redirect debug messages to file (tesseract.log).
• lstm.train - Output files used by LSTM training (OUTPUTBASE.lstmf).
• makebox - Write box file (OUTPUTBASE.box).
• quiet - Redirect debug messages to /dev/null.
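The mapping from config name to output file in the list above can be written down as a small lookup. The Python sketch below is an illustration only: the names OUTPUT_EXTENSIONS and expected_outputs are hypothetical, and a custom config file could set output parameters that this table does not know about.

```python
# Extensions taken from the config list above.
OUTPUT_EXTENSIONS = {
    "alto": ".xml",
    "hocr": ".hocr",
    "pdf": ".pdf",
    "tsv": ".tsv",
    "txt": ".txt",
    "lstm.train": ".lstmf",
    "makebox": ".box",
}


def expected_outputs(outputbase, configs=()):
    """Return the output file names implied by the given config names."""
    names = [outputbase + OUTPUT_EXTENSIONS[c]
             for c in configs if c in OUTPUT_EXTENSIONS]
    # With no output-selecting config, the default is a .txt file.
    return names or [outputbase + ".txt"]


print(expected_outputs("out", ["pdf", "hocr"]))
print(expected_outputs("out", ["quiet"]))
```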
SINGLE OPTIONS
-h, --help
Show help message.
--help-extra
Show extra help for advanced users.
--help-psm
Show page segmentation modes.
--help-oem
Show OCR Engine modes.
-v, --version
Returns the current version of the
executable.
--list-langs
List available languages for tesseract engine.
Can be used with --tessdata-dir PATH.
--print-parameters
Print tesseract parameters.
LANGUAGES AND SCRIPTS
To recognize some text with Tesseract, it is normally necessary to specify the language(s) or script(s) of the text (unless it is English text which is supported by default) using -l LANG or -l SCRIPT.
CONFIG FILES AND AUGMENTING WITH USER DATA
Tesseract config files consist of lines with parameter-value pairs (space separated). The parameters are documented as flags in the source code, for example in tesseractclass.h.
Example user words file (one word per line):
    the
    quick
    brown
    fox
    jumped
Example user patterns file (one pattern per line):
    1-\d\d\d-GOOG-411
    www.\n\\\*.com
Example config file:
    load_system_dawg F
    load_freq_dawg F
    user_words_suffix user-words
    user_patterns_suffix user-patterns
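As a sketch, the sample user words, user patterns, and config parameters shown above can be written to files from Python. The file names (words.txt, patterns.txt, userdata.config), the image name scan.png, and the helper names are hypothetical. The command uses the --user-words and --user-patterns options from the OPTIONS section; how the *_suffix parameters locate their files is not covered here.

```python
import tempfile
from pathlib import Path

USER_WORDS = ["the", "quick", "brown", "fox", "jumped"]
# Raw strings keep the backslashes of the patterns exactly as written.
USER_PATTERNS = [r"1-\d\d\d-GOOG-411", r"www.\n\\\*.com"]
CONFIG = {
    "load_system_dawg": "F",
    "load_freq_dawg": "F",
    "user_words_suffix": "user-words",
    "user_patterns_suffix": "user-patterns",
}


def write_lines(path, lines):
    """Write each entry on its own line and return the path."""
    path.write_text("".join(f"{line}\n" for line in lines), encoding="utf-8")
    return path


def config_lines(params):
    """Render parameters one per line, a space between name and value."""
    return [f"{name} {value}" for name, value in params.items()]


workdir = Path(tempfile.mkdtemp())
words_file = write_lines(workdir / "words.txt", USER_WORDS)
patterns_file = write_lines(workdir / "patterns.txt", USER_PATTERNS)
config_file = write_lines(workdir / "userdata.config", config_lines(CONFIG))

# Argument list passing the user data files directly as options.
command = ["tesseract", "scan.png", "out",
           "--user-words", str(words_file),
           "--user-patterns", str(patterns_file)]
print(" ".join(command))
```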
ENVIRONMENT VARIABLES
TESSDATA_PREFIX
If TESSDATA_PREFIX is set to a path, then
that path is used to find the tessdata directory with language and script
recognition models and config files. Using --tessdata-dir PATH
is the recommended alternative.
OMP_THREAD_LIMIT
If the tesseract executable was built with
multithreading support, it will normally use four CPU cores for the OCR
process. While this can be faster for a single image, it performs poorly
if the host computer provides fewer than four CPU cores or if OCR is run on
many images. Only a single CPU core is used with OMP_THREAD_LIMIT=1.
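When tesseract is launched from a script, the variable can be set for the child process only. The Python sketch below shows one way to do that. The helper names single_core_env and run_single_core are hypothetical, and run_single_core is defined but not called, since it needs tesseract and an input image to be present.

```python
import os
import subprocess


def single_core_env(base_env=None):
    """Copy an environment and limit tesseract to a single CPU core."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def run_single_core(command):
    """Run a tesseract argument list with OMP_THREAD_LIMIT=1."""
    return subprocess.run(command, env=single_core_env(), check=True)


# Demonstration on a small stand-in environment.
env = single_core_env({"PATH": "/usr/bin"})
print(env["OMP_THREAD_LIMIT"])
```

Because the caller's own environment is left unchanged, several such single-core processes can be started side by side when many images are processed.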
HISTORY
The engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. A lot of the code was written in C, and then some more was written in C++. The C++ code makes heavy use of a list system using macros. This predates STL, was portable before STL, and is more efficient than STL lists, but has the big negative that if you do get a segmentation violation, it is hard to debug.
RESOURCES
Main web site: https://github.com/tesseract-ocr
User forum: https://groups.google.com/g/tesseract-ocr
Documentation: https://tesseract-ocr.github.io/
Information on training: https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html
SEE ALSO
ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1), shape_training(1), mftraining(1), unicharambigs(5), unicharset(5), unicharset_extractor(1), wordlist2dawg(1)
AUTHOR
Tesseract development was led at Hewlett-Packard and Google by Ray Smith. The development team has included:
COPYING
Licensed under the Apache License, Version 2.0
01/11/2023