unicharambigs - Tesseract unicharset ambiguities
The unicharambigs file (a component of traineddata, see
combine_tessdata(1) ) is
used by Tesseract to represent possible ambiguities between characters, or
groups of characters.
The file contains a number of lines, laid out as follow:
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
Field one |
the number of characters contained in field two |
Field two |
the character sequence to be replaced |
Field three |
the number of characters contained in field four |
Field four |
the character sequence used to replace field two |
Field five |
contains either 1 or 0. 1 denotes a mandatory replacement, 0 denotes an
optional replacement. |
Characters appearing in fields two and four should appear in unicharset. The
numbers in fields one and three refer to the number of unichars (not bytes).
v1
2 ' ' 1 " 1
1 m 2 r n 0
3 i i i 1 m 0
The first line is a version identifier. In this example, all instances of the
2 character sequence
'' will
always be replaced by the
1 character sequence
"; a
1 character sequence
m may be replaced by the
2 character sequence
rn,
and the
3 character sequence
may be replaced by the
1
character sequence
m.
Version 3.03 and on supports a new, simpler format for the unicharambigs file:
In this format, the "error" and "correction" are simple
UTF-8 strings separated by a space, and, after another space, the same type
specifier as v1 (0 for optional and 1 for mandatory substitution). Note the
downside of this simpler format is that Tesseract has to encode the UTF-8
strings into the components of the unicharset. In complex scripts, this
encoding may be ambiguous. In this case, the encoding is chosen such as to use
the least UTF-8 characters for each component, ie the shortest unicharset
components will make up the encoding.
The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
similar format, called DangAmbigs (
dangerous ambiguities) was used:
the format was almost identical, except only mandatory replacements could be
specified, and field 5 was absent.
This is a documentation "bug": it’s not currently clear what
should be done in the case of ligatures (such as
fi) which may also
appear as regular letters in the unicharset.
tesseract(1),
unicharset(5)
https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05.html#the-unicharambigs-file
The Tesseract OCR engine was written by Ray Smith and his research groups at
Hewlett Packard (1985-1995) and Google (2006-present).