apertium - machine translation application platform
apertium [-au] [-d datadir] [-f format] language-pair [infile [outfile]]
apertium is the application that most people will use, since it simplifies
the use of the apertium and lt-toolbox tools for machine translation.
It provides a single front end to lt-toolbox (which contains all the lexical
processing modules and tools) and apertium (which contains the rest of the
engine).
The modules behind the apertium machine translation architecture are, in
order:
- de-formatter
- Separates the text to be translated from the format
information.
- morphological-analyser
- Tokenizes the text into surface forms.
- part-of-speech tagger
- Chooses one surface form among homographs.
- lexical transfer module
- Reads each source-language lexical form and delivers a
corresponding target-language lexical form.
- structural transfer module
- Detects fixed-length patterns of lexical forms (chunks or
phrases) needing special processing due to grammatical divergences between
the two languages and performs the corresponding transformations.
- morphological generator
- Delivers a target-language surface form for each
target-language lexical form, by suitably inflecting it.
- post-generator
- Performs orthographical operations such as contractions and
apostrophations.
- re-formatter
- Restores the format information encapsulated by the
de-formatter into the translated text and removes the encapsulation
sequences used to protect certain characters in the source text.
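The ordering above can be pictured as a chain of text filters, each reading the output of the one before it. The following is only a toy sketch: `stage` and `run_pipeline` are hypothetical names, and each stage merely appends its own name to the line instead of doing any linguistic processing.

```shell
# Toy stand-in for one module: append the module name to every input line.
# No real analysis, transfer or generation happens here.
stage() {
    sed "s/\$/ -> $1/"
}

# Chain the stand-ins in the order the modules are listed above.
run_pipeline() {
    stage "de-formatter" |
    stage "morphological-analyser" |
    stage "part-of-speech tagger" |
    stage "lexical transfer" |
    stage "structural transfer" |
    stage "morphological generator" |
    stage "post-generator" |
    stage "re-formatter"
}
```

Running `printf 'text\n' | run_pipeline` prints the input followed by the stage names in the order they were applied.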
- -d datadir
- The directory holding the linguistic data. By default it
will use the expected installation path.
- language-pair
- The language pair, written LANG1-LANG2 (for instance “es-ca” or “ca-es”).
- -f format
- Specifies the format of the input and output files, which
can have these values:
- txt
- (default value) Input and output files are in text format.
- html
- Input and output files are in “html”
format. This “html” is the one accepted by the vast
majority of web browsers.
- html-noent
- Input and output files are in “html”
format, but preserving native encoding characters rather than using
HTML text entities.
- rtf
- Input and output files are in “rtf”
format. The accepted “rtf” is the one generated by
Microsoft WordPad and Microsoft Office up to and including Office
97.
- -u
- Disable marking of unknown words with the ‘*’ character.
- -H
- Enable header detection (only used in some language pairs;
will lead to stray ‘❡’ characters in pairs that don't support it).
- -a
- Enable marking of disambiguated words with the ‘=’ character.
- -m memory.tmx
- Use a translation memory to recycle translations.
- -o direction
- Translation direction to use with the translation memory; by
default, the direction given as language-pair is used.
- -l
- List the available translation directions and exit.
- direction
- Typically LANG1-LANG2, but see modes.xml in the language data.
These are the two files that can be used with this command:
- infile
- Input file (stdin by default).
- outfile
- Output file (stdout by default).
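As an illustration of how the options and file arguments combine, here is a minimal sketch of two wrapper functions. The function names are hypothetical, and the sketch assumes the requested language pair (for instance “es-ca”) is installed; it uses only the -u and -f options and the infile/outfile arguments described above.

```shell
# Hypothetical helper: translate an HTML file into another file.
# -f html selects the HTML format; -u disables the '*' marks on unknown words.
# Arguments: language pair, input file, output file.
translate_html() {
    apertium -u -f html "$1" "$2" "$3"
}

# Hypothetical helper: translate one line of plain text.
# With no infile or outfile given, apertium reads stdin and writes stdout,
# so it can sit in the middle of a pipe.
# Arguments: language pair, text.
translate_line() {
    printf '%s\n' "$2" | apertium "$1"
}
```

Typical calls would look like `translate_html es-ca page.html page.ca.html` or `translate_line es-ca "Hola"`, where the file names are placeholders.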
apertium-tagger(1),
lt-comp(1),
lt-expand(1),
lt-proc(1)
Copyright © 2005, 2006 Universitat d'Alacant / Universidad de Alicante.
This is free software. You may redistribute copies of it under the terms of
the GNU
General Public License.
Bugs: many... lurking in the dark and waiting for you!