apertium

NAME

apertium — machine translation application platform

SYNOPSIS

apertium [-au] [-d datadir] [-f format] language-pair [infile [outfile]]

DESCRIPTION

apertium is the application most people will use, since it simplifies the use of the apertium and lt-toolbox tools for machine translation.

It provides a single front-end for the end user to lt-toolbox (which contains all the lexical processing modules and tools) and apertium (which contains the rest of the engine).

The modules that make up the apertium machine translation architecture are, in order:

de-formatter: Separates the text to be translated from the format information.
morphological analyser: Tokenizes the text into surface forms.
part-of-speech tagger: Chooses one of the possible analyses of each homograph.
lexical transfer module: Reads each source-language lexical form and delivers a corresponding target-language lexical form.
structural transfer module: Detects fixed-length patterns of lexical forms (chunks or phrases) needing special processing due to grammatical divergences between the two languages and performs the corresponding transformations.
morphological generator: Delivers a target-language surface form for each target-language lexical form, by suitably inflecting it.
post-generator: Performs orthographical operations such as contractions and apostrophations.
re-formatter: Restores the format information encapsulated by the de-formatter into the translated text and removes the encapsulation sequences used to protect certain characters in the source text.
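
The ordering above can be pictured as a chain in which each module consumes the output of the previous one. The following is only a toy sketch of that chaining in Python: the stage functions are placeholders that pass the text through unchanged and record their name, and none of the names below (make_tracing_stage, run_pipeline) belong to apertium or lt-toolbox.

```python
from functools import reduce
from typing import Callable, List

# Stage names in the order listed above.
STAGES: List[str] = [
    "de-formatter",
    "morphological analyser",
    "part-of-speech tagger",
    "lexical transfer module",
    "structural transfer module",
    "morphological generator",
    "post-generator",
    "re-formatter",
]

Stage = Callable[[str], str]


def make_tracing_stage(name: str, trace: List[str]) -> Stage:
    """Return a placeholder stage that records its name and passes text through."""
    def stage(text: str) -> str:
        trace.append(name)
        return text  # a real module would transform the text here
    return stage


def run_pipeline(text: str, stages: List[Stage]) -> str:
    """Feed the output of each stage into the next, left to right."""
    return reduce(lambda acc, stage: stage(acc), stages, text)


trace: List[str] = []
pipeline = [make_tracing_stage(name, trace) for name in STAGES]
result = run_pipeline("Hola mundo", pipeline)
```

After the call, trace holds the stage names in the order they ran, which matches the order of the list above.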

OPTIONS

-d datadir

The directory holding the linguistic data. By default it will use the expected installation path.

language-pair

The language pair, written LANG1-LANG2 (for instance “es-ca” or “ca-es”).

-f format

Specifies the format of the input and output files. It can take one of these values:

txt: (default value) Input and output files are in text format.
html: Input and output files are in “html” format. This “html” is the one accepted by the vast majority of web browsers.
html-noent: Input and output files are in “html” format, but preserving native encoding characters rather than using HTML text entities.
rtf: Input and output files are in “rtf” format. The accepted “rtf” is the one generated by Microsoft WordPad and Microsoft Office up to and including Office 97.
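
A wrapper script might want to pick the -f value from a file name. The sketch below is a hypothetical helper, not part of apertium: the suffix table and the name guess_format are assumptions made for illustration, and only format values listed above are returned.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to a -f value listed above.
SUFFIX_TO_FORMAT = {
    ".txt": "txt",
    ".html": "html",
    ".htm": "html",
    ".rtf": "rtf",
}


def guess_format(filename: str, default: str = "txt") -> str:
    """Return a -f value for filename, falling back to the default (txt)."""
    suffix = Path(filename).suffix.lower()
    return SUFFIX_TO_FORMAT.get(suffix, default)
```

Falling back to txt mirrors the default stated above; html-noent is never guessed, because the choice between html and html-noent depends on the desired output rather than on the file name.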

-u

Disable marking of unknown words with the ‘*’ character.

-H

Enable header-detection (only used in some language pairs; will lead to stray ‘❡’ characters in pairs that don't support it).

-a

Enable marking of disambiguated words with the ‘=’ character.
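
When unknown-word marking is left on (that is, -u is not given), the marks can be post-processed. The sketch below assumes that each unknown word appears in the output as ‘*’ immediately followed by the word; the function names and the sample sentence are made up for illustration.

```python
import re
from typing import List

# Assumption: an unknown word is written as '*' directly followed by the word.
UNKNOWN = re.compile(r"\*(\w+)")


def unknown_words(translated: str) -> List[str]:
    """Return the words that carry an unknown-word mark, in order."""
    return UNKNOWN.findall(translated)


def strip_unknown_marks(translated: str) -> str:
    """Remove the '*' marks while keeping the words themselves."""
    return UNKNOWN.sub(r"\1", translated)


# Illustrative output line, not produced by a real run.
sample = "El *foobar és aquí i *baz també"
found = unknown_words(sample)
clean = strip_unknown_marks(sample)
```

Collecting the marked words this way can help when deciding which entries are missing from a dictionary, while the stripped text is closer to what -u would have produced.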

-m memory.tmx

Use a translation memory to recycle translations.

-o direction

Translation direction to use with the translation memory; by default, “direction” is used.

-l

List the available translation directions and exit.

direction

Typically LANG1-LANG2, but see modes.xml in the language data.

FILES

These are the two files that can be used with this command:

infile: Input file (stdin by default).
outfile: Output file (stdout by default).
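
The synopsis and the file arguments can be combined into a small helper that assembles an argument list. This is a sketch under the assumption that the options are passed exactly as written in the SYNOPSIS; build_command and its parameter names are hypothetical and only the -a, -u, -d and -f options are covered.

```python
from typing import List, Optional


def build_command(
    language_pair: str,
    infile: Optional[str] = None,
    outfile: Optional[str] = None,
    datadir: Optional[str] = None,
    fmt: Optional[str] = None,
    mark_unknown: bool = True,
    mark_disambiguated: bool = False,
) -> List[str]:
    """Assemble an argument list following the SYNOPSIS order."""
    # The synopsis nests outfile inside infile: [infile [outfile]].
    if outfile is not None and infile is None:
        raise ValueError("outfile requires infile")
    argv = ["apertium"]
    if mark_disambiguated:
        argv.append("-a")
    if not mark_unknown:
        argv.append("-u")
    if datadir is not None:
        argv += ["-d", datadir]
    if fmt is not None:
        argv += ["-f", fmt]
    argv.append(language_pair)
    if infile is not None:
        argv.append(infile)
        if outfile is not None:
            argv.append(outfile)
    return argv
```

The resulting list can be handed to subprocess.run(argv, check=True). When infile and outfile are omitted, the command reads stdin and writes stdout, so text can be supplied through the input argument of subprocess.run instead.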

COPYRIGHT

BUGS

Many... lurking in the dark and waiting for you!

March 8, 2006

Apertium