The faculty of language is one of the most important
features distinguishing human beings from other creatures, and the
human languages which have evolved over thousands of years are among
men's greatest cultural achievements. They are organic structures
which are so highly flexible that they are often thought to be chaotic.
How can it be that a computer program can cope with these organic
structures, the languages, understand them and even translate from
one into another? Without entering into philosophical considerations,
it is fair to say that a computer program understands as
little of languages as it does of the orbits of satellites which
it can accurately compute, or of chess, even if it is able to beat
Kasparov.
Translation programs apply rules and knowledge with which their
developers are trying to model language. Sometimes such rules are
found by statistically analyzing huge amounts of text data, but
in any case the aim is to imitate the behavior of a human translator.
Since languages are so highly complex, nobody has yet been able
to model the functioning of languages completely and accurately.
This becomes evident when translation programs make mistakes or
even break down.
The main difficulty which translation programs have to struggle
with is the ambiguity of many linguistic utterances, of single words
and also whole sentences. A large portion of the rules in translation
systems describe which meaning is required under which conditions.
This can be illustrated by examples such as the following ones:
Der Kurs findet statt. (The course takes place.)
Der Kurs fällt. (The rate is falling.)
Briefträger beißen Hunde selten. (Dogs seldom bite postmen. / Postmen seldom bite dogs.)
The first example shows different readings of the German word Kurs,
which are disambiguated by the context; the second example shows
an ambiguous sentence structure (typical of German) - it is not
clear whether Briefträger is the subject or the object of the sentence.
In spite of all difficulties, machine translation, which has been
worked on since the early days of computers in the 1940s, has made
enough progress that it has now become a big help when dealing with
foreign-language texts. How translation programs work is sketched
here.
Translation in seven steps
We describe here the translation of written texts or documents,
not the interpretation of spoken utterances. The transfer of spoken
to written language, and the synthesis of spoken language from written
texts are topics in their own right which can be treated separately.
1. Segmenting documents into words, sentences and formatting information
The basic elements of translation programs are words and rules
for combining them to form sentences, paragraphs and complete texts.
Every document to be translated first needs to be decomposed into
words, numbers and punctuation marks. Since the layout of the translation
in most cases should look just like the original, this information
must also be recognized so it can be inserted into the translation
at the proper places.
Since the rules of combining - the grammatical rules - address
sentences, sentence boundaries also need to be determined. Unfortunately,
this is less easy than it may appear at first sight. A period may
mark the end of a sentence, an abbreviation, or a German ordinal number;
it may also be a decimal point or part of an e-mail or internet address.
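The period problem can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any particular translation system works: the abbreviation list is hypothetical and tiny, and digits followed by a period in mid-text are always read as a German ordinal, which a real system would have to check more carefully.

```python
import re

# Hypothetical abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"Dr.", "Nr.", "z.B.", "usw.", "e.g.", "i.e."}

# Digits followed by a period, e.g. the German ordinal in "3. Mai".
ORDINAL = re.compile(r"\d+\.")


def split_sentences(text):
    """Split text into sentences at periods that really end a sentence."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        # Decimal points ("3.5") and addresses ("info@example.com") carry
        # the period inside the token, so they are skipped right here.
        if not token.endswith((".", "!", "?")):
            continue
        if token in ABBREVIATIONS:
            continue
        is_last = i == len(tokens) - 1
        # Simplification: digits plus period in mid-text count as an ordinal.
        if ORDINAL.fullmatch(token) and not is_last:
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

Applied to "Dr. Meier kommt am 3. Mai. Der Kurs kostet 3.5 Euro." the function returns two sentences, because the periods after Dr and 3 and the decimal point in 3.5 are not treated as boundaries.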
2. Reduction of word forms to their canonical form and dictionary lookup
Every translation program needs a dictionary. Here all information
is stored which is necessary for the analysis of sentences and their
translation, e.g. part of speech, gender, or semantic classification.
In principle, each possible form of a word could be put into the
dictionary, e.g. German schlafen, schlafe, schläfst, schläft,
schlaft, schlief, ... Often this is not done; instead, a so-called
morphological decomposition is preferred, in which the different word
forms are reduced to a canonical form - the headword in conventional
dictionaries. This form is then used to do the dictionary lookup,
and the word form at hand is assigned its corresponding grammatical
information. For example, schläfst is reduced to schlafen and
marked as 2nd person singular present.
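As a rough sketch of this reduction, the following Python fragment strips an ending, maps the remaining stem to its canonical form, and then looks that form up. All tables are hypothetical toy data covering only some present-tense forms of schlafen; real morphological analyzers are far more elaborate.

```python
# Hypothetical stem table: stem variant -> canonical form.
STEMS = {"schlaf": "schlafen", "schläf": "schlafen"}

# Hypothetical ending table (present tense only), longest endings first.
ENDINGS = [
    ("st", "2nd person singular present"),
    ("en", "infinitive or 1st/3rd person plural present"),
    ("e", "1st person singular present"),
    ("t", "3rd person singular or 2nd person plural present"),
]

# Hypothetical dictionary keyed by canonical form.
DICTIONARY = {"schlafen": {"pos": "verb", "translation": "sleep"}}


def analyze(word_form):
    """Return (canonical form, grammatical information, dictionary entry),
    or None if the word form cannot be analyzed with the toy tables."""
    for ending, features in ENDINGS:
        if word_form.endswith(ending):
            stem = word_form[: -len(ending)]
            if stem in STEMS:
                lemma = STEMS[stem]
                return lemma, features, DICTIONARY[lemma]
    return None
```

With these tables, analyze("schläfst") yields the canonical form schlafen together with "2nd person singular present" and the dictionary entry, while an unknown form yields None.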
3. Recognizing sentential structures
In the beginning many researchers believed that they could obtain reasonable
translations by having a program translate word by word. It became
clear very quickly that this was an illusion, because firstly, languages
differ very much in word order, and secondly, many words can have
more than one meaning of which only one is valid in a given sentence.
The results were completely unintelligible sequences of alternative
word translations which nobody could use.
So, a translation program must "know" grammar. Each word
and each phrase must be assigned its role in the sentence, and it
must be determined as precisely as possible which combinations are
probable, possible, or excluded. The precision of these rules is decisive
for translation quality.
The meaning of words depends not only on the context within a sentence;
relationships between sentences are important as well. The use
of pronouns such as German er, sie, es can make the interpretation
of a sentence more difficult. For example, how should the word einstellen
be translated in the sentence
Das Unternehmen stellt sie ein.
Is it hire, adjust, stop or something else entirely? This depends
on whether sie refers to a person, a machine or the production
of something. If that is not known, neither a human being nor a
program is able to produce a reasonable translation for this sentence.
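The dependency on the preceding sentence can be illustrated with a small Python sketch. It is a deliberately naive assumption-laden example: the noun table and its semantic classes are hypothetical, sie is always read as feminine singular (it can also be plural), and the referent is simply taken to be the last feminine noun of the previous sentence.

```python
# Hypothetical noun table with gender and semantic class.
NOUNS = {
    "Ingenieurin": {"gender": "f", "class": "person"},
    "Maschine": {"gender": "f", "class": "machine"},
    "Produktion": {"gender": "f", "class": "activity"},
    "Unternehmen": {"gender": "n", "class": "organization"},
}

# Hypothetical translations of "einstellen" by class of its object.
EINSTELLEN = {"person": "hire", "machine": "adjust", "activity": "stop"}


def resolve_sie(previous_sentence):
    """Return the semantic class of the last feminine noun in the
    preceding sentence, or None if there is no candidate."""
    for word in reversed(previous_sentence.rstrip(".").split()):
        entry = NOUNS.get(word)
        if entry and entry["gender"] == "f":
            return entry["class"]
    return None


def translate_einstellen(previous_sentence):
    """Pick a translation for 'einstellen' in 'Das Unternehmen stellt
    sie ein.'; None means the referent of 'sie' is unknown."""
    return EINSTELLEN.get(resolve_sie(previous_sentence))
```

If the preceding sentence is "Die Ingenieurin hat sich beworben." the sketch selects hire; after "Die Produktion ist zu teuer." it selects stop; without a usable antecedent it returns None, mirroring the case in which no reasonable translation can be produced.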
4. Assigning translations to single words
Each word and many word groups are associated with one or more
translations in the dictionary. Once the contexts of the words are
known after grammatical analysis, the appropriate translations
can be selected.
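A minimal Python sketch of such a selection, using the word Kurs from the introductory examples, might look as follows. The condition format (a set of governing verbs per translation) and the verb lists are assumptions made for illustration; actual dictionaries use much richer conditions.

```python
# Hypothetical dictionary entry for "Kurs": each translation is tied to
# the canonical forms of verbs that typically select that reading.
KURS_TRANSLATIONS = [
    ({"stattfinden", "besuchen"}, "course"),
    ({"fallen", "steigen"}, "rate"),
]


def translate_kurs(governing_verb, default="course"):
    """Select a translation of 'Kurs' from the verb it depends on,
    falling back to a default when no condition matches."""
    for verbs, translation in KURS_TRANSLATIONS:
        if governing_verb in verbs:
            return translation
    return default
```

For "Der Kurs findet statt." the governing verb stattfinden selects course, and for "Der Kurs fällt." the verb fallen selects rate.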
5. Generating the structure of target sentences
Starting from the structure of the source sentence and the word
translations selected, the structure of the target sentence is built
up. It can be quite different from the original. Thus
John grows a beard.
becomes
John lässt sich einen Bart wachsen.
because the word wachsen in German is not transitive, and
therefore an additional verb - lassen - is required as a
kind of intermediary.
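The restructuring in this example can be sketched as a single transfer rule in Python. The representation of the source sentence as a small dictionary, the one-entry noun lexicon and the rule itself are hypothetical simplifications; the output is a list of canonical forms in German word order, not yet inflected.

```python
# Hypothetical toy lexicon for nouns.
NOUNS = {"beard": "Bart"}


def transfer(source):
    """Turn a simple English subject-verb-object structure into a list of
    German canonical forms in target word order."""
    subject = source["subject"]
    verb = source["verb"]
    obj = source.get("object")
    if verb == "grow" and obj is not None:
        # "wachsen" is intransitive, so "lassen" plus the reflexive
        # "sich" is added and "wachsen" moves to the end as infinitive.
        return [subject, "lass", "sich", "ein", NOUNS[obj], "wachs"]
    raise NotImplementedError(f"no transfer rule for verb {verb!r}")
```

Called with the structure of "John grows a beard", the rule returns John, lass, sich, ein, Bart, wachs, which still has to be turned into proper word forms.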
6. Generating word forms
During the generation of the correct word order for the target
sentence, translation programs usually work with canonical forms
or word stems. Only after the structure is established do forms such
as lass, ein and wachs of the previous example become
lässt, einen and wachsen.
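As an illustration only, this last step can be sketched in Python with a lookup table from stem and grammatical features to the finished word form. The feature labels and the table are hypothetical; a real generator would apply inflection rules rather than list every form.

```python
# Hypothetical form table: (stem, features) -> inflected word form.
FORMS = {
    ("lass", "3sg-present"): "lässt",
    ("ein", "masc-acc-sg"): "einen",
    ("wachs", "infinitive"): "wachsen",
}


def generate(units):
    """Inflect a sequence of (stem, features) pairs; units without a
    matching table entry are passed through unchanged."""
    return " ".join(FORMS.get((stem, features), stem) for stem, features in units)


sentence = generate([
    ("John", None),
    ("lass", "3sg-present"),
    ("sich", None),
    ("ein", "masc-acc-sg"),
    ("Bart", None),
    ("wachs", "infinitive"),
])
```

Here sentence ends up as "John lässt sich einen Bart wachsen".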
7. Adding layout information
The layout information which was taken out in the first step must
now be added to the translations so that in the end there is a
new text which looks almost like the original. One note may be in
order here: some formatting information such as bold face must be
available even during the translation process, since the corresponding
translations should appear in bold as well.
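How bold face can travel with the text may be sketched as follows in Python. The `<b>...</b>` markup, the phrase dictionary and the assumption that formatted and unformatted segments can be translated independently are all hypothetical simplifications; since word order often changes in translation, a real system has to attach the formatting to words or phrases inside the sentence structure.

```python
import re

# Hypothetical phrase dictionary standing in for the whole translation step.
TOY_DICTIONARY = {"Der Kurs": "The course", "findet statt": "takes place"}

# Capturing group, so that re.split keeps the bold segments in the result.
BOLD = re.compile(r"(<b>.*?</b>)")


def translate_with_bold(text):
    """Translate segment by segment and re-wrap bold segments in <b> tags."""
    out = []
    for part in BOLD.split(text):
        bold = part.startswith("<b>") and part.endswith("</b>")
        core = part[3:-4] if bold else part
        phrase = core.strip()
        if phrase:
            # Replace only the phrase so surrounding spaces are preserved.
            core = core.replace(phrase, TOY_DICTIONARY.get(phrase, phrase))
        out.append(f"<b>{core}</b>" if bold else core)
    return "".join(out)
```

For the input "<b>Der Kurs</b> findet statt" the sketch produces "<b>The course</b> takes place", so the translation of the bold phrase is bold as well.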