1. Machine translation
Machine translation of natural languages, known as MT, is a subfield of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. While semi-automated tools were until recently applauded as the most realistic path to follow, this is no longer the case.
The current consensus is that fully automated, efficient translation tools should remain the primary goal. The nature of the users of such systems and the types of text involved leave little room for continued dependence on human aids [Salem, Y. 2008]. Machine translation was one of the earliest applications suggested for digital computers, but turning this dream into reality has proved a much harder, and in many ways a much more interesting, task than it first appeared.
1.1. Machine Translation Approaches
Machine translation systems are developed using three main approaches, which differ in their difficulty and complexity: Rule-Based Machine Translation (RBMT), Corpus-Based Machine Translation (CMT) and Hybrid Machine Translation.
1.1.1. Rule-Based Machine Translation (RBMT)
Systems that use the rule-based approach perform transfer using a core of solid linguistic knowledge. The linguistic knowledge acquired for one natural language processing system may be reused to build the knowledge required for a similar task in another system. The advantage of the rule-based approach over the corpus-based approach is clear for:
1) less-resourced languages, for which large corpora, possibly parallel or bilingual, with representative structures and entities are neither available nor easily affordable;
2) morphologically rich languages, which suffer from data sparseness even when corpora are available. These considerations have motivated many researchers to fully or partially follow the rule-based approach in developing their Arabic natural language processing tools and systems. In this work we address successful efforts that involved the rule-based approach for different Arabic natural language processing tasks.
The core process is mediated by bilingual dictionaries and rules for converting SL structures into TL structures, and/or by dictionaries and rules for deriving ‘intermediary representations’ from which output can be generated. The preceding stage of analysis interprets (surface) input SL strings into appropriate ‘translation units’ (e.g. canonical noun and verb forms) and relations (e.g. dependencies and syntactic units). The succeeding stage of synthesis (or generation) derives TL texts from the TL structures or representations produced by the core ‘transfer’ (or ‘interlingua’) process.
Figure 2.1 Rule Based Machine Translation Schema
· RBMT problems
- complexity of grammar rules: unpredictable rule interactions, incomplete coverage, multi-level representation (artificial distinctions? failure at one stage means no result)
- complexity of dictionaries: incomplete coverage of meanings, selection restrictions
- collocations, phrasal verbs, verb/noun phrases, etc.
- complex structures: complex tree transduction rules; long sentences, embeddings, discontinuities, coordination
- pronouns, anaphora, ‘discourse’
- semantic problems: overcome (to some extent) by the use of knowledge bases (in KBMT), but knowledge bases are hugely complex
- many difficulties are overcome or minimized in domain-restricted and/or controlled-language systems
Rule-based machine translation approaches can be classified into the following categories: direct machine translation, interlingua machine translation and transfer based machine translation.
1.1.1.1. Direct Translation
In direct translation, we proceed word by word through the source language text, translating each word as we go. We make use of no intermediate structures, except for shallow morphological analysis; each source word is directly mapped onto some target word. Direct translation is thus based on a large bilingual dictionary; each entry in the dictionary can be viewed as a small program whose job is to translate one word. After the words are translated, simple reordering rules can apply [Jurafsky, D. 2007].
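This word-by-word scheme can be sketched in a few lines; the dictionary entries and the single reordering rule below are invented for illustration (note how the toy output also exposes the approach's weakness, here a wrong gender agreement):

```python
# Minimal sketch of direct translation: a hypothetical bilingual
# dictionary maps each source word to one target word, after which a
# simple reordering rule applies.
BILINGUAL_DICT = {  # hypothetical English-Spanish entries
    "the": "el",
    "green": "verde",
    "house": "casa",
}

def translate_direct(sentence):
    """Translate word by word, then apply a toy adjective-noun reordering rule."""
    words = [BILINGUAL_DICT.get(w, w) for w in sentence.lower().split()]
    # Toy reordering rule: Spanish places adjectives after nouns, so swap
    # the translated adjective and noun (hard-coded for this example pair).
    for i in range(len(words) - 1):
        if words[i] == "verde" and words[i + 1] == "casa":
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

print(translate_direct("the green house"))  # el casa verde
```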
Figure 2.2 Direct machine translation [Jurafsky, D. 2009]
1.1.1.2. Transfer Based Machine Translation
The idea behind transfer-based MT is that, to produce a translation, it is necessary to have an intermediate representation that captures the "meaning" of the original sentence in order to generate the correct translation. In interlingua-based MT this intermediate representation must be independent of the languages in question, whereas in transfer-based MT it has some dependence on the language pair involved.
The way in which transfer-based machine translation systems work varies substantially, but in general they follow the same pattern: they apply sets of linguistic rules which are defined as correspondences between the structure of the source language and that of the target language. The first stage involves analysing the input text for morphology and syntax (and sometimes semantics) to create an internal representation. The translation is generated from this representation using both bilingual dictionaries and grammatical rules.
It is possible with this translation strategy to obtain fairly high-quality translations, with accuracy in the region of 90%, although this is highly dependent on the language pair in question, for example on the distance between the two languages.
1.1.1.3. Interlingua Machine Translation
In this approach, the source language text is transformed into an interlingua. The target language is then generated from the interlingua. Within the rule-based machine translation paradigm, the interlingua approach is an alternative to the direct approach and the transfer approach.
In the direct approach, words are translated directly without passing through an additional representation. In the transfer approach the source language is transformed into an abstract, less language-specific representation. Linguistic rules which are specific to the language pair then transform the source language representation into an abstract target language representation and from this the target sentence is generated.
An interlingua must represent all the information which a sentence expresses. A sentence generally has various kinds of meaning. These meanings are conveyed by the words themselves, their combination, tense, aspect, mood and sentence style, but how these things are expressed, of course, varies depending on the language. An interlingua must represent all these kinds of information in a universal way [Hiroshi, U.].
Figure 2.3 Interlingua Machine Translation
- Advantages: the interlingua approach requires fewer components to relate each source language to each target language, takes fewer components to add a new language, supports paraphrases of the input in the original language, allows both the analyzers and generators to be written by monolingual system developers, and handles languages that are very different from each other (e.g. English and Arabic).
The second claimed advantage of interlingua-based systems is that an intermediate language-neutral representation of meaning should be able to provide a neutral basis of comparison for equivalent texts that differ syntactically but share the same meaning. This would be of great use in related fields such as information retrieval, where the current state of the art relies largely on syntactic matching for the gathering of relevant information. If natural language could be easily transformed into a semantic-based interlingua, our ability to search for and find information could be dramatically improved [Lampert, A. 2004].
- Disadvantage: the definition of an interlingua is difficult, and maybe even impossible, for a wider domain. The ideal context for interlingua machine translation is thus multilingual machine translation in a very specific domain.
1.1.2. Corpus-Based Machine Translation
The three classic architectures for MT (direct, transfer, and interlingua) all provide answers to the questions of what representations to use and what steps to perform to translate. But we know that true translation, which is both faithful to the source language and natural as an utterance in the target language, is sometimes impossible. If a translation is to be produced anyway, some compromise is required. This is exactly what translators do in practice: they produce translations that do tolerably well on both criteria.
In corpus-based machine translation we store relations between words. For the language model, we count how many times a word follows a given sequence of N preceding words in the same language, and use these counts to determine sentence order or to predict the next word. For the translation model, we store the relation between the two languages: for each word in a large table of first-language words, we count how many times it is translated into each specific word of the second language.
The decoder uses this data to translate each word into a word of the target language and to reorder the translated words into the most probable order according to the language model.
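The two count tables described above can be sketched as follows; the tiny monolingual and parallel corpora here are invented toy data:

```python
from collections import defaultdict

# Sketch of the two count tables used in corpus-based MT.
# Language model counts: how often one word follows another (bigrams).
# Translation model counts: how often a source word is translated to a
# given target word in a word-aligned parallel corpus.
bigram_counts = defaultdict(int)
translation_counts = defaultdict(int)

monolingual = ["the cat sat", "the dog sat"]            # toy TL corpus
parallel = [("cat", "قط"), ("dog", "كلب"), ("cat", "قط")]  # toy aligned pairs

for sentence in monolingual:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[(prev, nxt)] += 1

for src, tgt in parallel:
    translation_counts[(src, tgt)] += 1

# A decoder would prefer the most frequent translation and word order.
print(bigram_counts[("the", "cat")])      # 1
print(translation_counts[("cat", "قط")])  # 2
```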
There are two approaches that use corpora: statistical machine translation and example-based machine translation.
1.1.2.1. Statistical Machine translation
Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation.
The core process of SMT involves a ‘translation model’ which takes as input SL words or word sequences (‘phrases’) and produces as output TL words or word sequences. The following stage involves a ‘language model’ which synthesizes the sets of TL words into ‘meaningful’ strings which are intended to be equivalent to the input sentences. In SMT the preceding ‘analysis’ stage is represented by the (trivial) process of matching individual words or word sequences of the input SL text against entries in the translation model. More important is the essential preparatory stage of aligning SL and TL texts from a corpus and deriving the statistical frequency data for the ‘translation model’ (or adding statistical data from a corpus to a pre-existing ‘translation model’). The monolingual ‘language model’ may or may not be derived from the same corpus as the ‘translation model’.
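The interaction of translation model and language model can be sketched as a noisy-channel score, where the decoder ranks candidate target strings by the product of the two model probabilities. The probabilities below are made-up toy values, not derived from any corpus:

```python
import math

# Hedged sketch of noisy-channel scoring: candidates E are ranked by
# P(F|E) * P(E), computed here in log space with invented toy values.
translation_logprob = {  # log P(source word | target word), toy values
    ("maison", "house"): math.log(0.8),
    ("maison", "home"): math.log(0.2),
}
language_logprob = {  # log P(target word), toy unigram language model
    "house": math.log(0.6),
    "home": math.log(0.4),
}

def score(source_word, candidate):
    """Log of translation-model probability times language-model probability."""
    return translation_logprob[(source_word, candidate)] + language_logprob[candidate]

# The 'decoder' simply picks the highest-scoring candidate.
best = max(["house", "home"], key=lambda c: score("maison", c))
print(best)  # house
```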
There are many statistical models developed:
Word-based translation
In word-based translation, the fundamental unit of translation is a word in some natural language. Typically, the number of words in translated sentences differs, because of compound words, morphology and idioms. The ratio of the lengths of sequences of translated words is called fertility, which tells how many foreign words each native word produces. Such models necessarily assume that each word pair covers the same concept; in practice this is not really true. For example, the English word bank can be translated into Arabic as either بنك or شاطئ, depending on the sentence.
Simple word-based translation cannot translate between languages with different fertility. Word-based translation systems can relatively simply be made to cope with high fertility, mapping a single word to multiple words, but not the other way around. For example, if we were translating from English to Arabic, each word in English could produce any number of Arabic words, sometimes none at all. But there is no way to group two English words to produce a single Arabic word [Jurafsky, D. 2009].
Figure 2.4 Word-Based Machine Translation model
The job of the translation model, given an English sentence E and a foreign sentence F, is to assign a probability that E generates F. We can estimate these probabilities by considering how each individual word is translated.
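One simple way to obtain such per-word probabilities is a maximum-likelihood count over word-aligned pairs (real systems estimate them with EM over sentence alignments, as in the IBM models); the aligned pairs below are invented, echoing the bank example above:

```python
from collections import Counter

# Maximum-likelihood estimate of word translation probabilities from
# toy word-aligned pairs: t(tgt | src) = count(src, tgt) / count(src).
aligned_pairs = [("bank", "بنك"), ("bank", "شاطئ"), ("bank", "بنك"), ("bank", "بنك")]

counts = Counter(aligned_pairs)
totals = Counter(src for src, _ in aligned_pairs)

def t_prob(src, tgt):
    """Relative frequency of src being aligned to tgt."""
    return counts[(src, tgt)] / totals[src]

print(t_prob("bank", "بنك"))   # 0.75
print(t_prob("bank", "شاطئ"))  # 0.25
```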
· Phrase-Based Model
Modern statistical MT is based on the intuition that a better way to compute these probabilities is by considering the behavior of phrases. As we see in Fig. 2.5, entire phrases often need to be translated and moved as a unit. The intuition of phrase-based statistical MT is to use phrases (sequences of words) as well as single words as the fundamental units of translation [Jurafsky, D. 2009].
Figure 2.5 Complex reordering necessary when translating from English to German [Jurafsky, D. 2009]
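A minimal sketch of phrase-based lookup, using an invented phrase table, shows why an idiom must be translated as a unit rather than word by word:

```python
# Sketch of phrase-based translation: the unit of translation is a
# multi-word phrase, looked up in a hypothetical phrase table.
PHRASE_TABLE = {  # hypothetical English-German phrase pairs
    "kick the bucket": "den Löffel abgeben",
    "the bucket": "den Eimer",
}

def translate_phrases(sentence):
    """Greedy longest-match lookup (real decoders search over all segmentations)."""
    tokens = sentence.split()
    output, i = [], 0
    while i < len(tokens):
        # Try the longest phrase starting at position i first.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in PHRASE_TABLE:
                output.append(PHRASE_TABLE[phrase])
                i = j
                break
        else:
            output.append(tokens[i])  # pass unknown words through unchanged
            i += 1
    return " ".join(output)

# The idiom is matched as one unit, not translated word by word.
print(translate_phrases("kick the bucket"))  # den Löffel abgeben
```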
· Syntax-based translation
Syntax-based translation is based on the idea of translating syntactic units, rather than single words or strings of words (as in phrase-based MT), i.e. (partial) parse trees of sentences/utterances. The idea of syntax-based translation is quite old in MT, though its statistical counterpart did not take off until the advent of strong stochastic parsers in the 1990s. Examples of this approach include DOP-based MT and, more recently, synchronous context-free grammars [Jurafsky, D. 2009].
Figure 2.6 Statistical Based Machine Translation
1.1.2.2. Example Based machine Translation.
The Example-Based Machine Translation (EBMT) model is less clearly defined than the SMT model. Basically (if somewhat superficially), a system is an EBMT system if it uses segments (word sequences (strings) and not individual words) of source language (SL) texts extracted from a text corpus (its example database) to build texts in a target language (TL) with the same meaning. The basic units for EBMT are thus sequences of words (phrases). Within EBMT there is however a plethora of different methods, a multiplicity of techniques, many of which derive from other approaches: methods used in RBMT systems, methods found in SMT, some techniques used with translation memories (TM), etc. In particular, there seems to be no clear consensus on what EBMT is or what it is not. In the introduction to their collection of EBMT papers (Carl & Way 2003), the editors – probably wisely – refrain from attempting a definition, arguing that scientific fields can prosper without clear watertight frameworks, and indeed may thrive precisely because they are not so defined.
The basic processes of EBMT are: the alignment of texts, the matching of input sentences against phrases (examples) in the corpus, the selection and extraction of equivalent TL phrases, and the adaptation and combining of TL phrases into acceptable output sentences [Hutchins, J].
Figure 2.7 Example Base Machine Translation
The core process of EBMT is the selection and extraction of TL fragments corresponding to SL fragments. It is preceded by an ‘analysis’ stage for the decomposition of input sentences into appropriate fragments (or templates with variables) and their matching against SL fragments (in a database). Whether the ‘matching’ involves precompiled fragments (templates derived from the corpus), whether the fragments are derived at ‘runtime’, and whether the fragments (chunks) contain variables or not, are all secondary factors. The succeeding stage of synthesis (or ‘recombination’, as most EBMT authors refer to it) adapts the extracted TL fragments and combines them into TL (output) sentences. As in SMT, there are essential preparatory stages which align SL and TL sentences in the bilingual database and which derive any templates or patterns used in the processes of matching and extracting.
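The matching stage can be sketched with a simple similarity search over an invented example database; real EBMT systems also adapt and recombine the retrieved fragments, which this toy matcher omits:

```python
import difflib

# Sketch of the EBMT matching stage: find the stored example closest to
# the input sentence and reuse its stored translation.
EXAMPLE_DB = {  # hypothetical SL-TL example pairs
    "how old are you": "كم عمرك",
    "where are you from": "من أين أنت",
}

def match_example(input_sentence):
    """Return the TL side of the best-matching SL example and its similarity."""
    best = max(
        EXAMPLE_DB,
        key=lambda ex: difflib.SequenceMatcher(None, input_sentence, ex).ratio(),
    )
    sim = difflib.SequenceMatcher(None, input_sentence, best).ratio()
    return EXAMPLE_DB[best], sim

translation, sim = match_example("how old are you")
print(translation)  # كم عمرك
```

A near-match (e.g. "how old is he") would retrieve the same example with a lower similarity score, which is where the adaptation stage would take over.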
1.1.3. Hybrid Machine Translation
In the early 90s, statistical and rule-based approaches were seen in strict contrast. But their pros and cons are complementary:
| | Syntax | Structural Semantics | Lexical Semantics | Lexical Adaptivity |
| Rule-based MT | ++ | + | - | -- |
| Statistical MT | -- | -- | + | + |
| Example-based MT | - | -- | - | ++ |
Table 2-1 Two Different Types of Hybridisation (Eisele, A. 2007)
1.1.3.1. Deep Integration:
· Making a rule-based system adaptive by adding a module for rule learning
· Making a SMT system syntax-aware by adding syntactical constraints/rules
1.1.3.2. Shallow Integration:
Unlike deep integration, shallow integration systems do not attempt an exhaustive linguistic analysis. They are designed for specific tasks, ignoring many details of the input and of the linguistic (grammar) framework.
Utilizing rule-based (e.g. regular grammars) or statistics-based approaches, they are in general faster than deep integration systems, but only deliver flat, simple, partial, non-exhaustive representations.
In shallow integration we can combine systems by [Eisele, A. 2007]:
· creating training data with rule-based systems.
· using the rule-base system’s lexicon as training data.
· consensus translation of system outputs.
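Consensus translation over several engine outputs can be sketched as a simple majority vote; real multi-engine systems combine hypotheses at the word or phrase level (e.g. via confusion networks) rather than over whole sentences:

```python
from collections import Counter

# Sketch of consensus translation: pick the hypothesis produced by the
# most engines (whole-sentence majority vote, a deliberate simplification).
def consensus(outputs):
    """Return the most common output among several MT engines."""
    return Counter(outputs).most_common(1)[0][0]

engine_outputs = [
    "the house is green",  # e.g. a rule-based engine
    "the house is green",  # e.g. a statistical engine
    "house the green is",  # a third, disagreeing engine
]
print(consensus(engine_outputs))  # the house is green
```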
Figure 2.8 Multi-engine MT via black-box integration (as done in VerbMobil and earlier)
When building a hybrid machine translation system, we must first select which approaches to combine from RBMT, SMT and EBMT (usually RBMT is combined with one of the other approaches) and decide which of them is the basic approach. We must also choose which model from each approach to use; for example, from RBMT we can use the direct, transfer-based or interlingua model [Eisele, A. 2007].
