Saturday, 8 October 2011

Machine translation

1.      Machine translation

Machine translation of natural languages, known as MT, is a subfield of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. While semi-automated tools were applauded in the recent past as the most realistic path to follow, this is no longer the case.
The current consensus is that fully automated, efficient translation tools should remain the primary goal. The nature of the users of such systems and the types of text involved leave little room for continued dependence on human aids [Salem, Y. 2008]. MT was one of the earliest applications suggested for digital computers, but turning this dream into reality has proved a much harder, and in many ways a much more interesting, task than it first appeared.

1.1.      Machine Translation Approaches

Machine translation systems are developed using three main approaches, which differ in difficulty and complexity: Rule-Based Machine Translation (RBMT), Corpus-Based Machine Translation (CMT) and Hybrid Machine Translation.

1.1.1.     Rule-Based Machine Translation (RBMT)

Systems that use the rule-based approach perform transfer using a core of solid linguistic knowledge. The linguistic knowledge acquired for one natural language processing system may be reused to build the knowledge required for a similar task in another system. The advantage of the rule-based approach over the corpus-based approach is clear for:
1) less-resourced languages, for which large corpora, possibly parallel or bilingual, with representative structures and entities are neither available nor easily affordable;
2) morphologically rich languages, which suffer from data sparseness even when corpora are available.
These considerations have motivated many researchers to fully or partially follow the rule-based approach in developing their Arabic natural language processing tools and systems. In this paper we describe our successful efforts involving the rule-based approach for different Arabic natural language processing tasks.
The core process is mediated by bilingual dictionaries and rules for converting SL structures into TL structures, and/or by dictionaries and rules for deriving ‘intermediary representations’ from which output can be generated. The preceding stage of analysis interprets (surface) input SL strings into appropriate ‘translation units’ (e.g. canonical noun and verb forms) and relations (e.g. dependencies and syntactic units). The succeeding stage of synthesis (or generation) derives TL texts from the TL structures or representations produced by the core ‘transfer’ (or ‘interlingua’) process.
Figure 2.1 Rule Based Machine Translation Schema
·     RBMT problems
-   Complexity of grammar rules: rule interactions are unpredictable, coverage is incomplete, and multi-level representation raises problems (possibly artificial distinctions; failure at one stage means no result).
-   Complexity of dictionaries: incomplete coverage of meanings, selection restrictions.
-   Collocations, phrasal verbs, verb/noun phrases, etc.
-   Complex structures: complex tree-transduction rules; long sentences, embeddings, discontinuities, coordination.
-   Pronouns, anaphora, discourse phenomena.
-   Semantic problems: overcome to some extent by the use of knowledge bases (as in KBMT), but such knowledge bases are hugely complex.
-   Many difficulties are overcome or minimized in domain-restricted and/or controlled-language systems.
Rule-based machine translation approaches can be classified into the following categories: direct machine translation, transfer-based machine translation and interlingua machine translation.

1.1.1.1.    Direct Translation

In direct translation, we proceed word by word through the source language text, translating each word as we go. We make use of no intermediate structures, except for shallow morphological analysis; each source word is directly mapped onto some target word. Direct translation is thus based on a large bilingual dictionary; each entry in the dictionary can be viewed as a small program whose job is to translate one word. After the words are translated, simple reordering rules can apply [Jurafsky, D. 2007].
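As a sketch, this word-by-word process might look like the toy Python fragment below; the two-word lexicon, the adjective set, and the single adjective-noun reordering rule are invented for illustration, not drawn from any real system.

```python
# Toy sketch of direct translation: word-by-word lookup in a bilingual
# dictionary, followed by a simple reordering rule. All entries invented.
LEXICON = {"the": "ال", "white": "أبيض", "house": "بيت"}
ADJECTIVES = {"white"}

def translate_direct(words):
    # Each source word maps directly onto a target word, no intermediate structure.
    target = [LEXICON.get(w.lower(), w) for w in words]
    # Reordering rule: English adjective + noun -> Arabic noun + adjective.
    i = 0
    while i < len(words) - 1:
        if words[i].lower() in ADJECTIVES and words[i + 1].lower() not in ADJECTIVES:
            target[i], target[i + 1] = target[i + 1], target[i]
            i += 2
        else:
            i += 1
    return target
```

Running `translate_direct(["the", "white", "house"])` applies the lookup and then swaps the adjective behind the noun, as an Arabic noun phrase requires.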
Figure 2.2 Direct machine translation [Jurafsky, D. 2009]

1.1.1.2.    Transfer Based Machine Translation

The idea behind transfer-based MT is that producing a translation requires an intermediate representation that captures the "meaning" of the original sentence. In interlingua-based MT this intermediate representation must be independent of the languages in question, whereas in transfer-based MT it has some dependence on the language pair involved.
The way in which transfer-based machine translation systems work varies substantially, but in general they follow the same pattern: they apply sets of linguistic rules which are defined as correspondences between the structure of the source language and that of the target language. The first stage involves analysing the input text for morphology and syntax (and sometimes semantics) to create an internal representation. The translation is generated from this representation using both bilingual dictionaries and grammatical rules.
It is possible with this translation strategy to obtain fairly high quality translations, with accuracy in the region of 90%, although this figure depends heavily on the language pair in question, for example on how closely related the two languages are.
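The analysis-transfer-generation pattern described above can be sketched in miniature as follows; the toy "parser", the three-word lexicon, and the single structural rule are all invented for illustration.

```python
# Minimal sketch of the three transfer stages on one toy pattern.
def analyze(sentence):
    # Analysis: a toy "parse" of a det-adj-noun phrase into an internal structure.
    det, adj, noun = sentence.split()
    return {"det": det, "adj": adj, "noun": noun}

def transfer(sl_struct, lexicon):
    # Transfer: bilingual lexicon plus a structural rule recording that the
    # target language uses det-noun-adj order.
    return {
        "det": lexicon[sl_struct["det"]],
        "noun": lexicon[sl_struct["noun"]],
        "adj": lexicon[sl_struct["adj"]],
        "order": ["det", "noun", "adj"],
    }

def generate(tl_struct):
    # Generation: linearize the target-language structure.
    return " ".join(tl_struct[slot] for slot in tl_struct["order"])

lexicon = {"the": "ال", "white": "أبيض", "house": "بيت"}
result = generate(transfer(analyze("the white house"), lexicon))
```

The point of the sketch is that the reordering lives in the transfer rules over structures, not in the surface strings.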

1.1.1.3.    Interlingua Machine Translation

In this approach, the source language text is transformed into an interlingua, and the target language is then generated from the interlingua. Within the rule-based machine translation paradigm, the interlingua approach is an alternative to the direct approach and the transfer approach.
In the direct approach, words are translated directly without passing through an additional representation. In the transfer approach the source language is transformed into an abstract, less language-specific representation. Linguistic rules which are specific to the language pair then transform the source language representation into an abstract target language representation and from this the target sentence is generated.
An interlingua must represent all the information that a sentence expresses. A sentence generally carries various kinds of meaning, expressed through the words themselves, their combination, tense, aspect, mood and sentence style; how these things are expressed, of course, varies from language to language. An interlingua must represent all of this information in a universal way [Hiroshi, U].
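One way to picture such a representation is as a language-neutral predicate-argument frame from which separate monolingual generators produce each language; the frame layout and all lexical entries below are invented toy assumptions.

```python
# Sketch: one interlingua frame, two independent monolingual generators.
FRAME = {"event": "drink", "agent": "boy", "tense": "past"}

EN = {"drink": {"past": "drank", "present": "drinks"}, "boy": "the boy"}
AR = {"drink": "شرب", "boy": "الولد"}

def generate_en(frame):
    # English subject-verb order, with tense realized on the verb.
    return f"{EN[frame['agent']]} {EN[frame['event']][frame['tense']]}"

def generate_ar(frame):
    # Arabic verb-subject order, generated from the very same frame.
    return f"{AR[frame['event']]} {AR[frame['agent']]}"
```

Adding a new language means writing one new generator against the frame, rather than one transfer module per language pair.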

Figure 2.3 Interlingua Machine Translation
- Advantages: the interlingua approach requires fewer components to relate each source language to each target language, takes fewer components to add a new language, supports paraphrases of the input in the original language, allows both analyzers and generators to be written by monolingual system developers, and handles languages that are very different from each other (e.g. English and Arabic).
The second claimed advantage of interlingua-based systems is that an intermediate, language-neutral representation of meaning should be able to provide a neutral basis of comparison for equivalent texts that differ syntactically but share the same meaning. This would be of great use in related fields such as information retrieval, where the current state of the art relies largely on syntactic matching for the gathering of relevant information. If natural language could be easily transformed into a semantics-based interlingua, our ability to search for and find information could be dramatically improved [Lampert, A. 2004].
- Disadvantage: defining an interlingua is difficult, and perhaps impossible, for a wider domain. The ideal context for interlingua machine translation is thus multilingual machine translation in a very specific domain.

1.1.2.     Corpus-Based Machine Translation

The three classic architectures for MT (direct, transfer and interlingua) all provide answers to the questions of what representations to use and what steps to perform to translate. But true translation, which is both faithful to the source language and natural as an utterance in the target language, is sometimes impossible; if a translation is to be produced anyway, some compromise is needed. This is exactly what human translators do in practice: they produce translations that do tolerably well on both criteria.
In corpus-based machine translation we store, for each word, how many times it follows the preceding N words in the same language; these counts are used to determine sentence order and to predict the next word. We also store the relation between the two languages: each source-language word is kept in a large word table together with the number of times it is translated by each specific word in the target language.
The decoder uses this data to translate each word into a target-language word and to reorder the translated words into the most probable order according to the language model.
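The two count tables described above can be sketched directly; the two-sentence corpus and the toy word alignments are invented stand-ins for real training data.

```python
# Sketch of the count tables behind corpus-based MT: a monolingual table of
# next-word frequencies (for the language model) and a bilingual table of
# translation frequencies (for the translation model).
from collections import Counter, defaultdict

corpus = ["the boy ate", "the boy slept"]                 # toy monolingual corpus
aligned = [("book", "كتاب"), ("book", "كتاب"), ("book", "حجز")]  # toy word alignments

# Language-model counts: how often word B follows word A.
follow = defaultdict(Counter)
for sent in corpus:
    ws = sent.split()
    for a, b in zip(ws, ws[1:]):
        follow[a][b] += 1

# Translation counts: how often each source word maps to each target word.
trans = defaultdict(Counter)
for src, tgt in aligned:
    trans[src][tgt] += 1
```

From these raw counts, relative frequencies give the probabilities that the decoder later combines.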
There are two corpus-based approaches: statistical machine translation and example-based machine translation.

1.1.2.1.     Statistical Machine Translation

Statistical machine translation (SMT) is a machine translation  paradigm  where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation.
The core process of SMT involves a 'translation model' which takes as input SL words or word sequences ('phrases') and produces as output TL words or word sequences. The following stage involves a 'language model' which synthesizes the sets of TL words into 'meaningful' strings which are intended to be equivalent to the input sentences. In SMT the preceding 'analysis' stage is represented by the (trivial) process of matching individual words or word sequences of input SL text against entries in the translation model. More important is the essential preparatory stage of aligning SL and TL texts from a corpus and deriving the statistical frequency data for the 'translation model' (or adding statistical data from a corpus to a pre-existing 'translation model'). The monolingual 'language model' may or may not be derived from the same corpus as the 'translation model'.
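The interaction of the two models can be sketched as a noisy-channel score: the decoder picks the candidate that maximizes translation-model probability times language-model probability. All the probabilities below are invented toy numbers, not estimates from any corpus.

```python
# Noisy-channel sketch: score candidate target words by
# P(source | target) * P(target) and keep the best.
TM = {("bank", "بنك"): 0.7, ("bank", "شاطئ"): 0.3}  # toy translation model
LM = {"بنك": 0.02, "شاطئ": 0.001}                    # toy language model

def best_translation(src_word, candidates):
    return max(candidates, key=lambda t: TM[(src_word, t)] * LM[t])
```

Here the language model tips the choice: both senses are possible translations, but one is far more probable as target-language text.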
Several statistical models have been developed:
·  Word-based translation
In word-based translation, the fundamental unit of translation is a word in some natural language. Typically, the number of words in translated sentences differs, because of compound words, morphology and idioms. The ratio of the lengths of sequences of translated words is called fertility, which tells how many foreign words each native word produces. The model implicitly assumes that each word pair covers the same concept; in practice this is not really true. For example, the English word bank can be translated into Arabic as either بنك or شاطئ, depending on the sentence.
Simple word-based translation cannot translate between languages with different fertility. Word-based translation systems can relatively simply be made to cope with high fertility by mapping a single word to multiple words, but not the other way around. For example, when translating from English to Arabic, each English word may produce any number of Arabic words, sometimes none at all, but there is no way to group two English words into a single Arabic word [Jurafsky, D. 2009].

Figure 2.4 Word-Based Machine Translation model
The job of the translation model, given an English sentence E and a foreign sentence F, is to assign a probability that E generates F. We can estimate these probabilities by considering how each individual word is translated.
·  Phrase-Based Model
Modern statistical MT is based on the intuition that a better way to compute these probabilities is to consider the behavior of phrases. As we see in Fig. 2.5, entire phrases often need to be translated and moved as a unit. The intuition of phrase-based statistical MT is to use phrases (sequences of words), as well as single words, as the fundamental units of translation [Jurafsky, D. 2009].
Figure 2.5 Complex reordering necessary when translating from English to German [Jurafsky, D. 2009]
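Treating phrases as units can be sketched as a longest-match lookup in a phrase table, so that an idiom translates as a whole rather than word by word; the phrase-table entries below are invented for illustration.

```python
# Phrase-based sketch: greedy longest-match lookup in a toy phrase table.
PHRASES = {
    ("kick", "the", "bucket"): ["مات"],   # idiom translated as one unit
    ("the", "bucket"): ["الدلو"],
    ("kick",): ["يركل"],
}

def translate_phrases(words):
    out, i = [], 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for n in range(len(words) - i, 0, -1):
            chunk = tuple(words[i:i + n])
            if chunk in PHRASES:
                out += PHRASES[chunk]
                i += n
                break
        else:
            out.append(words[i])  # unknown word passes through
            i += 1
    return out
```

The same words translate differently depending on whether they form a known phrase, which is exactly what word-based models cannot capture.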
·  Syntax-based translation
Syntax-based translation is based on the idea of translating syntactic units, rather than single words or strings of words (as in phrase-based MT), i.e. (partial) parse trees of sentences/utterances. The idea of syntax-based translation is quite old in MT, though its statistical counterpart did not take off until the advent of strong stochastic parsers in the 1990s. Examples of this approach include DOP-based MT and, more recently, synchronous context-free grammars [Jurafsky, D. 2009].
Figure 2.6 Statistical-Based Machine Translation

1.1.2.2.     Example-Based Machine Translation

The Example-Based Machine Translation (EBMT) model is less clearly defined than the SMT model. Basically (if somewhat superficially), a system is an EBMT system if it uses segments (word sequences, not individual words) of source language (SL) texts, extracted from a text corpus (its example database), to build texts in a target language (TL) with the same meaning. The basic units for EBMT are thus sequences of words (phrases). Within EBMT there is, however, a plethora of different methods and a multiplicity of techniques, many of which derive from other approaches: methods used in RBMT systems, methods found in SMT, some techniques used with translation memories (TM), and so on. In particular, there seems to be no clear consensus on what EBMT is or is not. In the introduction to their collection of EBMT papers (Carl & Way 2003), the editors, probably wisely, refrain from attempting a definition, arguing that scientific fields can prosper without watertight frameworks, and indeed may thrive precisely because they are not so defined.
The basic processes of EBMT are: the alignment of texts, the matching of input sentences against phrases (examples) in the corpus, the selection and extraction of equivalent TL phrases, and the adaptation and combination of TL phrases into acceptable output sentences [Hutchins, J].
Figure 2.7 Example Base Machine Translation
The core process of EBMT is the selection and extraction of TL fragments corresponding to SL fragments. It is preceded by an 'analysis' stage for the decomposition of input sentences into appropriate fragments (or templates with variables) and their matching against SL fragments (in a database). Whether the 'matching' involves precompiled fragments (templates derived from the corpus), whether the fragments are derived at 'runtime', and whether the fragments (chunks) contain variables or not, are all secondary factors. The succeeding stage of synthesis (or 'recombination', as most EBMT authors refer to it) adapts the extracted TL fragments and combines them into TL (output) sentences. As in SMT, there are essential preparatory stages which align SL and TL sentences in the bilingual database and which derive any templates or patterns used in the processes of matching and extraction.
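The matching, extraction and recombination stages can be sketched as a lookup over a tiny example base of aligned fragments; the two example pairs below are invented.

```python
# EBMT sketch: match input chunks against aligned SL/TL example fragments,
# extract the TL sides, and recombine them into an output sentence.
EXAMPLES = {
    "he bought": "اشترى",
    "a new car": "سيارة جديدة",
}

def translate_ebmt(sentence):
    remaining, out = sentence, []
    while remaining:
        # Matching: consume the longest example fragment at this position.
        for src in sorted(EXAMPLES, key=len, reverse=True):
            if remaining.startswith(src):
                out.append(EXAMPLES[src])               # extraction
                remaining = remaining[len(src):].lstrip()
                break
        else:
            word, _, remaining = remaining.partition(" ")
            out.append(word)                            # no example found
    return " ".join(out)                                # recombination
```

A real system would also adapt the extracted fragments (inflection, agreement) before recombining them, which this sketch omits.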

1.1.3.     Hybrid Machine Translation

In the early 1990s, statistical and rule-based approaches were seen in strict contrast. But their pros and cons are complementary:


                    Syntax   Structural   Lexical     Lexical
                             semantics    semantics   adaptivity
Rule-based MT         ++         +            -           --
Statistical MT        --         --           +           +
Example-based MT      -          --           -           ++

Table 2‑1 Two Different Types of Hybridisation (Eisele, A. 2007)

1.1.3.1. Deep Integration:

·  Making a rule-based system adaptive by adding a module for rule learning
·  Making a SMT system syntax-aware by adding syntactical constraints/rules

1.1.3.2. Shallow Integration:

Unlike deep integration, shallow integration systems do not attempt an exhaustive linguistic analysis. They are designed for specific tasks, ignoring many details of the input and of the linguistic (grammar) framework.
Utilizing rule-based (e.g. regular grammars) or statistics-based approaches, they are in general faster than deep integration systems, but deliver only flat, simple, partial, non-exhaustive representations. Shallow integration includes [Eisele, A. 2007]:
·        Creating training data with rule-based systems.
·        Using the rule-based system's lexicon as training data.
·        Consensus translation of system outputs.
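Consensus translation over black-box engines can be sketched as a simple vote: each engine proposes an output and the most frequent proposal wins. The three engine outputs below are invented stand-ins.

```python
# Sketch of consensus translation: majority vote over whole-sentence
# outputs from several black-box MT engines.
from collections import Counter

def consensus(outputs):
    # most_common(1) returns the highest-count output; ties keep first-seen order.
    return Counter(outputs).most_common(1)[0][0]

engine_outputs = ["ذهب الولد", "ذهب الولد", "الولد ذهب"]
```

Real multi-engine systems vote at finer granularity (words or phrases), but the whole-sentence vote shows the principle.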
Figure 2.8 Multi-engine MT via black-box integration (as done in VerbMobil and earlier)


When building a hybrid machine translation system, we must first select which approaches to combine (RBMT, SMT and EBMT; usually RBMT is combined with one of the others) and decide which of them is the basic approach. We must also choose which model from each approach to use; for example, from RBMT we can use one of direct translation, transfer-based machine translation or interlingua machine translation [Eisele, A. 2007].

Saturday, 7 March 2009

World Languages
Rank, language(1): countries(2); population(3) (in millions)
1. Chinese, Mandarin: Brunei, Cambodia, China, Indonesia, Malaysia, Mongolia, Philippines, Singapore, S. Africa, Taiwan, Thailand; 1120
2. English: Australia, Belize, Botswana, Brunei, Cameroon, Canada, Eritrea, Ethiopia, Fiji, The Gambia, Ghana, Guyana, India, Ireland, Israel, Lesotho, Liberia, Malaysia, Micronesia, Namibia, Nauru, New Zealand, Palau, Papua New Guinea, Philippines, Samoa, Seychelles, Sierra Leone, Singapore, Solomon Islands, Somalia, S. Africa, Suriname, Swaziland, Tonga, U.K., U.S., Vanuatu, Zimbabwe, many Caribbean states, Zambia; 480
3. Spanish: Algeria, Andorra, Argentina, Belize, Benin, Bolivia, Chad, Chile, Colombia, Costa Rica, Cuba, Dominican Rep., Ecuador, El Salvador, Eq. Guinea, Guatemala, Honduras, Ivory Coast, Madagascar, Mali, Mexico, Morocco, Nicaragua, Niger, Panama, Paraguay, Peru, Spain, Togo, Tunisia, United States, Uruguay, Venezuela; 332
4. Arabic: Egypt, Sudan, Algeria, Morocco, Tunisia, Libya, Saudi Arabia, Syria, Jordan, Yemen, UAE, Oman, Iraq, Lebanon; 235
5. Bengali: Bangladesh, India, Singapore; 189
6. Hindi: India, Nepal, Singapore, S. Africa, Uganda; 182
7. Russian: Belarus, China, Estonia, Georgia, Israel, Kazakhstan, Kyrgyzstan, Latvia, Lithuania, Moldova, Mongolia, Russia, Turkmenistan, Ukraine, U.S., Uzbekistan; 180
8. Portuguese: Angola, Brazil, Cape Verde, France, Guinea-Bissau, Mozambique, Portugal, São Tomé and Príncipe; 170
9. Japanese: Japan, Singapore, Taiwan; 125
10. German: Austria, Belgium, Bolivia, Czech Rep., Denmark, Germany, Hungary, Italy, Kazakhstan, Liechtenstein, Luxembourg, Paraguay, Poland, Romania, Slovakia, Switzerland; 98
11. Chinese, Wu: China; 77.2
12. Javanese: Indonesia, Malaysia, Singapore; 75.5
13. Korean: China, Japan, N. Korea, S. Korea, Singapore, Thailand; 75
14. French: Algeria, Andorra, Belgium, Benin, Burkina Faso, Burundi, Cambodia, Cameroon, Canada, Chad, Comoros, Congo, Democratic Republic of the Congo, Djibouti, France, Gabon, Guinea, Haiti, Ivory Coast, Laos, Luxembourg, Madagascar, Mali, Mauritania, Monaco, Morocco, Niger, Rwanda, Senegal, Seychelles, Switzerland, Togo, Tunisia, Vanuatu, Vietnam; 72
15. Turkish: Bulgaria, Cyprus, Greece, Macedonia, Romania, Turkey, Uzbekistan; 69
16. Vietnamese: China, Vietnam; 67.7
17. Telugu: India, Singapore; 66.4
18. Chinese, Yue (Cantonese): Brunei, China, Costa Rica, Indonesia, Malaysia, Panama, Philippines, Singapore, Thailand, Vietnam; 66
19. Marathi: India; 64.8
20. Tamil: India, Malaysia, Mauritius, Singapore, S. Africa, Sri Lanka; 63.1
21. Italian: Croatia, Eritrea, France, Italy, San Marino, Slovenia, Switzerland; 59
22. Urdu: Afghanistan, India, Mauritius, Pakistan, S. Africa, Thailand; 58
23. Chinese, Min Nan: Brunei, China, Indonesia, Malaysia, Philippines, Singapore, Taiwan, Thailand; 49
24. Chinese, Jinyu: China; 45
25. Gujarati: India, Kenya, Pakistan, Singapore, S. Africa, Tanzania, Uganda, Zimbabwe; 44
26. Polish: Czech Rep., Germany, Israel, Poland, Romania, Slovakia; 44
27. Ukrainian: Poland, Slovakia, Ukraine; 41
28. Persian: Iran, Iraq, Afghanistan, Oman, Qatar, Tajikistan, U.A. Emirates; 37.3
29. Chinese, Xiang: China; 36
30. Malayalam: India, Singapore; 34
31. Chinese, Hakka: Brunei, China, Indonesia, Malaysia, Panama, Singapore, Suriname, Taiwan, Thailand; 34
32. Kannada: India; 33.7
33. Oriya: India; 31
34. Panjabi, Western: India, Pakistan; 30
35. Sunda: Indonesia; 27
35. Panjabi, Eastern: India, Kenya, Singapore; 26
36. Romanian: Hungary, Israel, Moldova, Romania, Serbia and Montenegro, Ukraine; 26
37. Bhojpuri: India, Mauritius, Nepal; 25
38. Azerbaijani, South: Afghanistan, Iran, Iraq, Syria, Turkey; 24.4
40. Maithili: India, Nepal; 24.3
41. Hausa: Benin, Burkina Faso, Cameroon, Ghana, Niger, Nigeria, Sudan, Togo; 24.2
43. Burmese: Bangladesh, Myanmar; 22
44. Serbo-Croatian(4): Bosnia and Herzegovina, Croatia, Macedonia, Serbia and Montenegro, Slovenia; 21
45. Chinese, Gan: China; 20.6
46. Awadhi: India, Nepal; 20.5
47. Thai: Singapore, Thailand, Malaysia; 20
48. Dutch: Belgium, France, Netherlands, Suriname; 20
49. Yoruba: Benin, Nigeria; 20
50. Sindhi: Afghanistan, India, Pakistan, Singapore; 19.7
(1) Many of the languages listed are technically dialects, not separate languages. They are listed separately because they differ from each other enough to be mutually unintelligible.
(2) The countries listed under Spanish, English, Portuguese, French, and Serbo-Croatian do not include those in which less than 1% of the population speaks the language as a first language.
(3) The population figures refer to first-language speakers in all countries and are general estimates.
(4) Serbo-Croatian is now known variously as Serbian, Croatian, or Bosnian, depending on the speaker's ethnic or political affiliation.



Friday, 6 February 2009

Natural language processing (NLP)

Natural language processing (NLP) is a field of computer science concerned with the interactions between computers and human (natural) languages. Natural language generation systems convert information from computer databases into readable human language. Natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate. Many problems within NLP apply to both generation and understanding; for example, a computer must be able to model morphology (the structure of words) in order to understand an English sentence, but a model of morphology is also needed for producing a grammatically correct English sentence.

NLP has significant overlap with the field of computational linguistics, and is often considered a sub-field of artificial intelligence. The term natural language is used to distinguish human languages (such as Spanish, Swahili or Swedish) from formal or computer languages (such as C++, Java or LISP). Although NLP may encompass both text and speech, work on speech processing has evolved into a separate field.

Tasks and limitations
In theory, natural-language processing is a very attractive method of human-computer interaction. Early systems such as SHRDLU, working in restricted "blocks worlds" with restricted vocabularies, worked extremely well, leading researchers to excessive optimism, which was soon lost when the systems were extended to more realistic situations with real-world ambiguity and complexity.

Natural-language understanding is sometimes referred to as an AI-complete problem, because natural-language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it. The definition of "understanding" is one of the major problems in natural-language processing.
Subproblems
Speech segmentation
In most spoken languages, the sounds representing successive letters blend into each other, so the conversion of the analog signal to discrete characters can be a very difficult process. Also, in natural speech there are hardly any pauses between successive words; the location of those boundaries usually must take into account grammatical and semantic constraints, as well as the context.
Text segmentation
Some written languages like Chinese, Japanese and Thai do not have single-word boundaries either, so any significant text parsing usually requires the identification of word boundaries, which is often a non-trivial task.
Part-of-speech tagging
Word sense disambiguation
Many words have more than one meaning; we have to select the meaning which makes the most sense in context.
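A classic sketch of this selection is Lesk-style gloss overlap: pick the sense whose dictionary gloss shares the most words with the sentence context. The two senses and their glosses below are invented for illustration.

```python
# Simplified Lesk-style word sense disambiguation: choose the sense whose
# gloss overlaps most with the context words.
SENSES = {
    "bank/finance": {"money", "deposit", "loan"},
    "bank/river": {"water", "shore", "river"},
}

def disambiguate(context_words):
    context = set(context_words)
    overlap = {sense: len(gloss & context) for sense, gloss in SENSES.items()}
    return max(overlap, key=overlap.get)
```

Given the context "he swam to the river shore", the river sense wins because its gloss shares two words with the sentence.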
Syntactic ambiguity
The grammar for natural languages is ambiguous, i.e. there are often multiple possible parse trees for a given sentence. Choosing the most appropriate one usually requires semantic and contextual information. Specific problem components of syntactic ambiguity include sentence boundary disambiguation.
Imperfect or irregular input
Foreign or regional accents and vocal impediments in speech; typing or grammatical errors, OCR errors in texts.
Speech acts and plans
A sentence can often be considered an action by the speaker. The sentence structure alone may not contain enough information to define this action. For instance, a question is actually the speaker requesting some sort of response from the listener. The desired response may be verbal, physical, or some combination. For example, "Can you pass the class?" is a request for a simple yes-or-no answer, while "Can you pass the salt?" is requesting a physical action to be performed. It is not appropriate to respond with "Yes, I can pass the salt," without the accompanying action (although "No" or "I can't reach the salt" would explain a lack of action).

Statistical NLP
Statistical natural-language processing uses stochastic, probabilistic and statistical methods to resolve some of the difficulties discussed above, especially those which arise because longer sentences are highly ambiguous when processed with realistic grammars, yielding thousands or millions of possible analyses. Methods for disambiguation often involve the use of corpora and Markov models. Statistical NLP comprises all quantitative approaches to automated language processing, including probabilistic modeling, information theory, and linear algebra. The technology for statistical NLP comes mainly from machine learning and data mining, both of which are fields of artificial intelligence that involve learning from data.
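Such Markov-style disambiguation can be sketched by scoring competing tokenizations of the same sentence with a bigram model; all the probabilities below are invented toy numbers.

```python
# Sketch: rank two candidate analyses of "time flies like an arrow"
# with a toy bigram (first-order Markov) model.
BIGRAM = {
    ("time", "flies"): 0.3, ("flies", "like"): 0.2,
    ("time-flies", "like"): 0.001,
    ("like", "an"): 0.4, ("an", "arrow"): 0.5,
}

def score(tokens):
    # Product of bigram probabilities, with a tiny floor for unseen pairs.
    p = 1.0
    for a, b in zip(tokens, tokens[1:]):
        p *= BIGRAM.get((a, b), 1e-6)
    return p
```

The ordinary reading scores far higher than the "time-flies" reading, so a statistical system would prefer it without any semantic knowledge.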

Major tasks in NLP
Automatic summarization
Foreign language reading aid
Foreign language writing aid
Information extraction
Information retrieval (IR) - IR is concerned with storing, searching and retrieving information. It is a separate field within computer science (closer to databases), but IR relies on some NLP methods (for example, stemming). Some current research and applications seek to bridge the gap between IR and NLP.
Machine translation - Automatically translating from one human language to another.
Named entity recognition (NER) - Given a stream of text, determining which items in the text map to proper names, such as people or places. Although in English, named entities are marked with capitalized words, many other languages do not use capitalization to distinguish named entities.
Natural language generation
Natural language understanding
Optical character recognition
Anaphora resolution
Question answering - Given a human language question, the task of producing a human-language answer. The question may be closed-ended (such as "What is the capital of Canada?") or open-ended (such as "What is the meaning of life?").
Speech recognition - Given a sound clip of a person or people speaking, the task of producing a text dictation of the speaker(s). (The opposite of text to speech.)
Spoken dialogue system
Text simplification
Text-to-speech
Text-proofing

Concrete problems
Some examples of the problems faced by natural-language-understanding systems:

The sentences "We gave the monkeys the bananas because they were hungry" and "We gave the monkeys the bananas because they were over-ripe" have the same surface grammatical structure. However, the pronoun they refers to monkeys in one sentence and bananas in the other, and it is impossible to tell which without a knowledge of the properties of monkeys and bananas.
A string of words may be interpreted in different ways. For example, the string "Time flies like an arrow" may be interpreted in a variety of ways:
The common simile: time moves quickly just like an arrow does;
measure the speed of flies like you would measure that of an arrow (thus interpreted as an imperative) - i.e. (You should) time flies as you would (time) an arrow.;
measure the speed of flies like an arrow would - i.e. Time flies in the same way that an arrow would (time them).;
measure the speed of flies that are like arrows - i.e. Time those flies that are like arrows;
all of a type of flying insect, "time-flies," collectively enjoys a single arrow (compare Fruit flies like a banana);
each of a type of flying insect, "time-flies," individually enjoys a different arrow (similar comparison applies);
A concrete object, for example the magazine, Time, travels through the air in an arrow-like manner.
English is particularly challenging in this regard because it has little inflectional morphology to distinguish between parts of speech.

English and several other languages don't specify which word an adjective applies to. For example, in the string "pretty little girls' school".
Does the school look little?
Do the girls look little?
Do the girls look pretty?
Does the school look pretty?
We will often imply additional information in spoken language by the way we place stress on words. The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it. Depending on which word the speaker places the stress, this sentence could have several distinct meanings:
"I never said she stole my money" - Someone else said it, but I didn't.
"I never said she stole my money" - I simply didn't ever say it.
"I never said she stole my money" - I might have implied it in some way, but I never explicitly said it.
"I never said she stole my money" - I said someone took it; I didn't say it was she.
"I never said she stole my money" - I just said she probably borrowed it.
"I never said she stole my money" - I said she stole someone else's money.
"I never said she stole my money" - I said she stole something, but not my money.
Evaluation of natural language processing

Objectives
The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system, in order to determine whether (or to what extent) the system answers the goals of its designers, or meets the needs of its users. Research in NLP evaluation has received considerable attention, because the definition of proper evaluation criteria is one way to specify precisely an NLP problem, going thus beyond the vagueness of tasks defined only as language understanding or language generation. A precise set of evaluation criteria, which includes mainly evaluation data and evaluation metrics, enables several teams to compare their solutions to a given NLP problem.
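Shared evaluation metrics of this kind can be as simple as the sketch below, which scores a system output against a reference by token precision; the metric choice, the outputs and the reference are illustrative assumptions, not a standard benchmark.

```python
# Sketch of a shared evaluation metric: token precision of a system
# output against a single reference text.
def token_precision(system, reference):
    sys_tokens = system.split()
    ref_tokens = set(reference.split())
    hits = sum(1 for t in sys_tokens if t in ref_tokens)
    return hits / len(sys_tokens)
```

Because every team computes the same number on the same evaluation data, systems become directly comparable, which is exactly what a precise set of evaluation criteria provides.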

Short history of evaluation in NLP
The first evaluation campaign on written texts seems to have been a campaign dedicated to message understanding in 1987 (Pallett 1998). The Parseval/GEIG project then compared phrase-structure grammars (Black 1991). A series of campaigns within the Tipster project addressed tasks such as summarization, translation and searching (Hirschman 1998). In 1994, in Germany, the Morpholympics compared German taggers. The Senseval and Romanseval campaigns were then conducted with the objective of semantic disambiguation. In 1996, the Sparkle campaign compared syntactic parsers in four different languages (English, French, German and Italian). In France, the Grace project compared a set of 21 taggers for French in 1997 (Adda 1999). In 2004, during the Technolangue/Easy project, 13 parsers for French were compared. Large-scale evaluation of dependency parsers was performed in the context of the CoNLL shared tasks in 2006 and 2007. In Italy, the Evalita campaign was conducted in 2007 to compare various tools for Italian. In France, within the ANR-Passage project (end of 2007), 10 parsers for French were compared.

Adda G., Mariani J., Paroubek P., Rajman M. 1999 L'action GRACE d'évaluation de l'assignation des parties du discours pour le français [The GRACE campaign for evaluating part-of-speech assignment for French]. Langues vol. 2
Black E., Abney S., Flickinger D., Gdaniec C., Grishman R., Harrison P., Hindle D., Ingria R., Jelinek F., Klavans J., Liberman M., Marcus M., Roukos S., Santorini B., Strzalkowski T. 1991 A procedure for quantitatively comparing the syntactic coverage of English grammars. DARPA Speech and Natural Language Workshop
Hirschman L. 1998 Language understanding evaluation: lessons learned from MUC and ATIS. LREC, Granada
Pallett D.S. 1998 The NIST role in automatic speech recognition benchmark tests. LREC, Granada
Different types of evaluation
Depending on the evaluation procedures, a number of distinctions are traditionally made in NLP evaluation.
Intrinsic vs. extrinsic evaluation
Intrinsic evaluation considers an isolated NLP system and characterizes its performance mainly with respect to a gold standard result, pre-defined by the evaluators. Extrinsic evaluation, also called evaluation in use considers the NLP system in a more complex setting, either as an embedded system or serving a precise function for a human user. The extrinsic performance of the system is then characterized in terms of its utility with respect to the overall task of the complex system or the human user. For example, consider a syntactic parser that is based on the output of some new part of speech (POS) tagger. An intrinsic evaluation would run the POS tagger on some labelled data, and compare the system output of the POS tagger to the gold standard (correct) output. An extrinsic evaluation would run the parser with some other POS tagger, and then with the new POS tagger, and compare the parsing accuracy.
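The intrinsic half of this POS-tagger example can be sketched as a per-token accuracy computation against the gold standard; the tag sequences below are illustrative Penn-Treebank-style labels, not the output of a real tagger:

```python
def tag_accuracy(system_tags, gold_tags):
    """Intrinsic evaluation: per-token accuracy of a tagger's
    output against gold-standard tags for the same tokens."""
    assert len(system_tags) == len(gold_tags)
    correct = sum(s == g for s, g in zip(system_tags, gold_tags))
    return correct / len(gold_tags)

gold       = ["PRP", "VBD", "PRP", "VBD", "PRP$", "NN"]
new_tagger = ["PRP", "VBD", "PRP", "VBD", "PRP$", "NN"]
old_tagger = ["PRP", "VBD", "PRP", "NN",  "PRP$", "NN"]
print(tag_accuracy(new_tagger, gold))  # 1.0
print(tag_accuracy(old_tagger, gold))  # 5/6
```

An extrinsic evaluation would instead feed each tagger's output into the parser and compare the resulting parsing accuracy, ignoring the tag-level scores.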

Black-box vs. glass-box evaluation
Black-box evaluation requires one to run an NLP system on a given data set and to measure a number of parameters related to the quality of the process (speed, reliability, resource consumption) and, most importantly, to the quality of the result (e.g. the accuracy of data annotation or the fidelity of a translation). Glass-box evaluation looks at the design of the system, the algorithms that are implemented, the linguistic resources it uses (e.g. vocabulary size), etc. Given the complexity of NLP problems, it is often difficult to predict performance only on the basis of glass-box evaluation, but this type of evaluation is more informative with respect to error analysis or future developments of a system.
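A black-box harness only needs to run the system on the data and inspect its input/output behaviour, recording both process qualities (here, wall-clock time) and result quality (here, exact-match accuracy). A minimal sketch, with a trivial stand-in "system" in place of a real NLP tool:

```python
import time

def black_box_evaluate(system, data, gold):
    """Black-box evaluation: run the system on a data set and
    measure process quality (time) and result quality (accuracy)
    without inspecting how the system works internally."""
    start = time.perf_counter()
    output = [system(item) for item in data]
    elapsed = time.perf_counter() - start
    accuracy = sum(o == g for o, g in zip(output, gold)) / len(gold)
    return {"seconds": elapsed, "accuracy": accuracy}

# Toy "system": lowercases tokens (a stand-in for a real NLP tool).
report = black_box_evaluate(str.lower, ["The", "Cat"], ["the", "cat"])
print(report["accuracy"])  # 1.0
```

A glass-box evaluation, by contrast, would open up `system` itself and examine its algorithms and resources.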

Automatic vs. manual evaluation
In many cases, automatic procedures can be defined to evaluate an NLP system by comparing its output with the gold standard (or desired) one. Although the cost of producing the gold standard can be quite high, automatic evaluation can be repeated as often as needed without much additional cost (on the same input data). However, for many NLP problems the definition of a gold standard is a complex task, and it can prove impossible when inter-annotator agreement is insufficient. Manual evaluation is performed by human judges, who are instructed to estimate the quality of a system, or most often of a sample of its output, based on a number of criteria. Although, thanks to their linguistic competence, human judges can be considered the reference for a number of language processing tasks, there is also considerable variation across their ratings. This is why automatic evaluation is sometimes referred to as objective evaluation, while the human kind appears to be more subjective.
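Inter-annotator agreement is commonly quantified with a chance-corrected measure such as Cohen's kappa. A minimal sketch for two annotators labelling the same items (the labels below are made up for illustration):

```python
from collections import Counter

def cohens_kappa(annotator_a, annotator_b):
    """Cohen's kappa: observed agreement between two annotators,
    corrected for the agreement expected by chance given each
    annotator's label distribution."""
    n = len(annotator_a)
    observed = sum(a == b for a, b in zip(annotator_a, annotator_b)) / n
    counts_a = Counter(annotator_a)
    counts_b = Counter(annotator_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(cohens_kappa(a, b))  # 1/3: observed 4/6 against 1/2 expected by chance
```

A kappa near zero means the annotators agree no more than chance would predict; when kappa stays low, a reliable gold standard for automatic evaluation cannot be constructed.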


Shared tasks (Campaigns)
BioCreative
Message Understanding Conference
Technolangue/Easy
Text Retrieval Conference

Standardization in NLP
An ISO sub-committee is working in order to ease interoperability between Lexical resources and NLP programs. The sub-committee is part of ISO/TC37 and is called ISO/TC37/SC4. Some ISO standards are already published but most of them are under construction, mainly on lexicon representation (see LMF), annotation and data category registry.


Journals
Computational Linguistics
Language Resources and Evaluation
Linguistic Issues in Language Technology

Organizations and conferences

Associations
Association for Computational Linguistics
Association for Machine Translation in the Americas
AFNLP - Asian Federation of Natural Language Processing Associations
Australasian Language Technology Association (ALTA)

Conferences
Language Resources and Evaluation

Software tools
General Architecture for Text Engineering
Natural Language Toolkit (NLTK): a Python library suite
Expert System S.p.A.
OpenNLP
MontyLingua
NLP Software Packages - free software packages for NLP research, including a semantic role labeler, named entity tagger, coreference resolution, and more. This is also the home of Learning-Based Java (a machine learning framework) and Sparse Network of Winnows (a learning architecture).

See also
Biomedical text mining
Chatterbot
Compound term processing
Computational linguistics
Computer-assisted reviewing
Controlled natural language
Human language technology
Information retrieval
Latent semantic indexing
Lexical markup framework
Lojban / Loglan
Transderivational search
Speech Recognition

Implementations
Cypher, a framework for transforming natural language phrases and statements into SPARQL and RDF. Uses the Metalanguage Ontology to describe language constructs such as phrase grammars, morphology rules and lexicons.
Infonic Sentiment, an NLP-based news analysis software package that reads news flows and provides news sentiment signals for the algorithmic trading systems of investment banks
LinguaStream, a generic platform for NLP experimentation
MARF, a framework for voice and statistical NLP processing
Nortel Speech Server, a speech processing system primarily used for large-vocabulary speech recognition, natural-language understanding, text-to-speech, and speaker verification

References
Christopher D. Manning, Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press (1999), ISBN 978-0262133609, p. xxxi
http://en.wikipedia.org/wiki/Natural_language_processing
Related academic articles
Bates, M. (1995). Models of natural language understanding. Proceedings of the National Academy of Sciences of the United States of America, Vol. 92, No. 22 (Oct. 24, 1995), pp. 9977-9982.