Arabic NLP: Overview, the State of the Art
Challenges and Opportunities
Ali Farghaly
Overview (1)
Challenges
1. To the Arabic language and culture
2. To Arabic NLP
a – Inherent properties of Arabic
b – Problems of Arabic linguistics
Overview (2)
• Inherent Opportunities for the Arabic Language
1. Classical Arabic has survived 15 centuries;
other languages failed to do so
2. Arabic is capable of reinventing itself
3. Classical Arabic is a living language in which 1.4
billion Muslims perform their daily prayers
4. The significance of the Arabic language culturally,
strategically and linguistically
Overview (3)
Why is NLP important?
• Fundamental transition from the Industrial Economy
to the Knowledge Economy in the 1980s and 1990s
• Knowledge is coded in Language
• Necessity for NLP Systems to categorize, retrieve,
translate, and/or answer questions from unstructured
Overview (4)
• NLP History
• Four generations of NLP
• Disappointment with the First Generation of Machine
Translation Systems, ALPAC Report (1966)
• Second Generation of NLP Systems (1970s-1980s)
Overview (5)
• Third Generation NLP Systems (1990s – present)
• Success of Statistical Approaches
• Problems with Statistical Approaches
• The Emergence of the Hybrid Approach (4th Generation)
Overview (6)
Future Directions in Arabic NLP
• New Attitude towards Arabic Grammar
• Focus on Constituency
• The Need for Arabic Language Planning
Overview (7)
• Deal with syntactic ambiguity, co-reference,
unbounded dependencies, phrasal constituencies,
Pro Drop, etc.
• Clear Objectives of Arabic NLP for the Arab World
• Could be different from Arabic NLP for the Western
• Conclusion
Challenges (1)
• To the Arabic language and culture
• The English language is becoming the language of
the World Wide Web: emails, blogs, chats etc. taking
away functionalities from Arabic
• Number of books, papers published in the Arab
countries is minimal compared to that produced in
the USA and English speaking countries
• Thus, we consume rather than produce knowledge
• No first-class research universities in the Arab world
Challenges (2)
• Even when we report research, we do not use Arabic
• Globalization has intensified the influence of the
Western culture in the Arab World
• Almost all Arab universities teach science and
mathematics in a foreign language
Challenges (3)
• To Arabic NLP
• Inherent properties of the Arabic language
1. The Arabic script (no short vowels and no
2. Explosion of ambiguity (an average of 2.3 analyses per word in
other languages versus 19.2 in Arabic)
Example: 22 analyses of “ثمن” by Buckwalter (2004)
Challenges (4)
• 3. Complex word structure
e.g. “ورأيتهم” ‘and I saw them’
• 4. The problem of normalization
آ، إ، أ، ا are all normalized to ا
losing the distinction between أن، إن، آن
5. Arabic as a Pro Drop Language
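The normalization problem can be illustrated with a minimal sketch. It assumes a common preprocessing choice (stripping short-vowel diacritics and collapsing the alef variants to bare alef); the function name `normalize` and the character ranges are illustrative choices, not taken from any particular Arabic NLP system.

```python
import re

# Alef variants that are collapsed into bare alef (assumed normalization rule).
ALEF_VARIANTS = "\u0622\u0623\u0625"  # alef with madda, hamza above, hamza below
BARE_ALEF = "\u0627"
# Short-vowel and related marks, fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")

def normalize(word: str) -> str:
    """Strip diacritics, then map every alef variant to bare alef."""
    word = DIACRITICS.sub("", word)
    return re.sub(f"[{ALEF_VARIANTS}]", BARE_ALEF, word)

# The three words from the slide: alef-hamza-above + noon,
# alef-hamza-below + noon, alef-madda + noon.
words = ["\u0623\u0646", "\u0625\u0646", "\u0622\u0646"]
normalized = [normalize(w) for w in words]

print(len(set(words)), "distinct forms before normalization")
print(len(set(normalized)), "distinct form(s) after normalization")
```

Three distinct words enter and one surface form comes out, which helps retrieval recall but adds to the ambiguity described in point 2.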
• The Arabic language can meet all the needs of its
• The Arabs were producers of knowledge at a time
when the rest of the world were consumers of knowledge
• Contemporary Arab scholars proved their ability
to produce knowledge
Opportunities (1)
Lessons from recent history
• Unprecedented accumulation of knowledge
1. Dramatic increase in the number of academic
2. Huge investment in R & D companies
3. Fundamental changes in industry and society
similar to the Industrial Revolution
4. Impressive progress in many fields such as
medicine, space exploration, computer
software and hardware development etc.
The Knowledge Economy
• Fundamental Aspects of the Knowledge Economy
1. Strategic product is knowledge rather than
manufactured goods
2. Industrial workers are replaced by
knowledge workers
3. Global labor market
4. Democratization of knowledge
The Knowledge Economy & NLP
• The age of on-line information, electronic
communication, World Wide Web (www)
• Millions of documents are created every minute –
from KB -> MB -> GB -> terabytes
• Explosion of knowledge can lead to explosion of
The Knowledge Economy & NLP
• Democratization of knowledge through the use of the
computer/cell phone as a communication tool
• Governments, industry, academia, and individuals
desperately need tools to process information
• Information is coded in natural language
The Knowledge Economy & NLP
• Globalization -> Multilingual applications such as
machine translation and cross language applications
• Information Retrieval (IR) and Information Extraction
(IE) are becoming increasingly important
• Keyword search is being replaced by question
answering systems
• Knowledge is encoded in natural language
NLP - Flashback
• The invention of the computer and language
- First application: breaking the Nazi’s secret
- Second application: Russian to English
machine translation (Weaver, 1949)
1st Generation of MT
Principles of the first generation
• Capitalized on the speed lookup offered by the
• MT is essentially a matter of correct pairing of the
source language expressions with the target
language equivalents
• Trivial reordering of words
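The first-generation principles above can be sketched as a dictionary lookup followed by one local reordering rule. This is a toy illustration only: the French-to-English `LEXICON`, the word-class sets, and the function name `direct_translate` are hypothetical and do not come from any historical system.

```python
# Hypothetical toy bilingual dictionary (French -> English).
LEXICON = {"le": "the", "chat": "cat", "noir": "black", "dort": "sleeps"}
# Hypothetical word classes used by the single reordering rule.
NOUNS = {"cat"}
ADJECTIVES = {"black"}

def direct_translate(sentence: str) -> str:
    # Step 1: word-for-word lookup; unknown words pass through unchanged.
    words = [LEXICON.get(w, w) for w in sentence.lower().split()]
    # Step 2: trivial reordering, noun + adjective -> adjective + noun.
    i = 0
    while i < len(words) - 1:
        if words[i] in NOUNS and words[i + 1] in ADJECTIVES:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2
        else:
            i += 1
    return " ".join(words)

print(direct_translate("le chat noir dort"))   # the black cat sleeps
print(direct_translate("le chien dort"))       # the chien sleeps
```

The second call shows the weakness: anything missing from the dictionary passes through untranslated, and no structure beyond adjacent word pairs is ever consulted.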
Problems with 1st Generation MT
• Naïve concept of language structure
• Heavy reliance on bilingual dictionaries
• No attempt to mimic human translation
• Unrealistic goals and promises
2nd Generation MT (1)
Principles of the Transfer Approach
Three Components
1. Analysis of the source language (SL)
2. Transfer of the SL structure to the target language (TL)
3. Generation of TL surface forms
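The three components can be sketched as three small functions chained together. The example assumes a transliterated Arabic verb-subject-object sentence and a three-entry lexicon; `Clause`, `analyze`, `transfer`, `generate`, and `LEXICON` are hypothetical names, and a real transfer system would use a full parser and grammar at each stage.

```python
from dataclasses import dataclass

@dataclass
class Clause:
    verb: str
    subject: str
    obj: str

# Hypothetical transliterated Arabic -> English lexicon.
LEXICON = {"kataba": "wrote", "alwaladu": "the boy", "risaalatan": "a letter"}

def analyze(sentence: str) -> Clause:
    """Analysis: read a three-word verb-subject-object SL sentence."""
    verb, subject, obj = sentence.split()
    return Clause(verb, subject, obj)

def transfer(sl: Clause) -> Clause:
    """Transfer: map each SL item in the structure to its TL equivalent."""
    return Clause(LEXICON[sl.verb], LEXICON[sl.subject], LEXICON[sl.obj])

def generate(tl: Clause) -> str:
    """Generation: linearize the TL structure in subject-verb-object order."""
    return f"{tl.subject} {tl.verb} {tl.obj}"

print(generate(transfer(analyze("kataba alwaladu risaalatan"))))
# the boy wrote a letter
```

Unlike word-for-word lookup, the word order change is driven by the analyzed structure, not by adjacent-word patterns.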
2nd Generation MT (2)
Basic Principles
• Linguistic knowledge is essential for the
understanding of the source text
• Target specific domains for better translation
• More realistic goals and promises
2nd Generation MT (3)
• Positive developments in NLP technology: chart
parsing (Woods 1970), unification grammar (Shieber
1986), definite clause grammar (Pereira 1980)
• Driven by the commercial market: The Georgetown
System, Pan American Health Organization,
EUROTRA Project (Interlingua approach) etc.
• Emergence of lexical approaches to grammar
Problems with 2nd Generation MT
• Linguistic knowledge is expensive
• Explosion of syntactic ambiguity (300 parses for
each input sentence)
• Needed huge computing power
• Limited successes: The METAL system and the
Canadian weather forecast translation system
Statistical Approaches to NLP
• Built on Probability theory
• Works well for specific domains
• Relies on training data (machine learning)
• Very fast
• Does not require linguistic knowledge
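How probabilities can be learned from training data without linguistic knowledge can be sketched with a word-translation model trained by expectation-maximization, in the style of IBM Model 1. The slides do not name a specific model, so this choice is an assumption; the three-sentence German-English corpus and all names below are hypothetical toy material.

```python
# Hypothetical toy parallel corpus (source, target); not real training data.
corpus = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]
pairs = [(s.split(), t.split()) for s, t in corpus]
tgt_vocab = {w for _, tgt in pairs for w in tgt}

# t[(e, f)] approximates P(target word e | source word f).
# Start uniform over every pair that co-occurs in some sentence pair.
t = {}
for src, tgt in pairs:
    for e in tgt:
        for f in src:
            t[(e, f)] = 1.0 / len(tgt_vocab)

for _ in range(20):  # EM iterations
    count = {key: 0.0 for key in t}
    total = {}
    for src, tgt in pairs:
        for e in tgt:
            # E-step: split one count for e across the source words.
            norm = sum(t[(e, f)] for f in src)
            for f in src:
                frac = t[(e, f)] / norm
                count[(e, f)] += frac
                total[f] = total.get(f, 0.0) + frac
    # M-step: re-estimate probabilities from the expected counts.
    for (e, f), c in count.items():
        t[(e, f)] = c / total[f]

def best_translation(f: str) -> str:
    """Most probable target word for source word f under the learned table."""
    return max(tgt_vocab, key=lambda e: t.get((e, f), 0.0))

for f in ["das", "haus", "buch", "ein"]:
    print(f, "->", best_translation(f))
```

No dictionary or grammar is supplied; the pairings emerge only from co-occurrence statistics, which is also why performance depends so heavily on how closely new input resembles the training data.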
3rd Generation of MT Systems (1)
• Relies on the machine learning approach
• Benefits from the existence of huge corpora through
the Internet
• Low development cost
• Rapid development time
3rd Generation of MT Systems (2)
• Heavy reliance on parallel corpora at several levels
• Does not require any linguistic knowledge: “Give me
a large enough parallel corpus, and I will give you a machine
translation system in hours”
• Represents an empirical approach to language: “The
proof is in the pudding” (Manning 2000)
• Unlike the transfer approach, does not attempt to
mimic human translators
3rd Generation MT Systems (3)
Benefited from
• Computers becoming much faster, more powerful
and less costly
• Accumulation of huge corpora on the Internet
• Availability of annotated Treebanks for training
(Linguistic Data Consortium)
Problems with 3rd Generation MT Systems
• Performs well when dealing with data similar
to the training set
• Performance deteriorates when documents
are different from training set
• There comes a point when adding more training
data does not improve performance (The Threshold
Problems with 3rd Generation MT Systems
• There are domains where data is sparse
• Sometimes the training data itself is noisy (full of
• Does not provide any insight into language,
linguistics or the translation process
Arabic NLP Goals
1. Transfer of knowledge and technology to the Arab World
2. Modernize and fertilize the Arabic language
3. Improve and modernize Arabic linguistics
4. Make information retrieval, extraction, summarization and
translation available to the Arab user
Arabic NLP History (1)
Followed and integrated with mainstream NLP
1978 - 1989
Kuwait: Mohammed Al-Sharikh & Nabil Ali – Sakhr
Morocco: Hlaal (1979, 1985) on Arabic morphology
Holland: Everhard Ditter on MSA
US: The Weidner English/Arabic MT system
Arabic NLP History (2)
IBM Scientific Centers in Kuwait and Cairo
France: The Dinar Lexical Data Base, Joseph Dichy
Language Resources and Human Language
Technology work (ELRA/Elda Choukry)
Arabic NLP History (3)
• The Language Weaver Statistical Arabic to English
MT system
• The SYSTRAN Arabic to English MT system
• The Apptek Arabic to English Hybrid MT
• The LDC Arabic Treebank (University of Pennsylvania)
Arabic NLP History (4)
• The Prague Dependency Arabic Treebank
• Arabic Entity Extraction (Shaalan 2007; Zitouni 2008)
• Arabic Dialects Modeling Project at Columbia
University, USA (Diab and Habash, 2007)
Future Directions in Arabic NLP (1)
New Attitude toward Arabic Grammar
• The need for explicit description of MSA
Consider the idafa:
مدير البنك ‘the bank manager’
حاد الذكاء ‘sharply intelligent’ (lit. ‘sharp of intelligence’)
فوق المنزل ‘above the house’
Future Directions in Arabic NLP (2)
• The first is a noun phrase
• The second is an adjectival phrase
• The third is a prepositional phrase
• The description of all as idafa is not helpful
to Arabic NLP
Future Directions in Arabic NLP (3)
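• We need to focus on constituency without
case endings. Consider:
قال الرجل أن الوزير قد استدعاه ‘The man said that the minister had summoned him’
قال الرجل أن الوزير قد أقاله الرئيس ‘The man said that the president had dismissed the minister’
In the first, alwaziir is the subject of the embedded clause; in the second, it is
the object. In both sentences it is marked accusative.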
Future Directions in Arabic NLP (4)
• We need to describe rules for Arabic anaphoric relations
• Subjectless sentences (Pro Drop)
• Discourse Analysis
• Arabic love of nominalization
Future Directions in Arabic NLP (5)
Defining MSA
• Mark differences between MSA and CA
• New Arabic grammars - acknowledging the
heritage while being liberated from the paradigm
• A grammar that is more relevant to Arabic
Information Retrieval and Arabic MT
Conclusion (1)
• Arabic NLP can help in transforming Arab societies
• Good progress has been achieved in Arabic NLP
• More explicit grammar of MSA will enhance
and speed the development of NLP systems
• Arabic needs to be restored as the language
of science and research
Conclusion (2)
• Standards of usage need to be enforced to
preserve Arabic as the expression of the Arabic
• Linguists need to do their homework by writing
explicit grammars for discourse analysis, anaphoric
relations, syntactic structures, etc.
