Arabic NLP: Overview, the State of the Art, Challenges and Opportunities
Ali Farghaly

Overview (1)
Challenges
1. To the Arabic language and culture
2. To Arabic NLP
   a. Inherent properties of Arabic
   b. Problems of Arabic linguistics

Overview (2)
• Inherent opportunities for the Arabic language
  1. Classical Arabic has survived 15 centuries; other languages failed to do so
  2. Arabic is capable of reinventing itself
  3. Classical Arabic is a living language in which 1.4 billion Muslims perform their daily prayers
  4. The significance of the Arabic language culturally, strategically and linguistically

Overview (3)
Why is NLP important?
• Fundamental transition from the Industrial Economy to the Knowledge Economy in the 1980s and 1990s
• Knowledge is coded in language
• Necessity for NLP systems to categorize, retrieve, translate, and/or answer questions from unstructured texts

Overview (4)
• NLP history
• Four generations of NLP
• Disappointment with the first generation of machine translation systems: the ALPAC Report (1966)
• Second generation of NLP systems (1970s-1980s)

Overview (5)
• Third generation of NLP systems (1990s to present)
• Success of statistical approaches
• Problems with statistical approaches
• The emergence of the hybrid approach (4th generation?)

Overview (6)
Future Directions in Arabic NLP
• New attitude towards Arabic grammar
• Focus on constituency
• The need for Arabic language planning

Overview (7)
• Deal with syntactic ambiguity, co-reference, unbounded dependencies, phrasal constituencies, Pro Drop, etc.
• Clear objectives of Arabic NLP for the Arab world
• Could be different from Arabic NLP for the Western world
• Conclusion

Challenges (1)
• To the Arabic language and culture
• The English language is becoming the language of the World Wide Web (emails, blogs, chats, etc.),
  taking functions away from Arabic
• The number of books and papers published in the Arab countries is minimal compared to that produced in the USA and other English-speaking countries
• Thus, we consume rather than produce knowledge
• No first-class research universities in the Arab world

Challenges (2)
• Even when we report research, we do not use Arabic
• Globalization has intensified the influence of Western culture in the Arab world
• Almost all Arab universities teach science and mathematics in a foreign language

Challenges (3)
• To Arabic NLP
• Inherent properties of the Arabic language
  1. The Arabic script (no short vowels and no capitalization)
  2. Explosion of ambiguity: an average of 2.3 analyses per word in other languages versus 19.2 in Arabic. Example: 22 analyses of "ثمن" by Buckwalter (2004)

Challenges (4)
  3. Complex word structure, e.g. "ورأيتهم" ("and I saw them")
  4. The problem of normalization: folding آ، إ، أ into ا loses the distinction between آن، إن، أن
  5. Arabic as a Pro Drop language

Assumptions
• The Arabic language can meet all the needs of its speakers
• The Arabs were producers of knowledge at a time when the rest of the world were consumers of knowledge
• Contemporary Arab scholars have proved their ability to produce knowledge

Opportunities (1)
Lessons from recent history
• Unprecedented accumulation of knowledge
  1. Dramatic increase in the number of academic publications
  2. Huge investment in R&D companies
  3. Fundamental changes in industry and society similar to the Industrial Revolution
  4. Impressive progress in many fields such as medicine, space exploration, and computer software and hardware development

The Knowledge Economy
• Fundamental aspects of the Knowledge Economy
  1. The strategic product is knowledge rather than manufactured goods
  2. Industrial workers are replaced by knowledge workers
  3. Global labor market
  4. Democratization of knowledge

The Knowledge Economy & NLP (1)
• The age of on-line information, electronic communication, and the World Wide Web (WWW)
• Millions of documents are created every minute: from kilobytes to megabytes to gigabytes to terabytes
• Explosion of knowledge can lead to explosion of ignorance

The Knowledge Economy & NLP (2)
• Democratization of knowledge through the use of the computer/cell phone as a communication tool
• Governments, industry, academia, and individuals desperately need tools to process information
• Information is coded in natural language

The Knowledge Economy & NLP (3)
• Globalization -> multilingual applications such as machine translation and cross-language applications
• Information Retrieval (IR) and Information Extraction (IE) are becoming increasingly important
• Keyword search is being replaced by question answering systems
• Knowledge is encoded in natural language

NLP - Flashback
• The invention of the computer and language (1940s)
  - First application: breaking the Nazis' secret code
  - Second application: Russian-to-English machine translation (Warren Weaver, 1949)

1st Generation of MT
Principles of the first generation
• Capitalized on the fast lookup offered by the computer
• MT is essentially a matter of correctly pairing source language expressions with their target language equivalents
• Trivial reordering of words

Problems with 1st Generation MT
• Naïve concept of language structure
• Heavy reliance on bilingual dictionaries
• No attempt to mimic human translation
• Unrealistic goals and promises

2nd Generation MT (1)
• Principles of the Transfer Approach
Three components
  1. Analysis of the source language (SL)
  2. Transfer of the structure of the SL to the target language (TL)
  3. Generation of target language surface forms

2nd Generation MT (2)
Basic principles
• Linguistic knowledge is essential for the understanding of the source text
• Target specific domains for better translation
• More realistic goals and promises

2nd Generation MT (3)
• Positive developments in NLP technology: chart parsing (Woods 1970), unification grammar (Shieber 1986), definite clause grammar (Pereira 1980)
• Driven by the commercial market: the Georgetown System, the Pan American Health Organization, the EUROTRA Project (Interlingua approach), etc.
• Emergence of lexical approaches to grammar

Problems with 2nd Generation MT
Limitations
• Linguistic knowledge is expensive
• Explosion of syntactic ambiguity (300 parses for each input sentence)
• Needed huge computing power
• Limited successes: the METAL system and the Canadian weather forecast translation system

Statistical Approaches to NLP
• Built on probability theory
• Work well for specific domains
• Rely on training data (machine learning)
• Very fast
• Do not require linguistic knowledge

3rd Generation of MT Systems (1)
Principles
• Relies on the machine learning approach
• Benefits from the existence of huge corpora through the Internet
• Low development cost
• Rapid development time

3rd Generation of MT Systems (2)
• Heavy reliance on parallel corpora at several levels
• Does not require any linguistic knowledge: "Give me enough parallel corpus, and I will give you a machine translation system in hours"
• Represents an empirical approach to language: "The proof is in the pudding" (Manning 2000)
• Unlike the transfer approach, does not attempt to mimic human translators

3rd Generation of MT Systems (3)
Benefited from
• Computers becoming much faster, more powerful and less costly
• Accumulation of huge corpora on the Internet
• Availability of annotated treebanks for training (Linguistic Data Consortium)

Problems with 3rd Generation MT Systems
Limitations
• Performs well when dealing with data similar to the training set
• Performance deteriorates when documents are different from the training set
• There comes a point when adding more training data does not improve performance (the Threshold Problem)

Problems with 3rd Generation MT Systems
• There are domains where data is sparse
• Sometimes the training data itself is noisy (full of errors)
• Does not provide any insight into language, linguistics or the translation process

Arabic NLP Goals
1. Transfer knowledge and technology to the Arab world
2. Modernize and enrich the Arabic language
3. Improve and modernize Arabic linguistics
4. Make information retrieval, extraction, summarization and translation available to the Arab user

Arabic NLP History (1)
• Followed and integrated with mainstream NLP
• 1978-1989
  • Kuwait: Mohammed Al-Sharikh & Nabil Ali (Sakhr)
  • Morocco: Hlaal (1979, 1985) on Arabic morphology
  • Holland: Everhard Ditter on Modern Standard Arabic (MSA)
  • US: the Weidner English/Arabic MT system

Arabic NLP History (2)
• IBM Scientific Centers in Kuwait and Cairo
• France: the Dinar Lexical Data Base, Joseph Dichy
• Language Resources and Human Language Technology work (ELRA/ELDA, Choukry)

Arabic NLP History (3)
• The Language Weaver statistical Arabic-to-English MT system
• The SYSTRAN Arabic-to-English MT system
• The Apptek Arabic-to-English hybrid MT system
• The LDC Arabic Treebank (University of Pennsylvania)

Arabic NLP History (4)
• The Prague Arabic Dependency Treebank
• Arabic entity extraction (Shaalan 2007; Zitouni 2008)
• Arabic Dialects Modeling Project at Columbia University, USA (Diab and Habash, 2007)

Future Directions in Arabic NLP (1)
New attitude toward Arabic grammar
• The need for an explicit description of MSA
Consider the idafa:
  مدير البنك
  ("the bank manager")
  حاد الذكاء
  ("sharp-witted", literally "sharp of intelligence")
  فوق المنزل
  ("above the house")

Future Directions in Arabic NLP (2)
• The first is a noun phrase
• The second is an adjectival phrase
• The third is a prepositional phrase
• Describing all three as idafa is not helpful to Arabic NLP

Future Directions in Arabic NLP (3)
• We need to focus on
  constituency without case endings. Consider:
  قال الرجل أن الوزير قد استدعاه
  ("The man said that the minister had summoned him")
  قال الرجل أن الوزير قد أقاله الرئيس
  ("The man said that the president had dismissed the minister")
  In the first sentence, alwaziir is a subject; in the second, it is an object. In both sentences it is marked accusative.

Future Directions in Arabic NLP (4)
• We need to describe rules for Arabic anaphoric relations
• Subjectless sentences (Pro Drop)
• Discourse analysis
• The Arabic love of nominalization

Future Directions in Arabic NLP (5)
Defining MSA
• Mark the differences between MSA and Classical Arabic (CA)
• New Arabic grammars that acknowledge the heritage while being liberated from its paradigm
• A grammar that is more relevant to Arabic Information Retrieval and Arabic MT

Conclusion (1)
• Arabic NLP can help in transforming Arab societies
• Good progress has been achieved in Arabic NLP
• A more explicit grammar of MSA will enhance and speed up the development of NLP systems
• Arabic needs to be restored as the language of science and research

Conclusion (2)
• Standards of usage need to be enforced to preserve Arabic as the expression of the Arabic identity
• Linguists need to do their homework by writing explicit grammars for discourse analysis, anaphoric relations, syntactic structures, etc.
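
Illustration: the normalization problem
The normalization problem listed under Challenges (4) can be made concrete with a minimal Python sketch. It is not taken from the talk: the helper name normalize_alef and the choice to fold only the three hamza/madda alef forms into bare alef are assumptions made for illustration. The sketch shows three distinct words becoming a single string once the alef variants are folded.

```python
# Illustrative sketch only; normalize_alef is a hypothetical helper,
# not part of any system mentioned in the talk.

# Map alef with madda (U+0622), alef with hamza above (U+0623) and
# alef with hamza below (U+0625) to bare alef (U+0627).
ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",
    "\u0623": "\u0627",
    "\u0625": "\u0627",
})


def normalize_alef(text: str) -> str:
    """Fold the three hamza/madda alef forms into bare alef."""
    return text.translate(ALEF_VARIANTS)


# The three words from the slide: each is an alef variant followed by noon (U+0646).
words = ["\u0622\u0646", "\u0625\u0646", "\u0623\u0646"]
normalized = [normalize_alef(w) for w in words]

distinct_before = len(set(words))
distinct_after = len(set(normalized))
print(distinct_before, distinct_after)  # prints: 3 1
```

Folding the variants helps matching against inconsistently typed text, but, as the slide notes, any downstream analysis then has to recover the lost distinction from context.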