Egyptian Ministry of Communications and Information Technology
Research and Development Centers of Excellence Initiative
Data Mining and Computer Modeling Center of Excellence
Arabic Text Mining Project Presentation
By Prof. Mohsen A. A. Rashwan; Cairo University, RDI
&
Dr. Mohamed Attia; RDI
Formation
• EMCIT launched the Centers-of-Excellence initiative in an attempt to
establish slim, focused, responsive, and effective R&D bodies in vital
modern areas of advanced CIT, free of the bureaucracy of the bulkier
conventional institutions.
• EMCIT started with the Data Mining & Computer Modeling CoE; other
centers for Mobile Computing, Micro-Electronics, …, are to follow.
• The Data Mining CoE is now up and running with 5 major projects:
Arabic Text Mining, Basic DM Research, Tourism, e-Health, and
Oil & Gas.
• The staff of the Text Mining project is a selected group of (so far)
27 of the brightest professors, graduate researchers, and engineers
specialized in Computer Science, Computational Linguistics, and
Classical Linguistics. They come from both academia and the private IT
sector.
Need, Challenge, Edge, and Capability
• The strategic move towards CIT as a firm basis for a modernized
economic infrastructure in Egypt makes it clear why Data Mining in
general, and Text Mining in particular, emerge as an R&D priority in Egypt.
• As mountains of Arabic text documents have accumulated over the
years, the knowledge contained in these treasures is badly sought as
the basis of sound decision making in virtually all kinds of vital
activities.
• The novelty of the TM paradigm and the sophisticated specifics of the
Arabic language, which is 1600+ years old and spoken natively by
about 6% of the world population, together present the non-trivial
challenge of developing effective Arabic Text Mining tools & applications.
• In addition to the well-chosen HR devoted to such a task, we think we
have an edge in this area as native specialists in Arabic NLP with good
past experience in such projects; e.g. the Euro-Med. project NEMLAR;
www.NEMLAR.org
Arabic NLP infrastructure, Text Mining tools, and Applications
Phenomenon, Challenge, and Solution
• Phenomenon: Arabic is a highly derivative and inflective language with
tremendous vocabulary generation capabilities. Billions of full-form
words are possible!
• Challenge: This makes the various kinds of stochastic methodologies
deployed in language-independent Text Mining tools perform more poorly
when applied to full-form Arabic text than to less inflective and
derivative languages (e.g. English), due to higher dimensionality and
more diluted correlations.
• Solution: Our approach is to replace the surface target text with
effective types of Text Factorization that both reduce dimensionality
and concentrate correlations in the resulting sequences, compared with
the (original) surface text.
Finding and deploying effective language factorization(s) with those
two features markedly helps whatever kind of statistical machine
learning methodology is used for text mining applications on Arabic text
(or on similar languages).
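The dimensionality argument above can be illustrated with a minimal sketch. It assumes a tiny, hand-picked inventory of prefixes, stems, and suffixes in simplified transliteration; the names PREFIXES, STEMS, SUFFIXES, surface_forms, and factorize are hypothetical and are not the project's lexical model or analyzer. It only shows that surface vocabulary grows multiplicatively while the factor vocabulary grows additively, and that counts concentrate on shared factors.

```python
from collections import Counter
from itertools import product

# Hypothetical toy inventory (simplified transliteration, for illustration only).
PREFIXES = ["wa", "bi", "li"]        # roughly: "and", "with/by", "for"
STEMS = ["kitab", "qalam", "bayt"]   # roughly: "book", "pen", "house"
SUFFIXES = ["hu", "ha", "hum"]       # roughly: "his", "her", "their"


def surface_forms():
    """Map every concatenated full-form word to its (prefix, stem, suffix)."""
    return {p + s + x: (p, s, x) for p, s, x in product(PREFIXES, STEMS, SUFFIXES)}


def factorize(tokens, table):
    """Replace each surface token by its sequence of factors."""
    factors = []
    for token in tokens:
        factors.extend(table[token])
    return factors


table = surface_forms()
surface_vocab = set(table)
factor_vocab = {f for parts in table.values() for f in parts}
print(len(surface_vocab), len(factor_vocab))  # 27 9

# A corpus in which every full-form word occurs exactly once.
corpus = sorted(table)
surface_counts = Counter(corpus)
factor_counts = Counter(factorize(corpus, table))
print(max(surface_counts.values()))  # 1: every surface form is a singleton
print(factor_counts["kitab"])        # 9: evidence concentrates on the stem
```

With 3 items per slot the gap is 27 against 9; with realistic inventory sizes the same multiplicative-versus-additive effect is what makes full-form statistics sparse.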
Arabic Language Factorization
• Arabic lexical factorization, Part-of-Speech tagging, and lexical
semantic factorization are, we think, the kinds of text factorization
of special relevance to text mining.
• A simple, regular, and comprehensive Arabic lexical model with a
compact set of morphemes has been designed and shown to cover the
lexical sophistication of the Arabic language.
• An Arabic lexicon, lexical analyzer, and PoS tagger have been built
according to this model and deployed in many applications, where they
proved effective.
• A knowledge base that maps the Arabic lexicon to (tokenized) semantic
fields has been built.
• The standard semantic relations (synonymy, antonymy, …, etc.) among
our set of semantic fields, along with the lexical semantic analyzer
based on them, are being perfected over the rest of the TM project
lifetime.
• In fact, that lexical → semantic knowledge base maps minimally
constrained lexical compounds (not final-form words) to semantic fields,
which gives the best chance of a maximum hit ratio as well as the least
ambiguous lexical semantic factorization of input Arabic text.
• In all the aforementioned types of Arabic text factorization,
considerable ambiguity arises in different phases of analysis.
Disambiguation is done through statistical methods operating on
stochastic models built by supervised training.
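As one hedged illustration of statistical disambiguation with a supervised model, the sketch below trains a bigram tag model on a tiny hand-tagged corpus and uses Viterbi search to pick the most likely tag sequence. The technique (an HMM-style tagger with add-one smoothing), the toy corpus TRAIN in simplified unvocalized transliteration, the tag set, and the function names are all assumptions made for illustration; the slides do not state which models the project actually uses.

```python
import math
from collections import Counter, defaultdict

START = "<s>"

# Hypothetical hand-tagged toy corpus. The unvocalized token "ktb" is
# ambiguous: it is tagged VERB ("wrote") in one sentence, NOUN ("books") in others.
TRAIN = [
    [("ktb", "VERB"), ("Alwld", "NOUN"), ("Aldrs", "NOUN")],
    [("qrA", "VERB"), ("Alwld", "NOUN"), ("ktb", "NOUN")],
    [("fy", "PREP"), ("ktb", "NOUN")],
]


def train(sentences):
    """Collect tag-transition counts, emission counts, and candidate tags per word."""
    trans = defaultdict(Counter)    # trans[prev_tag][tag]
    emit = defaultdict(Counter)     # emit[tag][word]
    candidates = defaultdict(set)   # candidates[word] = tags seen in training
    for sentence in sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            candidates[word].add(tag)
            prev = tag
    return trans, emit, candidates, sorted(emit), set(candidates)


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of key under the given counts."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def disambiguate(words, model):
    """Viterbi search over the candidate tags of each word."""
    trans, emit, candidates, tags, vocab = model
    best = {START: (0.0, [])}  # tag -> (log score of best path, path so far)
    for word in words:
        options = candidates.get(word) or tags  # unseen word: allow every tag
        new_best = {}
        for tag in options:
            emission = log_prob(emit[tag], word, len(vocab))
            new_best[tag] = max(
                (score + log_prob(trans[prev], tag, len(tags)) + emission, path + [tag])
                for prev, (score, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]


MODEL = train(TRAIN)
print(disambiguate(["ktb", "Alwld", "Aldrs"], MODEL))  # ['VERB', 'NOUN', 'NOUN']
print(disambiguate(["fy", "ktb"], MODEL))              # ['PREP', 'NOUN']
```

The same surface token receives different tags depending on its context, which is the kind of ambiguity resolution the bullet above describes; a real system would train on a much larger annotated corpus.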
Thanks for your attention.
To probe further:
[email protected]
&
[email protected]