Cooperation for Arabic Language Resources
and Tools – The MEDAR Project
Bente Maegaard, Mohamed Attia, Khalid Choukri,
Olivier Hamon, Steven Krauwer, Mustafa Yaseen
Presented by: Bente Maegaard, University of Copenhagen,
Co-ordinator of MEDAR
MEDAR: Background and mission
Mission
• Support the development of language technology, language
resources and tools for the Arabic language
• Important for the people, the economy and the culture in
the Arab countries
But current efforts are too small and too fragmented
• MEDAR is funded by the European Commission, and
focuses on the Mediterranean area, but our scope for
collaboration is much broader – all Arab countries, all
continents – and we also want to include other Semitic
languages in the future.
MEDAR partners
• University of Copenhagen,
Denmark (coord.)
• ELDA, France
• University of Balamand,
Lebanon
• Al-Ahlyya Amman University,
Jordan
• Universiteit Utrecht,
The Netherlands
• ILSP - Athena, Greece
• RDI, Egypt
• Birzeit University,
West Bank and Gaza Strip
• ENSIAS, University of
Mohammed V Soussi, Morocco
• CEA, France
• CNRS, France
• The Open University, United
Kingdom
• Université Lumière Lyon 2,
France
• IBM, Egypt
• Sakhr, Egypt
MEDAR Objectives and ‘streams’
1) Technical stream
• Survey of players, projects, products
• BLARK for Arabic
• Focus on multilingual tools, develop MT
2) Roadmap stream
• Cooperation roadmap
• Network creation
3) Dissemination stream
Multilingual sub-project
• Focus: Machine Translation
• English-Arabic
• Into Arabic
• Important to use Open Source
• Education and training
MT system, corpora
• MOSES was chosen as the MT system
• Wide community
• English-Arabic experiments already exist
• Previous experience of consortium partners
• Basic MOSES system developed by Balamand
• Enhanced system provided by IBM Cairo and Dublin
City University.
• Partners collected parallel corpus, monolingual
corpora
Evaluation - 1
Automatic evaluation
• 10,000-word evaluation corpus
• Embedded in a 200,000-word masking corpus
• Four human translations have been produced and
validated
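The setup above can be sketched in code. This is a minimal illustration, not the MEDAR evaluation package: the slides do not name the automatic metric, so the clipped n-gram precision used here (the multi-reference building block of BLEU) is an assumption, and the function names `build_masked_test_set`, `clipped_precision` and `score_submission` are hypothetical. The sketch hides the evaluation segments among masking segments, keeps a private key of their positions, and scores only those positions against all available human translations.

```python
import random
from collections import Counter


def build_masked_test_set(eval_items, masking_segments, seed=0):
    """Hide evaluation segments among masking segments.

    eval_items: list of (source_segment, [reference translations]) pairs.
    masking_segments: list of source segments that will not be scored.
    Returns the shuffled source list sent to participants and a private
    key mapping positions of real evaluation segments to their references.
    """
    tagged = [(src, refs) for src, refs in eval_items]
    tagged += [(src, None) for src in masking_segments]
    random.Random(seed).shuffle(tagged)
    sources = [src for src, _ in tagged]
    key = {i: refs for i, (_, refs) in enumerate(tagged) if refs is not None}
    return sources, key


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, references, n=1):
    """Share of hypothesis n-grams found in any reference.

    Each n-gram is credited at most as often as it occurs in the single
    reference where it is most frequent (clipping).
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return matched / sum(hyp_counts.values())


def score_submission(outputs, key, n=1):
    """Average clipped precision over the hidden evaluation positions only."""
    scores = [clipped_precision(outputs[i], refs, n) for i, refs in key.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Because participants translate every segment without knowing which ones are scored, they cannot tune a system to the evaluation part alone. Having several human translations per segment lets a correct but differently worded system output still receive credit.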
Human evaluation
Evaluation - 2
• Second evaluation campaign will take place in June
• External participants have been invited and expressed
interest
Resources for the community
• MT systems: the baselines developed in the project
will be made publicly available under the
original licenses (MOSES, GIZA++, ...)
• Training data, through ELRA, fair conditions
• Evaluation package, through ELRA, fair conditions
Cooperation roadmap
Roadmap concept
• Set goals
• Define the steps to get there
• Define timeline
The MEDAR roadmap covers 3 periods
• 2010-2012
• 2012-2014
• 2013-2015
Elements of the roadmap
• Players and human resources, education
• Technology and R&D
• E-infrastructure: internet penetration, mobile
penetration
• Market
A few examples are presented here; for the rest, please refer to the
booklet
Players and human resources,
Education
Players need a skilled workforce, but there are not enough HLT experts
• We need HLT-enabled professionals
• Typically one could add
• Linguistics, phonetics, language or speech processing – to
engineers’ education
• Computing, machine learning, language or speech
processing – to linguists’ education
• Do this in collaboration with other universities in the
region, and with e.g. universities in Europe or the US
Players and human resources,
Education - 2
• Staff exchange
• Student grants
• Participation of (more) Arabic partners in EU-funded
projects
MEDAR has chosen this as an area to investigate
further
Partners will elaborate a cooperation scheme
Technology
• BLARK - Basic building blocks: LR and tools
• Reusable
• Can be shared with other players
• Follow standards
• We need more resources and tools for Semitic
languages, and they need to be shared, free or at low cost.
• Essential for education, research and first
development
Technology - 2
Driving applications
• Fight illiteracy through HLT – speech-enabled software, etc.
• Collaborate to make this happen
• Governments could introduce eGovernment etc.
• Many basic technologies are needed
• Discussion ongoing with other parties
• Agree what they are
• Agree on distribution of tasks, if possible
E-infrastructure - Internet users
Penetration rates
Market
Important factors
• Piracy (38% worldwide, 60% in the Middle East and
Africa)
• Fight piracy – this is ongoing
• Provide IT services, not products which can be
copied
Conclusions
• Long-term goal of MEDAR
• Create better conditions for the development of language
and speech technology for Arabic – in order to support the
people, the culture, the economy
• Through collaboration and networking
• Therefore we welcome all comments and invite broad
cooperation
• Not only for Arabic, but also for other Semitic languages
• And also with partners outside the EU/Mediterranean
Arab countries
MEDAR: Mediterranean Arabic Language
and Speech Technology
Acknowledgement: All MEDAR partners
See the full Roadmap report and other information at
www.medar.info
MEDAR LREC 2010