Automatic Full Phonetic Transcription
of Arabic Script with and without
Language Factorization
Based on research conducted by RDI’s NLP group (2003-2009)
http://www.RDI-eg.com/RDI/Technologies/Arabic_NLP.htm
Mohsen Rashwan, Mohamed Al-Badrashiny, and Mohamed Attia
Presented by
Mohamed Attia
Talk hosted by
Group of Computational Linguistics - Dept. of Computer Science
University of Toronto – Toronto - Canada
Oct. 7th, 2009
The Problem of Ambiguity with NLP
• Numerous non-trivial NLP tasks that are handled via rule-based (i.e. language factorizing) methods typically end up with multiple possible solutions/analyses; e.g. morphological analysis, PoS tagging, syntactic analysis, lexical semantic analysis, etc.
• This residual ambiguity arises from our incomplete knowledge of the underlying dynamics of the linguistic phenomenon, and perhaps also from the lack of higher language-processing layers constraining that phenomenon; e.g. the absence of a semantic analysis layer constraining morphological and syntactic analysis.
• Statistical methods are well known to be among the most (if not the most) effective, feasible, and widely adopted approaches to automatically resolving that ambiguity.
Statistical disambiguation of factorized sequences of language entities
Intermediate Ambiguous NLP Tasks
• Sometimes, such ambiguous NLP tasks are not sought for the sake of their own outputs, but as an intermediate step towards inferring another, final output.
• An example is the problem of automatically obtaining the phonetic transcription of a given crude (undiacritized) Arabic text w_1 … w_n, which can be directly inferred as a one-to-one mapping of diacritics onto the characters of the input words. But these diacritics are typically absent in MSA script! For instance, the undiacritized string كتب may be read as كَتَبَ (kataba, "he wrote"), كُتُب (kutub, "books"), or كُتِبَ (kutiba, "it was written").
• The NLP solution to this TTS problem is to indirectly infer the diacritics d_1 … d_n by factorizing the crude input words via morphological analysis, PoS tagging, and Arabic phonetic grammar. Section 2 below reviews these language factorization models.
• However, these language factorization processes are themselves highly ambiguous!
Arabic morphological analysis as an intermediate ambiguous language
factorization towards the target output of the diacritics of i/p words
Why Not Go without Language Factorization Altogether?!
• Some researchers, however, argue that if statistical disambiguation is eventually deployed to get the most likely sequence of outputs, why not go fully statistical, i.e. un-factorized from the very beginning, and give up the burden of rule-based methods?
• For our example, this means that the statistical disambiguation (as well as the statistical language models) is built from manually diacritized text corpora where the spelling characters and their full diacritics are both supplied for each word.
Cannot Cover, but How Accurate and How Fast?
• The obvious answer in many such cases (including that of our example) is the need to overcome the problem of poor coverage when the input language entities are produced via a highly generative linguistic process, e.g. Arabic morphology.
• However, that sound question may be modified so that it enquires about the performance (accuracy and speed) of statistically disambiguating un-factorized language entities (at least the frequent ones that may be covered without factorization) as compared to statistically disambiguating factorized language entities.
• The rest of this presentation discusses 4 issues in this regard:
1- The statistical disambiguation methodology deployed in both cases.
2- The related Arabic NLP factorization models and the architecture of the factorizing system.
3- The architecture of the hybrid (factorizing/un-factorizing) Arabic phonetic transcription system.
4- Results analysis: factorizing system vs. hybrid system, and hybrid system vs. other groups' systems.
1- Statistical Disambiguation Methodology
Noisy Channel Model for Statistical Disambiguation
With maximum a posteriori probability (MAP) criterion:
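\hat{I} = \arg\max_{I} P(I \mid O) = \arg\max_{I} \frac{P(O \mid I)\,P(I)}{P(O)} = \arg\max_{I} P(O \mid I)\,P(I)

(the standard noisy-channel MAP form; the evidence P(O) is constant over I and drops out of the maximization)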
• For our example, O is the sequence of crude Arabic i/p text words.
- In the case of the factorizing system, I is any valid sequence of factorizations, e.g. Arabic morphological analyses (quadruples), and the ^ denotes the most likely one.
- In the case of the un-factorizing system, I is any valid sequence of diacritics, and the ^ denotes the most likely one.
1- Statistical Disambiguation Methodology
Likelihood Probability
In other pattern recognition problems, e.g. OCR and ASR, the term P(O|I), referred to as the likelihood probability, is modeled via probability distributions, e.g. HMMs.
Our language factorization models enable us to do better by viewing the availability of possible structures for a given i/p string - in terms of probabilities - as a binary decision of whether or not the observed string complies with the formal rules of the factorization models. This simplifies the MAP formula into:
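\hat{I} = \arg\max_{I \in R(O)} P(I)

(here P(O \mid I) is taken as 1 whenever I complies with the observed string, and 0 otherwise)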
where R(O) is the part of the factorization model's space corresponding to the observed input string; i.e.
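R(O) = \{\, I : I \text{ synthesizes (generates) the observed string } O \,\}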
• In the case of the factorizing system, I is now restricted to only those factorized sequences that can generate (via synthesis) that input sequence, and the ^ denotes the most likely one.
• In the case of the un-factorizing system, I is a possible sequence of diacritics matching that i/p sequence, and the ^ denotes the most likely one.
1- Statistical Disambiguation Methodology
Statistical Language Models, and Search Space
The term P(I) is conventionally called the (Statistical) Language Model (SLM).
Let us replace the conventional symbol I by Q, which is more convenient for our specific problem.
With the aid of the 1st graph in this presentation, the problem is now reduced to searching for the most likely sequence of q_{i,f(i)}, 1 ≤ i ≤ L, i.e. the one with the highest marginal probability through the following lattice:
This creates a Cartesian search space. An A* search algorithm is guaranteed to exit with the most likely path via two tree-search strategies.
1- Statistical Disambiguation Methodology
Lattice Search, and n-Gram Probabilities
1- Heuristic probability estimation of the rest of the path to be expanded next; this is called the h* function.
combined with
2- Best-first tree expansion of the path with the highest sum of the start-to-expansion probability (the g function) plus the h* function.
It is then required to estimate the marginal probability of any whole/partial possible path in the lattice. Via the chain rule and the attenuating-correlation assumption, this probability is approximated by the formula:
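P(q_1 q_2 \ldots q_L) \approx \prod_{i=1}^{L} P\left(q_i \mid q_{i-h} \ldots q_{i-1}\right)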
where h + 1 is the maximum affordable length of the n-grams in the SLM.
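As an illustration, here is a minimal sketch (in Python, with hypothetical names; a sketch of the two strategies above, not RDI's actual implementation) of best-first A* expansion over such a lattice, assuming log-probabilities and an admissible (optimistic) h* estimate with h_star(L) == 0:

```python
import heapq
import itertools

def a_star_lattice(lattice, log_prob, h_star):
    """Best-first search for the most likely path through a lattice.

    lattice  : list of columns; column i holds the candidate entities
               q_{i,f(i)} for position i.
    log_prob : log_prob(path, cand) -> log P(cand | recent history),
               e.g. a backed-off n-gram score over the last h entities.
    h_star   : h_star(i) -> optimistic estimate of the best achievable
               log-probability of any completion from position i.
    """
    L = len(lattice)
    tie = itertools.count()              # tie-breaker for equal scores
    # Each heap entry: (-(g + h*), tie, path); g is the exact accumulated
    # log-probability, h* optimistically completes it (f = g + h*).
    heap = [(-h_star(0), next(tie), [])]
    while heap:
        neg_f, _, path = heapq.heappop(heap)
        i = len(path)
        if i == L:
            return path                  # first complete pop is optimal
        g = -neg_f - h_star(i)           # recover the exact score so far
        for cand in lattice[i]:          # best-first expansion
            g2 = g + log_prob(path, cand)
            heapq.heappush(heap, (-(g2 + h_star(i + 1)),
                                  next(tie), path + [cand]))
    return None                          # empty lattice
```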
1- Statistical Disambiguation Methodology
Computing Probabilities of n-Grams with Zipfian Sparseness
• These conditional probabilities are primarily calculated via the famous Bayes formula. Due to the Zipfian sparseness, the Good-Turing discount and Katz's back-off techniques are also deployed to obtain smooth distributions as well as reliable estimations of rare and unseen events, respectively.
• While the DB of elementary n-gram probabilities P(q_1 … q_n), 1 ≤ n ≤ h + 1, is built during the training phase, the task of the statistical disambiguation at runtime is rendered to:
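P(q_n \mid q_1 \ldots q_{n-1}) = \frac{P(q_1 \ldots q_n)}{P(q_1 \ldots q_{n-1})}

looked up in that DB, backing off to shorter histories for n-grams unseen in training. A minimal sketch of such a backed-off runtime lookup in the Katz style (hypothetical data structures: the Good-Turing-discounted log-probabilities are assumed precomputed into prob, and the log back-off weights into alpha):

```python
FLOOR_LOGPROB = -99.0   # assumed floor for events unseen even as unigrams

def ngram_logprob(history, q, prob, alpha):
    """Backed-off conditional log P(q | history), Katz style.

    history : tuple of the up-to-h most recent entities on the path.
    prob    : dict {n-gram tuple: Good-Turing-discounted log-probability}.
    alpha   : dict {history tuple: log back-off weight}.
    """
    if not history:                      # base case: bare unigram
        return prob.get((q,), FLOOR_LOGPROB)
    if (*history, q) in prob:            # n-gram observed in training
        return prob[(*history, q)]
    # Unseen n-gram: pay the Katz weight of this history, then retry
    # with the history shortened by its oldest entity.
    return alpha.get(history, 0.0) + ngram_logprob(history[1:], q, prob, alpha)
```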
2- Arabic NLP Factorization Models
Arabic Phonetic Transcription: Problem Definition
Although Arabic is an intensively diacritized language, Modern Standard Arabic (MSA) is typically written by contemporary natives without diacritics!
So, it is the task of the NLP system to accurately infer all the missing diacritics of all the words in the input Arabic text, and also to amend those diacritics to account for the mutual phonetic effects among adjacent words upon their continuous pronunciation.
2- Arabic NLP Factorization Models
Challenges of Arabic Phonetic Transcription
• Modern Standard Arabic (MSA) is typically written without diacritics.
• MSA script is typically full of common spelling mistakes.
• The extreme derivative and inflective nature of Arabic necessitates treating it as a morpheme-based rather than a vocabulary-based language. The size of the generable Arabic vocabulary is in the order of billions!
• One (or more) of the diacritics in about 65% of the words in Arabic text depends on the syntactic case ending of the word.
• Lexical and syntactic grammars alone produce a high average number of possible solutions at each word of the text (high ambiguity).
• 7.5% of open-domain Arabic text consists of transliterated words, which lack any constraining Arabic model. Moreover, many of these words are confusingly analyzable as normal Arabic words!
2- Arabic NLP Factorization Models
The Ladder of NLP Layers; Undiscovered Levels
• Theoretically speaking, NLP problems should be combinatorially tackled at all the NLP layers, which is yet far beyond the reach of the current state of the art.
• Moreover, NLP researchers have not yet developed firm knowledge at all the NLP layers.
2- Arabic NLP Factorization Models
Language Factorizations Deployed for Solving the Problem
• Arabic morphological analysis (with statistical disambiguation) is deployed to retrieve the syntax-independent lexical phonetic info of each input Arabic word from its building morphemes.
• Arabic PoS tagging (along with morphological analysis) is deployed to statistically infer the most likely syntax-dependent (case-ending) phonetic info of the i/p Arabic words.
• For transliterated (foreign) words, an intra-word Arabic phonetic grammar is deployed to constrain the statistical search for the most likely diacritization matching the spelling of each input transliterated word.
• An inter-word Arabic phonetic grammar is deployed (synthetically) to phonetically concatenate fully diacritized adjacent words of all kinds.
The Architecture of the Factorizing Arabic Phonetic Transcription System
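As a rough illustration of this architecture, here is a minimal pipeline sketch (in Python, with all stage names hypothetical and each stage passed in as a callable; a sketch of the flow implied by the factorizations listed above, not RDI's actual design):

```python
def factorizing_transcribe(words, analyze, disambiguate,
                           case_ending_disambiguate,
                           diacritize_transliterated,
                           interword_concatenate):
    """Pipeline sketch of a factorizing phonetic transcriptor.

    Each stage is injected as a callable so the sketch stays
    self-contained; the stages mirror the factorizations above.
    """
    # 1- Morphological analysis: a lattice of candidate analyses per word.
    lattice = [analyze(w) for w in words]
    # 2- Statistical (A*) disambiguation of the lexical lattice
    #    -> the syntax-independent diacritics of each word.
    analyses = disambiguate(lattice)
    # 3- PoS tagging + case-ending disambiguation
    #    -> the syntax-dependent (case-ending) diacritics.
    diacritized = case_ending_disambiguate(analyses)
    # 4- Transliterated (foreign) words: grammar-constrained
    #    phoneme-level statistical search.
    diacritized = diacritize_transliterated(diacritized)
    # 5- Inter-word phonetic grammar: phonetically concatenate adjacent
    #    words into the final full phonetic transcription.
    return interword_concatenate(diacritized)
```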
2- Arabic NLP Factorization Models
Arabic Morphological Structure: Morphemes
• Arabic is a highly derivative and inflective language whose words can be decomposed into a relatively compact set of morphemes.
• Our Arabic morphological model acknowledges the following morphemes (a word consists of a prefix P, a body, and a suffix S; the body is either derivative, or non-derivative: fixed or Arabized):
- P: 260 prefixes.
- Derivative body: Rd: 4,600 derivative roots; Frd: 1,000 regular derivative patterns; Fid: 300 irregularly derived words.
- Fixed (non-derivative) body: Rf: 260 roots of fixed words; Ff: 300 fixed words.
- Arabized (non-derivative) body: Ra: 240 roots of Arabized words; Fa: 290 Arabized words.
- S: 550 suffixes.
2- Arabic NLP Factorization Models
Arabic Morphological Structure: Lexicon
A comprehensive Arabic lexicon has been built to be the repository of the linguistic (orthographic, phonological, morphological, syntactic) description of each Arabic morpheme, where all the possible mutual interactivities with other morphemes are registered as extensively as possible in a compact structured format.
This lexicon is the core of all our language factorizations.
2- Arabic NLP Factorization Models
Canonical Structure of Arabic Morphology
w → q = (t : p, r, f, s)
i.e. each input word w maps to a morphological analysis q: a quadruple of prefix p, root r, form (pattern) f, and suffix s, under a morphological type t (derivative, fixed, or Arabized).
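A minimal sketch of this canonical structure as a data type (hypothetical names, assuming the quadruple reading above):

```python
from dataclasses import dataclass
from enum import Enum

class WordType(Enum):
    # the three body categories of the morpheme taxonomy
    DERIVATIVE = "derivative"
    FIXED = "fixed"
    ARABIZED = "arabized"

@dataclass(frozen=True)
class MorphAnalysis:
    """One candidate analysis q = (t: p, r, f, s) of an input word w."""
    t: WordType   # morphological type of the word body
    p: str        # prefix morpheme (one of ~260)
    r: str        # root morpheme
    f: str        # form/pattern morpheme
    s: str        # suffix morpheme (one of ~550)
```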
The Multiplicity of Possible Arabic Lexical Analyses
2- Arabic NLP Factorization Models
The Arabic Lexical Disambiguation Lattice
After this process we obtain the diacritization of each Arabic word, except for the case-ending diacritics.
2- Arabic NLP Factorization Models
The Arabic Case Endings Disambiguation Lattice
After this process we obtain the case-ending diacritics of each Arabic word.
2- Arabic NLP Factorization Models
Inferring the Diacritization of Transliterated Words
• Foreign names and terminology frequently appear as transliterated Arabic strings in real-life Arabic text, at a rate of approx. 7.5% (≈ 1/14).
• These words are not constrained by the Arabic morphological or syntactic models.
• A look-up-table-based approach is not a viable solution due to:
- Its lack of completeness and bad coverage.
- Its lack of tolerance to spelling variance.
- Its inability to attach Arabic infixes.
- Its lack of guarantee of compliance with Arabic phonology.
and, above all:
- The time-variant nature of this problem.
• Our approach was then to go statistical at the phoneme level; however, on its own this would generate too wide a search space (too high a perplexity) to get good results.
• To limit the search space, we constrain the search with another NLP model at the phonology layer: the intra-word Arabic phonetic grammar. (A sketch of this constraining idea follows.)
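A minimal sketch of that constraining idea (hypothetical interfaces, not RDI's actual grammar): the per-letter diacritic assignments of a transliterated word are enumerated, and only those accepted by an intra-word phonetic grammar predicate survive into the statistical disambiguation lattice:

```python
from itertools import product

# Placeholder diacritic symbols (short vowels and vowel absence);
# a real system would use the actual Arabic diacritic inventory.
DIACRITICS = ["a", "u", "i", "0"]

def candidate_diacritizations(word, grammar_accepts):
    """Yield per-letter diacritic assignments for a transliterated word,
    keeping only those the intra-word phonetic grammar accepts.

    word            : the undiacritized spelling (sequence of letters).
    grammar_accepts : predicate over a list of (letter, diacritic) pairs,
                      encoding the intra-word Arabic phonetic grammar.
    """
    for marks in product(DIACRITICS, repeat=len(word)):
        cand = list(zip(word, marks))
        if grammar_accepts(cand):     # grammar prunes illegal sequences
            yield cand                # survivor enters the search lattice
    # NB: a practical system would prune prefix-by-prefix inside the A*
    # search rather than enumerate the full Cartesian product.
```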
2- Arabic NLP Factorization Models
Disambiguation Lattice of Transliterated Words
After this process we obtain the full diacritics of each transliterated word.
2- Arabic NLP Factorization Models
Intra Word Arabic Phonetic Grammar
3- The Hybrid Factorizing/Un-factorizing Transcriptor
Adding the Un-factorizing Phonetic Transcriptor
• The un-factorizing diacritizer simply tests the spelling of each input word against a dictionary of final-form words, i.e. a vocabulary list.
• The possible diacritizations of each word in a sequence of input words (henceforth called a "segment") that are all covered by that dictionary are directly retrieved without any language factorization. The resulting diacritizations lattice of each segment is then statistically disambiguated.
• Uncovered segments (along with the disambiguated diacritizations of the covered segments) are then sent to the factorizing transcriptor for inferring the most likely diacritization of the uncovered segments, as well as for phonetically concatenating the words in all segments. (A sketch of this flow follows.)
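A minimal sketch of that hybrid flow (hypothetical function names; a sketch of the dispatch described above, not RDI's actual code):

```python
def hybrid_transcribe(words, dictionary, disambiguate, factorizing_transcriptor):
    """Dictionary-covered segments are disambiguated directly; the
    factorizing transcriptor then fills in the uncovered segments and
    phonetically concatenates the words of all segments.

    dictionary   : final-form word -> list of possible diacritizations.
    disambiguate : statistical disambiguation over a segment's lattice.
    """
    # Split the input into maximal runs ("segments") of covered vs.
    # uncovered words.
    segments = []
    for word in words:
        covered = word in dictionary
        if segments and segments[-1][0] == covered:
            segments[-1][1].append(word)
        else:
            segments.append((covered, [word]))

    partial = []
    for covered, seg in segments:
        if covered:
            # Lattice of retrieved diacritizations, disambiguated
            # without any language factorization.
            lattice = [dictionary[w] for w in seg]
            partial.append(("diacritized", disambiguate(lattice)))
        else:
            partial.append(("raw", seg))   # left for the factorizing stage

    # The factorizing transcriptor diacritizes the raw segments and applies
    # the inter-word phonetic grammar across all segment boundaries.
    return factorizing_transcriptor(partial)
```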
3- The Hybrid Factorizing/Un-factorizing Transcriptor
The Architecture of the Hybrid Transcriptor
4- Results Analysis
Experimental Evaluation of both Architectures
Two sets of experiments and result analyses have been performed to evaluate our Arabic phonetic transcription work:
• Experiments comparing the performance of the purely factorizing architecture with that of the hybrid factorizing/un-factorizing one.
• Experiments comparing the performance of the better of our two architectures with the best reported systems produced by rival R&D groups.
While the first set of experiments shows the hybrid architecture to outperform the purely factorizing one, the second set shows our hybrid system to be superior to those of our rival groups.
4- Results Analysis
Comparing with Best Rivals; Experimental Setup
The two best rival systems reported in the published literature on the problem of full automatic Arabic phonetic transcription are:
• N. Habash & O. Rambow's group at Columbia Univ., whose architecture is a language factorizing one, with the Support Vector Machine Tool (SVMTool) as the statistical modeling/disambiguation tool. They also build an open-vocabulary SLM with Kneser-Ney smoothing using the SRILM toolkit. (2007)
• I. Zitouni, J. S. Sorensen & R. Sarikaya's group at IBM's WRC, whose architecture is also a language factorizing one, with a Maximum Entropy statistical modeling/disambiguation framework. (2006)
Both groups evaluated their performance by training and testing their systems on LDC's Arabic Treebank of diacritized news stories (LDC2004T11; text part 3, v1.0), published in 2004.
This Arabic text corpus, which includes a total of 600 documents (≈340K words) of An-Nahar (Lebanese) newspaper text, is split into training data of ≈288K words and test data of ≈52K words.
4- Results Analysis
Comparing with Best Rivals; Experimental Results
In order to obtain a fair comparison with the work of Habash & Rambow's group and with Zitouni et al.'s group:
• We used the same aforementioned training and test corpus from LDC's Treebank.
• We adopted the same error-counting metrics while evaluating our hybrid system vs. theirs.
As each of the other two groups deploys more sophisticated statistical tools than ours, one can attribute the superior performance of ours to hybridizing the un-factorizing transcriptor with the factorizing one in our system architecture.
4- Results Analysis
Comparing the Factorizing to the Hybrid Architecture;
Experimental Setup
It is very insightful not only to know how much better the hybrid transcriptor is compared to the purely factorizing one, but also to know how the error margin evolves in both cases as the size of the annotated training text corpora increases.
To this end, a domain-balanced annotated Arabic training corpus with a total size of 3,250K words has been developed (over years), where manually supervised full Arabic morphological analysis and diacritization have been applied to every word.
Another domain-balanced (tough) test set of 11K words has also been prepared in both annotated and un-annotated formats.
At approx. log-scale steps of the size of the training corpora, the statistical models (with the same equivalent h) have been built, and the following metrics have been measured for each of the two architectures:
• Error margin.
• Average execution time per query.
• Average size of the SLMs.
4- Results Analysis
Comparing the Factorizing to the Hybrid Architecture;
Experimental Results
• Both systems asymptote to the same irreducible error margin.
Justification: Despite being put in two different formats, the SLMs of both systems are built from the same data and hence have the same information content.
• The hybrid system has a faster learning curve than the purely factorizing one.
Justification: The un-factorizing component suggests fewer candidate diacritizations (by looking up the dictionary) than the factorizing component (which generates all the possibilities), which in turn leads to less ambiguity. Due to the Zipfian distribution of natural language, a small dictionary (built from small training data) can quickly capture the frequent words.
4- Results Analysis
Comparing the Factorizing to the Hybrid Architecture;
Experimental Results (cont’d)
• The hybrid system has been found to be approx. twice as fast as the purely factorizing one in terms of the avg. execution time per transcription query.
Justification: The purely factorizing system needs time for the extra language factorizations, while the hybrid system's slimmer lattice needs less A* search time.
• The storage needed for the SLMs of the un-factorizing system has been found to be 8 times smaller (on avg.) than that of their equivalent counterparts in the purely factorizing one.
N.B. The storage needed for the SLMs of the hybrid system is the sum of those needed for the factorizing and un-factorizing components.
Justification: Extra space is needed to store many more lower-order n-grams in the factorizing system than in the un-factorizing one.
Relevant Publications by: I- Competing Groups
(Columbia Univ. group)
- N. Habash, O. Rambow, Arabic Diacritization through Full Morphological
Tagging, Proceedings of the 8th Meeting of the North American Chapter of the
Association for Computational Linguistics (ACL); Human Language Technologies
Conference (HLT-NAACL), 2007.
(IBM group)
- I. Zitouni, J. S. Sorensen, R. Sarikaya, Maximum Entropy Based Restoration of Arabic Diacritics, Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL); Workshop on Computational Approaches to Semitic Languages; Sydney, Australia, July 2006; http://www.ACLweb.org/anthology/P/P06/P06-1073.
Relevant Publications by: II- Our Group (RDI’s)
1- Rashwan, M., Al-Badrashiny, M., Attia, M., Abdou, S., Rafea, A., A Stochastic
Arabic Diacritizer Based on a Hybrid of Factorized and Un-factorized Textual Features,
IEEE Transactions on Audio, Speech, and Language Processing (TASLP)
http://www.SignalProcessingSociety.org/Publications/Periodicals/TASLP. (Accepted
but not published yet)
2- Rashwan, M., Al-Badrashiny, M., Attia, M., Abdou, S., Rafea, A., A Stochastic
Arabic Hybrid Diacritizer, 2009 IEEE International Conference on Natural Language
Processing and Knowledge Engineering (IEEE NLP-KE'09);
http://caai.cn:8080/nlpke09/, Dalian-China, Sept. 2009.
3- Al-Badrashiny, M., Automatic Diacritization for Arabic Texts, M.Sc. thesis, Dept. of
Computer Engineering, Faculty of Engineering, Cairo University, June 2009:
http://www.rdi-eg.com/rdi/Downloads/ArabicNLP/Mohamed-Badashiny_MScThesis_June2009.pdf.
Relevant Publications by: II- Our Group (RDI’s) “Cont’d”
4- Attia, M., Rashwan, M., Al-Badrashiny, M., Fassieh; a Semi-Automatic Visual
Interactive Tool for the Morphological, PoS-Tags, Phonetic, and Semantic Annotation
of the Arabic Text, IEEE Transactions on Audio, Speech, and Language Processing
(TASLP) http://www.SignalProcessingSociety.org/Publications/Periodicals/TASLP: Special Issue on Processing Morphologically Rich Languages, Vol. 17, Issue 5, pp. 916-925, http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=5067414&arnumber=5075778&count=21&index=6, July 2009.
5- Rashwan, M., Al-Badrashiny, M., Attia, M., Abdou, S., A Hybrid System for
Automatic Arabic Diacritization, The Proceedings of the 2nd International Conference
on Arabic Language Resources and Tools, Cairo - Egypt
http://www.MEDAR.info/Conference_All/2009/index.php, Apr. 2009.
6- Attia, M., Theory and Implementation of a Large-Scale Arabic Phonetic
Transcriptor, and Applications, PhD thesis, Dept. of Electronics and Electrical
Communications, Faculty of Engineering, Cairo University,
http://www.rdi-eg.com/rdi/technologies/papers.htm, Sept. 2005.
7- Attia, M., A Large-Scale Computational Processor of the Arabic Morphology, and
Applications, M.Sc. thesis, Dept. of Computer Engineering, Faculty of Engineering,
Cairo University, http://www.rdi-eg.com/rdi/technologies/papers.htm, Jan. 2000.
Conclusions
I- A given statistical disambiguation technique operating on either factorized or un-factorized sequences of linguistic entities asymptotes to the same disambiguation accuracy at infinitely large sizes of annotated training corpora.
II- Disambiguating un-factorized sequences is easier to develop, computationally faster, and seems to have a faster "accuracy vs. training corpora size" learning curve.
III- With highly generative linguistic phenomena (e.g. Arabic morphology), language factorization is necessary to handle the problem of coverage.
IV- On the other hand, language factorization costs much R&D effort, and is also more computationally expensive.
V- In such cases, optimal systems can be built as a hybrid of the two approaches, so that the factorizing mode is resorted to only if some un-factorized entities in the i/p sequence are OOV.
Thank you for your attention.
To probe further, please visit:
http://www.RDI-eg.com/RDI/Technologies/Arabic_NLP.htm
You may also contact:
- Prof. Mohsen Rashwan: [email protected]
- Dr. Mohamed Attia: [email protected]