Focus: linguistic data
What is ‘Linguistic Data’?
• Printed words - in different
scripts, fonts, platforms &
• Domain-specific texts (e.g.
90-odd ones in current Indian
languages corpora)
• Samples of Spoken Corpus –
telephone talk, public
lectures, formal discussions,
in-group conversations, radio
talks, natural language
queries, etc.
• Hand-written samples
• Ritualistic Use of languages –
scriptures, chanting, etc.
• Language of Performance Reading, recitations,
But this data is of
use only if it comes
with linguistic
Because it must be tagged
and aligned to be of use
How the Idea of an LDC Came
The Brown University text corpus was adopted to build statistical language
TI-46 & TI DIGITS databases, of Texas Instruments (early 80's) distributed by
The LDC at U-Penn was established in 1992.
CIIL houses 45 million Word Corpora in 15 Indian languages with
DoE-TDIL support. CIIL has been distributing it to R&D groups
the world over.
Now converted into UNICODE jointly with the U of Lancaster
and with another 45 million word Corpora from five Indian
languages under Emille project coming in, it has been
released in early 2004.
CIIL is now working with the University of Uppsala on corpora
of lesser-known languages of India; See www.ciiluppsalaspokencorpus.net
•The giant strides in IT that India has made.
•Because demands were made by several
Software and Telecom giants – Reliance,
IBM, HPLabs, Modular Systems & Infosys.
•Due to suggestions of the Hindi Committee
•As decided in the 1st ILPC meeting, 2004.
Proposal evolved through discussion held
with many Institutions in India and abroad.
August 13, 2003: 1st presentation at the
MHRD, with the then ES in the chair, and
FA, AS, J.S.(L), Director (L) and experts
from C-DAC and IIT-Kanpur.
August 17 and 18, 2003: An International Workshop on
LDC was held at the CIIL, Mysore in collaboration with
IIIT-Hyderabad and HPLabs, India. It was inaugurated by
Smt. Kumud Bansal (the then AS & now Secretary,
Elementary Ed), and attended by the J.S. (L). Those who
created LDC in USA had participated.
August 19, 2003: a follow up meeting of a smaller group was held
at the Indian Institute of Science to thrash out further details. A
Project Committee was set up.
• The
Committee had top NLP
specialists and linguists
with the Director CIIL as
the Coordinator.
• Five experts from IIT-B,
IIT-M, IISc, IIIT- Hyd, &
CIIL with inputs from the
• All changes were made
through email chats and
exchanges, and after four
during Sept-Oct, 2003.
• Nov 18,’03: Modified
proposal submitted.
• Dec 19, 2003:
representatives of lead
Institutes met in
Mysore to discuss the
draft sent to the
Ministry. Prof. Aravind
Joshi also participated.
• January, 2004: With
additional inputs, the
proposal was modified.
• Feb 24, '04: A number
of suggestions made
(see minutes) during
the 2nd Presentation for
ES, AS, JS(L), & IFD.
• April 16, 2004: After the
TDIL Advisory Comm.,
DoE offers full support.
• The importance of creation of a large data archive of Indian languages is undeniable. In
fact, it is this realization that resulted in
government’s plan for corpora development in
early ’90s.
• Indian languages often pose a difficult
challenge for the specialists in AI/NLP.
• The technology developers building mass-application tools/products, have for long been
calling for availability of linguistic data on a
large scale.
• However, the data should be collected,
organized and stored in a manner that suits
different groups of technology developers.
• These issues require us to involve a number of
disciplines like linguistics, statistics, & CS.
• Further, this data must be of high quality with
defined standards.
• Resources must be shared, so that all R&D
groups are benefited.
• All these are possible with a data consortium.
Spoken language data &
importance of phoneticians
• Numerous Indian languages,
each with so many sound
patterns identified/studied by
phoneticians for centuries.
• The inventory of IPA is
invaluable for spoken language
corpus, but their identification
from speech data requires
• For speech technology, we have
to create both phonetics/
acoustics models of languages
• Even when it is now aided and
eased by Visual Phonetics
technology, as available in CIIL
or TIFR labs, what we need in
addition is trained phoneticians.
•An ideal model of Consortium could be
seen if we consider the Linguistic Data
Consortium (LDC) hosted by the
University of Pennsylvania.
•LDC (USA) is an open consortium of
universities, companies & government
R&D labs that creates, collects and
distributes speech and text databases,
lexicons, and other resources for R&D.
• This ‘LDC’ has 100 plus agencies as its active users and members.
Includes some non-western languages: Arabic, Chinese, Korean.
• The core operations of LDC are self-supporting after ten years.
• The activities include maintaining the data archives, producing and
distributing CD-ROMs, and arranging networked data distribution, etc.
• All these have provided a great impetus to R&D in the field of
language technology for English and other European languages.
• It is proposed to adopt a similar approach in the Indian context.
Who funded LDC in US?
• LDC was supported initially by
US Govt grant IRI-9528587
from the Information and
Intelligent Systems division
• Also by a grant 9982201 from
the Human Computer
Interaction Program of the
National Science Foundation
• Powered in part by Academic
Equipment Grant 7826-990 237US from Sun Microsystems.
• No member institution could
afford to produce this
Who managed?
Who will set up LDC-IL in India?
What will it do actually?
• The Ministry of HRD through the Central Institute of Indian
Languages (CIIL), Mysore along with other institutions
working on Indian Languages technology like Indian
Institute of Science, Bangalore, Indian Institutes of
Technology at Mumbai and Chennai, as well as the
International Institute of Information Technology, Hyderabad
propose to set up this LDC-IL.
• It is proposed that they will be the Lead Institutions in this
initiative, with CIIL as the coordinating body.
•LDC-IL will be an archive plus.
•Besides data, tools and standards of data representation
and analysis must be developed.
•It will create, analyze, segment, tag, align, and upload
different kinds of linguistic resources.
•It will accept electronic resources from authors,
newspapers, publishers, film, TV, radio & process them for
use of the community.
Potential Participants /
Institutions in India
All academic institutes, research organizations and Corporate R&D
groups from India and abroad working on Indian languages will be
encouraged to participate in LDC-IL. The following have already
shown interest:
•IISc Bangalore;
•All Indian Institutes of Technology;
•IIITs at Hyderabad and elsewhere;
•ISI Calcutta/Hyderabad/Bangalore;
•C-DAC, Pune;
•TIFR Mumbai;
•Universities like U of Hyderabad; DU; JNU; NEHU
•HP Labs India;
•IBM; Infosys; Reliance Infocom;
•Language institutions like CIEFL, KHS, NCPUL & RSKS;
Major areas of Linguistic Resource
Development as proposed
• Speech Recognition
and Synthesis
• Character Recognition
• Creation of different
kinds of Corpora
• By-products : Word
finders, lexicons of
different kind, thesauri,
Usage compilations etc.
Other possible applications
• Collocational restrictions for
OCR building
• TTS: Statistical
Probabilities models
• Build a speech recognition
Develop Tree-bank tools
Skeletal parses
Will form a basis of MAT or MT
of MCIT, and will complement it perfectly
Funding & Management
• The core funding from the Government of India. It
will span over two plan periods.
• All activities will be in a project mode and through
CIIL’s PL account.
• All staff will be on contract.
• All receipts and payments through internet
gateways, or through conventional means, will go
to this special bank account.
• Will attempt to leverage expertise already available
to cut avoidable cost and delay.
• As the nodal agency, CIIL will further distribute the
relevant funding for specific sub-components of the
scheme to other academic institutions.
• An annual progress report will be submitted to the
• LDC-IL: Open to institutions, Research Organizations, and Corporate sector from all over the world.
• Will encourage members to contribute databases and share revenues from sale of the data they contribute.
• The databases will be available for R&D purposes to all members and non-members on payment of the appropriate fee, with a license for use only.
• General membership will entitle all to get a large chunk of tagged/aligned data for free; however, for specialized parts, depending on the data contributors, they will have to pay additional amounts.
• The organization will be asked to sign a License Agreement that the databases will not be distributed by it to others either free or for a fee.
• The IP and the copyright of any product developed as a result of such an R&D activity shall lie with the organization that has created the product.
1. The LDC-IL will have a Project Advisory Committee (PAC).
2. Permanent members: Directors or nominees of lead institutions.
3. The PAC may be expanded later.
4. Lead institutions may be made expandable, with major enterprises joining by putting in a major corpus grant.
5. It is to be understood that even if institutions from abroad join this Consortium, the administration/governance of it will remain with Indian members only.
6. An official of the Language Bureau nominated by the MHRD and a nominee of the MCIT will be members of the PAC. The FA of MHRD will also be a member.
7. The Director, Central Institute of Indian Languages will be the Head of the LDC-IL. He will be assisted by a Project Director nominated/appointed for the purpose.
8. One Expert in IPR matters normally drawn from Institutions like National Law School University, Bangalore
Differential rate of annual fee
• 1. Individual Researchers: Rs. 2,000/- per annum
• 2. Educational Institutions: Rs. 20,000/- per annum
• 3. Software and related industry: Rs. 2,00,000/- per annum
Other countries:
• 1. Individual Researchers: $2,000/- per annum
• 2. Educational Institutions: $20,000/- per annum
• 3. Software and related industry: $50,000/- per annum
• It is estimated that by the third
year, LDC-IL will have 50
Institutional members from
India, and 200 Indian scholars as
individual members, contributing
to Rs. 12 lakh annually.
• In addition, it is estimated to
have at least 20 researchers from
abroad as individual members,
contributing to $ 40,000 or Rs. 20
lakhs more.
• The attempt will be to secure
industrial support from the IT
sector internationally to raise at
least 10 institutional
memberships initially, creating a
corpus of $ 200,000 annually
by/during the third year. Should
that happen, it will generate a
substantial amount for LDC-IL.
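The membership estimates above can be checked against the fee schedule with a short calculation. This is only a sketch: the assumption that all 50 Indian institutional members pay the educational rate, the assumption that the 10 international institutional memberships are charged $20,000 each, and the exchange rate of Rs. 50 per dollar (implied by "$ 40,000 or Rs. 20 lakhs") are readings of the slide, not stated facts. Under these readings the Indian total comes out somewhat higher than the Rs. 12 lakh quoted, so the slide presumably assumes a different membership mix.

```python
# Annual fee schedule as listed on the fee slide (per annum).
FEES_INR = {"individual": 2_000, "educational": 20_000, "industry": 200_000}
FEES_USD = {"individual": 2_000, "educational": 20_000, "industry": 50_000}

# Assumption: exchange rate implied by the slide ($40,000 ~ Rs. 20 lakhs).
INR_PER_USD = 50
LAKH = 100_000


def revenue(fees, counts):
    """Sum fee * member count over the membership categories given."""
    return sum(fees[category] * n for category, n in counts.items())


# Assumption: all 50 Indian institutional members pay the educational rate.
india_inr = revenue(FEES_INR, {"educational": 50, "individual": 200})

# 20 individual researchers from abroad.
abroad_individual_usd = revenue(FEES_USD, {"individual": 20})

# Assumption: the 10 international institutional memberships cost $20,000 each.
abroad_institutional_usd = revenue(FEES_USD, {"educational": 10})

print(f"India: Rs. {india_inr / LAKH:.1f} lakh")
print(f"Abroad, individuals: ${abroad_individual_usd:,} "
      f"(Rs. {abroad_individual_usd * INR_PER_USD / LAKH:.0f} lakh)")
print(f"Abroad, institutions: ${abroad_institutional_usd:,}")
```

Changing the member counts or the category each group is billed under shows how sensitive the third-year estimate is to the membership mix.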
Budget: A broad indication*
Rs. 221.60 lakhs per year. Total: Rupees
1772.8 lakhs for the next 8 years.
• 1. Human Resources: 69,84,000
• 2. Tasks: 64,76,000
• 3. Events (Meetings, workshops,
seminars & Training programs) : 50,00,000
• 4. Equipments & maintenance: 27,00,000
• 5. IPR costs & publications: 10,00,000
Total: Rs. 2,21,60,000
•NB: The Director CIIL on the advice of the Project Advisory Committee of the
LDC-IL may be authorized to re-appropriate funds from among the heads
indicated here, without exceeding the overall budget.
•In case the people in service in the Government or Autonomous Institutions in
substantial capacity are selected their service and salary will be protected.
Sl. No. | Head | Amount (Rs.)
1 | Human Resources | 69,84,000
(a) | Project Director (1): Rs. 30,000 (variable) x 12 months | 3,60,000
(b) | Scientist A (3): 29,000 x 3 persons x 12 man-months | 10,44,000
(c) | Scientist B (4): 21,000 x 8 x 12 months | 20,16,000
(d) | Scientist C (5): 14,000 x 6 x 12 months | 10,08,000
(e) | Scientist D (8): 11,000 x 8 x 12 months | 10,56,000
(f) | Project technicians (Rs. 5,000 x 20 x 12 months) | 12,00,000
(g) | Maint. Personnel – Accounts (Rs. 11,000 x 1 x 12 months) | 1,32,000
(h) | Maint. Personnel – Sales & Promo (Rs. 7,000 x 1 x 12 months) | 84,000
(i) | Maint. Personnel – General (Rs. 7,000 x 1 x 12 months) | 84,000
2 | Tasks | 64,76,000
(a) | Tasks at various Participating Institutions (as in Annex) | 64,76,000
3 | Events | 50,00,000
(a) | Academic Meetings in diff. Instt x 2 | 2,00,000
(b) | LDC-IL PAC meetings at CIIL x 2 | 2,00,000
(c) | Seminars & Events in diff. Instt – 7 every year in participating Instt | 15,00,000
(d) | Seminars (National) in diff. Instt x 2 | 4,00,000
(e) | Seminars (Regional) in diff. Instt x 4 | 6,00,000
(f) | Seminars (Int'l) rotating in diff. participating Instt x 1 per year | 5,00,000
(g) | (Prod) Workshops for production (6) | 6,00,000
(h) | Training Programmes x 4 per year | 2,00,000
(i) | Travel & Incidentals | 8,00,000
4 | Equipments & Maintenance | 27,00,000
(a) | Hardware | 20,00,000
(b) | Software/Tools | 4,00,000
(c) | Equipment maintenance | Variable (From OE-Non-Pl)
(d) | Maintenance of LDC-IL | 3,00,000
5 | IPR costs & publications | 10,00,000
(a) | IPR/Copyright payments (variable) | 5,00,000
(b) | Publications, incl. E-pub (10 a year) | 5,00,000
Total | | Rs. 2,21,60,000
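The budget figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes the Human Resources head from the rate x persons x months entries as they appear in the table, adds the other four heads, and multiplies by the eight-year project period; the variable names are illustrative only.

```python
# Staff lines from the table: (label, monthly rate in Rs., persons, months).
STAFF = [
    ("Project Director", 30_000, 1, 12),
    ("Scientist A", 29_000, 3, 12),
    ("Scientist B", 21_000, 8, 12),
    ("Scientist C", 14_000, 6, 12),
    ("Scientist D", 11_000, 8, 12),
    ("Project technicians", 5_000, 20, 12),
    ("Maint. Personnel - Accounts", 11_000, 1, 12),
    ("Maint. Personnel - Sales & Promo", 7_000, 1, 12),
    ("Maint. Personnel - General", 7_000, 1, 12),
]
human_resources = sum(rate * persons * months for _, rate, persons, months in STAFF)

# The five budget heads, in rupees per year.
HEADS = {
    "Human Resources": human_resources,
    "Tasks": 6_476_000,
    "Events": 5_000_000,
    "Equipments & maintenance": 2_700_000,
    "IPR costs & publications": 1_000_000,
}
annual_total = sum(HEADS.values())

YEARS = 8
project_total = annual_total * YEARS
LAKH = 100_000

print(f"Human Resources: Rs. {human_resources / LAKH:.2f} lakhs")
print(f"Annual total:    Rs. {annual_total / LAKH:.2f} lakhs")
print(f"Eight years:     Rs. {project_total / LAKH:.1f} lakhs")
```

The computed values agree with the totals on the budget slide (Rs. 221.60 lakhs per year and Rs. 1772.8 lakhs over eight years).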
Resource Generation- Details
The first 2 years of the project are
incubation years. It would take time
to set up, and test-run tools and
deliverables & advertise.
It is estimated that from the third
year onwards, the annual revenue
may be 8% to 10% of the annual
investment, i.e. Rs. 17.73 lakhs to Rs.
22.16 lakhs contributing to Corpus
From the 6th year on, it will be around 25% to
30% of the amount invested, i.e.
Rs.55.4 lakhs to Rs.66.48 lakhs
At the end of eight years, there will be
at least Rs. 201.66 lakhs to Rs. 243.76
lakhs plus interests in corpus funds.
Hopefully, there will be new lead
institutions to contribute to corpus
fund further, once LDC-IL works in full
Core Operations to be self-supporting
• Beyond eight years, Govt may
support only events (Rs.50 lakhs
from CIIL’s OC-Plan), tasks of
software development (Rs.64.76
lakhs from our OE-Plan), and
maintenance of equipments
(Rs.15.24 lakhs from OE-Non-Plan),
i.e. Rs.130 lakhs a year.
• The services of the personnel and
the IPR costs will be paid from 6%
interests of the corpus funds
(Rs.14.63 lakhs) plus anticipated
annual income, i.e. 66.48 lakhs, i.e.
Rs.81.11 lakhs generated annually.
With Rs.130 lakhs as above, the
total comes to Rs.211.11 lakhs
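The revenue projections can be reproduced as follows. This is a sketch under two assumptions that are inferred from the figures rather than stated on the slides: the upper late-period rate is taken as 30% (the rate that yields Rs. 66.48 lakhs), and the eight-year corpus is accumulated over two years at the early rates plus three years at the late rates, which is the split that matches the quoted totals.

```python
ANNUAL_INVESTMENT = 221.60  # Rs. lakhs per year, from the budget slide

# Revenue as a share of the annual investment: (low, high).
EARLY_RATES = (0.08, 0.10)  # from the third year onwards
LATE_RATES = (0.25, 0.30)   # from the sixth year; 30% matches Rs. 66.48 lakhs

# Assumption: two years at the early rates and three at the late rates,
# the split that reproduces the slide's eight-year corpus totals.
EARLY_YEARS, LATE_YEARS = 2, 3


def corpus_after_eight_years(early_rate, late_rate):
    """Accumulated revenue (Rs. lakhs), ignoring interest."""
    return ANNUAL_INVESTMENT * (early_rate * EARLY_YEARS + late_rate * LATE_YEARS)


corpus_low = corpus_after_eight_years(EARLY_RATES[0], LATE_RATES[0])
corpus_high = corpus_after_eight_years(EARLY_RATES[1], LATE_RATES[1])

# Beyond eight years: 6% interest on the corpus plus the anticipated income.
INTEREST_RATE = 0.06
interest = INTEREST_RATE * corpus_high
annual_income = LATE_RATES[1] * ANNUAL_INVESTMENT
self_generated = interest + annual_income

# Government support: events + tasks + maintenance of equipments.
GOVT_SUPPORT = 50.00 + 64.76 + 15.24
total_available = GOVT_SUPPORT + self_generated

print(f"Corpus after 8 years: Rs. {corpus_low:.2f} to {corpus_high:.2f} lakhs")
print(f"Interest at 6%:       Rs. {interest:.2f} lakhs")
print(f"Self-generated:       Rs. {self_generated:.2f} lakhs")
print(f"Total per year:       Rs. {total_available:.2f} lakhs")
```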
Thank you
Speech Recognition and Synthesis: Objectives
Primarily to build speech recognition and synthesis systems.
Although there are ASR & TTS systems for many western languages,
commercially viable speech systems are unavailable for Indian languages.
Voice User Interfaces for IT applications and services, useful especially in
telephony-based applications.
If such technology is available in Indian languages, people in various semi-urban and rural parts of India will be able to use telephones and Internet to
access a wide range of services and information on health, agriculture, travel,
However, for this a computer has to be able to accept speech input in the
user’s language and provide natural speech output.
Also in India, if speech technology is coupled with translation systems
between the various Indian languages.
The main obstacle to customizing this technology for various Indian
languages is the lack of appropriate annotated speech databases.
Focus: (i) to collect data that can be used for building speech enabled
systems in Indian languages and (ii) to develop tools that facilitate collection of
high quality speech data.
Goals – long & short term
Long Term Goal:
The grand vision of this project is to collect data to provide speech-to-speech translation from each and every language
to each and every other language spoken in India (including Indian English). Such a system would include unlimited
vocabulary speech synthesis and recognition systems for every Indian language coupled with machine translation systems
between those languages. The block diagram given below describes the basic architecture of such a system.
[Block diagram: Speech input in Language A -> Speech Recognition in Language A -> Recognized Text in Language A -> Machine Translation from Language A to B -> Translated Text in Language B -> Text to Speech conversion in Language B -> Speech Output in Language B]
Short Term Goal:
To create databases for building (a) a bi-directional speech-to-speech translation system of read speech for a pair of
Indian languages, namely, Hindi-Telugu, (b) a speech recognition system for Indian English. Further, it is desired to collect
large vocabulary isolated data for the 22 Scheduled Indian languages.
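The three-stage architecture in the block diagram (recognition, translation, synthesis) can be sketched as a simple composition of functions. All names below are hypothetical, and the toy stages merely stand in for trained ASR, MT and TTS components so that the sketch runs without any models.

```python
from typing import Callable

# Each stage is modelled as a plain function; real systems would wrap
# trained models behind the same signatures.
Recognizer = Callable[[bytes], str]   # speech in language A -> text in A
Translator = Callable[[str], str]     # text in A -> text in B
Synthesizer = Callable[[str], bytes]  # text in B -> speech in B


def make_speech_translator(
    asr: Recognizer, mt: Translator, tts: Synthesizer
) -> Callable[[bytes], bytes]:
    """Chain the three stages in the order shown in the block diagram."""

    def translate_speech(audio_a: bytes) -> bytes:
        text_a = asr(audio_a)   # Speech Recognition in Language A
        text_b = mt(text_a)     # Machine Translation from A to B
        return tts(text_b)      # Text to Speech conversion in Language B

    return translate_speech


# Hypothetical stand-in stages: "audio" is just UTF-8 text here.
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_mt(text: str) -> str:
    glossary = {"hello": "namaste"}  # illustrative one-word glossary
    return " ".join(glossary.get(word, word) for word in text.split())


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = make_speech_translator(toy_asr, toy_mt, toy_tts)
output_audio = pipeline(b"hello world")
print(output_audio)
```

Keeping the stages behind separate signatures means each one can be built and evaluated on its own database, which is the reason the data collection effort is split by component.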
Data Collection Effort for Automatic Speech Recognition (ASR)
Data required: Read speech corpora for two Indian languages and Indian English.
Channels:
1. Close talking microphone, on a desktop or laptop.
2. Telephone, both landline and mobile.
Annotation: The data will be annotated at phoneme, syllable, word and sentence levels.
Data Collection for Isolated Speech Recognition
Channels:
1. Close talking microphone, on a desktop or laptop
2. Telephone, both landline and mobile
Demography: 10,000 words from 300 speakers (150 male, 150 female)
Data Collection for Text to Speech Synthesis
Data Required: Data will be collected in the form of read-out phonetically balanced text which will ensure
coverage of all speech sounds of the language concerned in different prosodic and phonological contexts. The
phonetically balanced text will be extracted from a huge text corpus.
Channels:
Speech Synthesis requires high quality recording in an anechoic chamber using high quality
microphones and recording equipment.
Demography: 6 speakers: 3 males and 3 females per language.
Annotation: Data to be annotated at phone, phoneme, syllable, word, and phrase level.
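One common way to extract a small, phonetically balanced reading text from a large corpus is greedy selection: repeatedly pick the sentence that covers the most sounds not yet covered. The sketch below is not the project's stated method; it uses letters as a stand-in for phones (a real system would use a pronunciation lexicon or grapheme-to-phoneme rules), and the function names are illustrative.

```python
def select_balanced(sentences, units_of):
    """Greedily pick sentences until every unit seen in the corpus is covered."""
    remaining = set()
    for sentence in sentences:
        remaining |= units_of(sentence)

    chosen = []
    while remaining:
        # Pick the sentence that adds the most still-uncovered units.
        best = max(sentences, key=lambda s: len(units_of(s) & remaining))
        chosen.append(best)
        remaining -= units_of(best)
    return chosen


def letters(sentence):
    # Assumption: letters approximate phones for the purpose of this sketch.
    return {ch for ch in sentence.lower() if ch.isalpha()}


corpus = ["abc", "cd", "de", "abcde", "f"]
selected = select_balanced(corpus, letters)
print(selected)
```

Replacing `letters` with a function that returns diphones or syllables would extend the same selection to the "different prosodic and phonological contexts" mentioned above.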
Possible Applications:
• Speech to Speech translation for a pair of Indian
languages, namely, Hindi and Telugu.
• Command and control applications.
• Multimodal interfaces to the computer in Indian languages.
• E-mail readers over the telephone.
• Readers for the visually disadvantaged.
• Speech enabled Office Suite.
The effort for both Speech Recognition and Speech Synthesis will be repeated
across all 22 Scheduled languages. For Speech Recognition, spontaneous speech
data will be collected along with read speech. For speech synthesis, data will be
collected from professional speakers, with very good voice quality. Additional
speech data will be collected to come out with models for prosody (intonation,
duration, etc.) to improve the naturalness of synthesized speech. A database
(lexicon) of proper names (of Indian origin) will be created, with the equivalent
phonetic representation for each of the names.
Character Recognition
Character Recognition refers to the conversion of printed or
handwritten characters to a machine-interpretable form.
“Online” handwriting recognition or Online HWR refers to the
interpretation of handwriting captured dynamically using a handheld
or tablet device. It allows the creation of more natural handwriting-based alternatives to keyboards for data entry in Indian scripts, and
also for imparting of handwriting skills using computers.
“Offline” handwriting recognition or Offline HWR refers to the
interpretation of handwriting captured statically as an image.
Optical character recognition or OCR refers to the interpretation of
printed text captured as an image. It can be used for conversion of
printed or typewritten material such as books and documents into
electronic form.
These different areas of language technology require different
algorithms and linguistic resources.
They are all hard research problems because of the variety of writing
styles and fonts encountered.
Of these, OCR has seen some research in a few Indian scripts because
of support from the TDIL program. However the technology is not yet
mature and there is only one commercial offering.
Possible Applications
1. Handwriting Interface to Computers
Indian scripts are complex and not suitable for keyboard-based entry. Replacing the
keyboard with a simpler and more natural interface based on handwriting would make
computers much more accessible to the common man and to educators in particular. The
solution would also need to support numerals, punctuation, and editing gestures, and
functionally replace the keyboard.
2. Handwriting Tutor
3. Multilingual Digital Libraries for Education
A wealth of literature and other education material in Indian languages is trapped in
books, which require storage and are subject to physical decay. Online books may be
easily made available to students all over in their schools, homes or hostels.
The proposed solution will use a complete OCR pipeline for converting scanned images of
book pages into electronic form, with search using the local language, using either spoken
(using Speech Recognition) or written (using Online HWR) queries.
4. Automatic Forms Processing / Educational Testing
With millions of application forms filled in every year in Indian languages, especially in the
education sector, a solution for automatically reading handwriting from scanned images of
forms is valuable.
The proposed solution is a complete forms-processing system.
The interpreted results can be stored into a database (for applications) or compared with
correct responses (for educational testing).
Natural Language Processing
Electronic dictionaries:
Electronic dictionaries are a primary requisite for developing any software in NLP.
ED 1 Monolingual/bilingual dictionaries
25,000 words per year (per language)
ED 2. Transfer Lexicon and Grammar(TransLexGram) (per language)
Transfer Lexicon and Grammar above involves developing a language
resource which would contain
• English Headwords
• Their grammatical category
• Their various senses in Hindi
• Corresponding sense in the other Indian language
• An example sentence in English for each sense of a word
• Corresponding translation in the concerned Indian language
• In case of verbs, parallel verb-frames from English to Indian language.
As is obvious from the above, TransLexGram will be a rich lexicon which will not
only contain the word level information but also the crucial information of verb-argument structure and the vibhaktis with specific senses of a verb.
The resource, once created will be a parallel resource not only between
English and Indian languages but also across all Indian languages.
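The fields listed for TransLexGram suggest a record structure along the following lines. This is only an illustrative sketch: the class and field names, the verb-frame notation and the placeholder strings are assumptions, not the project's actual schema or data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    hindi: str                        # sense of the headword in Hindi
    other_language: str               # corresponding sense in the other Indian language
    example_en: str                   # example sentence in English
    example_translation: str          # its translation in the concerned Indian language
    verb_frame: Optional[str] = None  # parallel verb frame, for verbs only


@dataclass
class Entry:
    headword: str                     # English headword
    category: str                     # grammatical category
    senses: List[Sense] = field(default_factory=list)


# Hypothetical entry; bracketed strings are placeholders, not real data.
entry = Entry(
    headword="eat",
    category="verb",
    senses=[
        Sense(
            hindi="khaanaa",
            other_language="<sense in the other Indian language>",
            example_en="She eats an apple.",
            example_translation="<translation in the concerned Indian language>",
            verb_frame="<English frame> :: <Indian-language frame with vibhaktis>",
        )
    ],
)

print(entry.headword, entry.category, len(entry.senses))
```

Because every sense is keyed to the same English headword, entries for two Indian languages can be joined on the headword and sense, which is how the resource becomes parallel across Indian languages.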
Creation of Corpora
Domain Specific Corpora:
Apart from these basic text corpora creation an attempt will be made to create
domain specific corpora in the following areas :
Newspaper corpora
Child language corpus
Pathological speech/language data
Speech error Data
Historical/inscriptional databases of Indian languages, which are among the
most important resources, both as living documents of Indian history and
for tracing the historical linguistics of Indian languages.
Comparative, descriptive and reference grammars also need to be
considered as a corpus of databases.
Morphological Analyzers and morphological generators.
POS tagged corpora
• Part-of-speech (or POS) tagged corpora are collections of texts
in which part of speech category for each word is marked.
• To be developed in a bootstrapping manner.
• First, manual tagging will be done on some amount of text.
• Then, a POS tagger which uses learning techniques will be used
to learn from the tagged data.
• After the training, the tool will automatically tag another set of
the raw corpus.
• Automatically tagged corpus will then be manually validated
which will be used as additional training data for enhancing the
performance of the tool.
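The bootstrapping cycle described above (manual seed, train, auto-tag, validate, retrain) can be sketched as a loop. The learner here is a deliberately simple most-frequent-tag model standing in for whatever trainable tagger is actually used, and `validate` is a hypothetical stand-in for the manual validation step.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn the most frequent tag per word (a stand-in for a real learner)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, default="UNK"):
    """Tag a sentence; unseen words get the default tag."""
    return [(word, model.get(word, default)) for word in words]


def bootstrap(seed, raw_batches, validate):
    """Grow the training data one validated batch at a time."""
    training = list(seed)
    model = train(training)
    for batch in raw_batches:
        auto_tagged = [tag(model, words) for words in batch]
        training.extend(validate(sentence) for sentence in auto_tagged)
        model = train(training)  # retrain on the enlarged data
    return model


# Manually tagged seed data (illustrative).
seed = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]]
raw_batches = [[["the", "cat", "barks"]]]

# Hypothetical validator: a human corrects the tags the tool could not assign.
corrections = {"cat": "NOUN"}


def validate(sentence):
    return [(w, corrections.get(w, t) if t == "UNK" else t) for w, t in sentence]


model = bootstrap(seed, raw_batches, validate)
print(tag(model, ["the", "cat"]))
```

Each pass leaves less work for the human validator, which is why the manual effort is concentrated in the first round.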
Other kinds of Corpora
Chunked corpora:
• The chunked corpora will also be prepared in a manner similar to the POS
tagging. Here also the initial training set will be a complete manual effort.
Thereafter, it will be a man-machine effort. That is why the target in the
first year is less, and double in the successive years. Chunked corpora
are a useful resource for various applications.
Semantically tagged corpora:
• The real challenge in any NLP and text information processing
application is the task of disambiguating senses. In spite of long years
of R&D in this area, fully automatic WSD with 100% accuracy has
remained an elusive goal. One of the reasons for this shortcoming is
understood to be the lack of appropriate and adequate lexical resources
and tools. One such resource is the "semantically tagged corpora".
Syntactic tree bank:
Preparation of this resource requires a higher level of linguistic
expertise and needs more human effort. First, experts will manually
tag the data for syntactic parsing. A crucial point related to
this task is to arrive at a consensus regarding the tags, the degree of
fineness in analysis and the methodology to be followed. This
calls for some discussions amongst scholars from varying fields
such as Sanskritists, linguists and computer scientists. It will be
achieved through the conduct of workshops and meetings.
Parallel aligned corpora:
A text available in multiple languages constitutes parallel corpora.
NBT & Sahitya Akademi are some of the official agencies who
develop parallel texts in different languages through translation.
Such Institutions have given permission to CIIL to use their
works for creation of electronic versions of the same as parallel corpora.
The literary magazines and news language editions will have to be
approached for parallel corpora.
Computer programmes have to be written for creating
[I] Aligned texts; [II] Aligned
sentences; and [III] Aligned
Corpora Tools
• 1. Tools for Transfer Lexicon Grammar (including creation of interface for
building Transfer Lexicon Grammar)
• 2. Spellchecker and corrector tools
• 3. Tools for POS tagging (trainable tagging tool + an interface for editing POS
tagged corpora)
• 4. Tools for chunking (rule-based language-independent chunkers)
• 5. Interface for chunking (building an interface for editing and validating the
chunked corpora)
• 6. Tools for syntactic tree bank, incl. interface for developing syntactic tree bank
• 7. Tools for semantic tagging, whose basic resources are the Indian language
WordNets: a browser that has two windows, one showing the text to be tagged,
while the senses (i.e., synsets) from the WordNet appear in the other window,
after which a manual selection of the sense can be done
• 8. (Semi) automatic tagger based on statistical NLP (the preliminary version of
which is ready in IITB)
• 9. Tools for text alignment, including text alignment tool, sentence alignment tool
and chunk alignment tool, as well as an interface for aligning corpora