Multilingual Information Access in a Digital Library
Vamshi Ambati, Rohini U, Pramod, N Balakrishnan and Raj Reddy
International Institute of Information Technology
Hyderabad, India
Context
• Digital Library of India
  - 155,000 English books
  - 145,000 books in other languages
• Literate population of India
  - 20% understand English
  - 80% do not
IIIT Hyderabad - http://dli.iiit.ac.in
Multilingual Access to Information
• Retrieve a book
  - By metadata
  - By keyword / content
  - Cross Lingual Information Retrieval
• Read a book
  - Help understand sentences in a language
  - Help understand sentences across languages
  - Machine Translation
Approaches to Multilingual Access
• Cross Lingual Retrieval
  - Translate Query to Document Language
  - Translate Document to Query Language
• Machine Translation
  - Knowledge Based Approaches
  - Corpus Based Approaches
  - Hybrid Approaches
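The "translate query to document language" option can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionary entries (romanized Hindi), the function names `translate_query` and `search`, and the in-memory term-overlap ranking are all hypothetical stand-ins, not the system described in these slides.

```python
from typing import Dict, List

# Toy English -> Hindi (romanized) dictionary. Entries are illustrative only.
TOY_DICT: Dict[str, List[str]] = {
    "book": ["kitab", "pustak"],
    "water": ["pani", "jal"],
    "house": ["ghar"],
}


def translate_query(query: str, dictionary: Dict[str, List[str]]) -> List[str]:
    """Expand each query term into all of its dictionary translations.

    Terms missing from the dictionary are kept unchanged, so names and
    untranslatable words can still match documents.
    """
    translated: List[str] = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def search(query_terms: List[str], documents: Dict[str, str]) -> List[str]:
    """Rank documents by how many distinct query terms they contain."""
    scored = []
    for doc_id, text in documents.items():
        tokens = set(text.lower().split())
        score = sum(1 for term in set(query_terms) if term in tokens)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


if __name__ == "__main__":
    docs = {"d1": "ghar mein pani", "d2": "pustak aur kitab", "d3": "kuch nahi"}
    print(search(translate_query("book", TOY_DICT), docs))
```

Keeping every translation of a term favors recall over precision, which is a common choice when the dictionary offers no way to pick the right sense.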
Challenges in Multilingual Access
• Corpus Based Approaches
  - Unavailability of Parallel Corpus for pairs of languages
  - Unavailability of Computational Linguistics Resources
• Dictionary Based Approaches
  - Unavailability of multiple bilingual dictionaries
Resources
• Universal Dictionary
  - Conceived and implemented by Michael Shamos at CMU, USA
• ITRANS
  - A transcription scheme and associated tool built by IISc, IIIT and CMU
• Corpus
  - Data Entry by TTD and DLI project
  - TIDES project
Universal Dictionary
How are we doing it
• Cross Lingual Search (Identify Information)
  - Dictionary lookup
  - User feedback based
  - Lucene Search Engine
• Machine Translation (Understand Information)
  - Corpus based technique (EBMT)
  - Dictionary based word-word lookup
  - Good-enough translation vs Perfect translation
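The combination of a corpus-based technique with word-to-word dictionary lookup might look roughly like the sketch below. It is a heavy simplification of EBMT: it retrieves only the closest whole-sentence example rather than matching and recombining fragments, and it falls back to a word-for-word gloss when no example is close enough. The toy corpus, the dictionary, the Jaccard similarity measure, the 0.8 threshold and the function names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Toy parallel corpus (English, romanized Hindi). Illustrative only.
PARALLEL: List[Tuple[str, str]] = [
    ("this is a book", "yah ek kitab hai"),
    ("this is a house", "yah ek ghar hai"),
]

# Toy word-to-word dictionary. Illustrative only.
WORD_DICT: Dict[str, str] = {"book": "kitab", "house": "ghar", "water": "pani"}


def jaccard(a: List[str], b: List[str]) -> float:
    """Token-set overlap between two sentences, in the range [0, 1]."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def translate(
    sentence: str,
    corpus: List[Tuple[str, str]],
    dictionary: Dict[str, str],
    threshold: float = 0.8,
) -> str:
    """Return the closest stored example, else a word-for-word gloss."""
    tokens = sentence.lower().split()
    best_score, best_target = 0.0, ""
    for source, target in corpus:
        score = jaccard(tokens, source.split())
        if score > best_score:
            best_score, best_target = score, target
    if best_score >= threshold:
        return best_target
    # Fallback: gloss each word; unknown words pass through untranslated.
    return " ".join(dictionary.get(token, token) for token in tokens)


if __name__ == "__main__":
    print(translate("This is a book", PARALLEL, WORD_DICT))
    print(translate("this is water", PARALLEL, WORD_DICT))
```

The fallback output is a partial gloss, not a fluent sentence, which is the kind of trade-off the "good-enough vs perfect" bullet points at: a reader may still get the gist.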
Cross Lingual Retrieval
Reading Assistant System
Reading Assistant
Status Today
• CLIR for 6 languages
• MT for 3 languages
  - Shakti (a knowledge-based MT system)
  - Parallel corpus for Hindi-English
• UDICT
  - About 40 foreign languages
  - 6 Indian languages
What more is needed?
• UDICT
  - Improving coverage of existing languages
  - Adding new languages
• Machine Translation
  - Corpus acquisition
  - State-of-the-art techniques applied to Indian languages
  - Multi-way parallel corpus development
• Textual format for the books
  - Books are currently in image formats
  - OCR should be developed to extract textual content
Thank You
Questions?