Information Access for the Developing World
Srini Narayanan
[email protected]
Madelaine Plauché
Joyojeet Pal
Hesperian Press (Tawnia Litwin)
Amrita University, Ettimadai
Talk Outline
Why?
  • Introduction
How?
  • Past (motivating) examples
  • Current project
What next?
  • So what?

Pos | Language | Family | Script(s) Used | Speakers (millions) | Where Spoken (Major)
1 | Mandarin | Sino-Tibetan | Chinese Characters | 1051 | China, Malaysia, Taiwan
2 | English | Indo-European | Latin | 510 | USA, UK, Australia, Canada, New Zealand
3 | Hindi | Indo-European | Devanagari | 490 | North and Central India
4 | Spanish | Indo-European | Latin | 425 | The Americas, Spain
5 | Arabic | Afro-Asiatic | Arabic | 255 | Middle East, Arabia, North Africa
6 | Russian | Indo-European | Cyrillic | 254 | Russia, Central Asia
7 | Portuguese | Indo-European | Latin | 218 | Brazil, Portugal, Southern Africa
8 | Bengali | Indo-European | Bengali | 215 | Bangladesh, Eastern India
9 | Indonesian | Malayo-Polynesian | Latin | 175 | Indonesia, Malaysia, Singapore
10 | French | Indo-European | Latin | 130 | France, Canada, West Africa, Central Africa
11 | Japanese | Altaic | Chinese Characters and 2 Japanese Alphabets | 127 | Japan
12 | German | Indo-European | Latin | 123 | Germany, Austria, Central Europe
13 | Farsi (Persian) | Indo-European | Nastaliq | 110 | Iran, Afghanistan, Central Asia
14 | Urdu | Indo-European | Nastaliq | 104 | Pakistan, India
15 | Punjabi | Indo-European | Gurumukhi | 103 | Pakistan, India
16 | Vietnamese | Austroasiatic | Based on Latin | 86 | Vietnam, China
17 | Tamil | Dravidian | Tamil | 78 | Southern India, Sri Lanka, Malaysia
18 | Wu | Sino-Tibetan | Chinese Characters | 77 | China
19 | Javanese | Malayo-Polynesian | Javanese | 76 | Indonesia
20 | Turkish | Altaic | Latin | 75 | Turkey, Central Asia
21 | Telugu | Dravidian | Telugu | 74 | Southern India
22 | Korean | Altaic | Hangul | 72 | Korean Peninsula
23 | Marathi | Indo-European | Devanagari | 71 | Western India
24 | Italian | Indo-European | Latin | 61 | Italy, Central Europe
25 | Thai | Sino-Tibetan | Thai | 60 | Thailand, Laos
26 | Cantonese | Sino-Tibetan | Chinese Characters | 55 | Southern China
27 | Gujarati | Indo-European | Gujarati | 47 | Western India, Kenya
28 | Polish | Indo-European | Latin | 46 | Poland, Central Europe
29 | Kannada | Dravidian | Kannada | 44 | Southern India
30 | Burmese | Sino-Tibetan | Burmese | 42 | Myanmar
Internet Language Use
Language | Users (millions) | % of Internet
English | 235 | 38.3
Chinese | 69 | 11.2
Japanese | 61.4 | 10
German | 42 | 6.8
Spanish | 32.7 | 5.5
Korean | 25.2 | 4.1
Italian | 24 | 3.9
French | 22 | 3.5
Portuguese | 19 | 3.1
Russian | 18.1 | 3
Dutch | 12.4 | 2
Polish | 6.7 | 1.1
Swedish | 6 | 1
Arabic | 5.7 | 1
Malay | 4.8 | 0.8
Turkish | 4 | 0.7
Danish | 3.4 | 0.6
Norwegian | 2.5 | 0.4
Thai | 2.3 | 0.4
Czech | 2.2 | 0.4
Finnish | 2.1 | 0.3
Catalan | 2 | 0.3
Greek | 2 | 0.3
Hebrew | 2 | 0.3
Romanian | 2 | 0.3
Vietnamese | 1.5 | 0.2
Hungarian | 1.2 | 0.2
Icelandic | 0.9 | 0.1
Slovak | 0.75 | 0.1
Slovenian | 0.7 | 0.1
No Indian or African language is in the top 30
Indian Languages





08/04/06
Languages: 17 official languages
10 languages spoken by 50 million or more
25 languages spoken by a million or more
325 dialects
States formed based on language
Africa’s presence
• 1 in 4 have a radio (205m)
• 1 in 9 have a TV (91m)
• 1 in 12 have a mobile phone (68m) (1 in 9 today)
• 1 in 30 have a fixed line (28m)
• 1 in 130 have a PC (6.3m)
• 1 in 160 have direct Internet access (5.5m)
• 1 in 400 have pay-TV (2m)
Yes, they want to use technology
Long waits for computer access
Talk Outline
Why?
The Past: examples
  • Being there
  • Serve your target, not your technology
  • Partner with the community
Current Work
So what?
  • Evaluation and use cases
Being there
Your intuitions (and mine) sitting in a lab are not very useful.
  • Nor are your friends’, your mother’s, or your schoolmates’.
  • Solutions may be simple once you immerse yourself in the environment.
Work of Joyojeet Pal (TIER group, Berkeley)
http://tier.cs.berkeley.edu/docs/www06/multi_user_computer_education-jp.pdf
Currently implemented


Prototype project (Microsoft Research) where schools in rural India have multiple mice
Initial results are encouraging
  • Learning in a community helps
  • Common cooperative activities have an enhanced effect
  • Children adapt easily to the multiple mice
Main Contact: Joyojeet Pal, [email protected]
Serve your target, not your technology
Illiterate users - stats
  • 481 million illiterate users in South Asia
  • 289 million female illiterates in South Asia
  • 189 million illiterates in Africa
User Needs
Health care
  • Primary care
  • Where there is no doctor
Agriculture
  • Crop prices
  • Weather
Speech Recognition for Illiterate Users
Madelaine Plauché, ICSI and UC Berkeley
Data collection and annotation are expensive
/s p i tsh v e r i z/
Speech varies
Illiterate speakers are hardest to record
200 km
It's not easy to wreck a nice beach.
It's not easy to recognize speech.
Simple, Scalable Speech Recognition
  • Triphone model
  • Fixed set of target phrase patterns
  • Small vocabulary
MSSRF
VRC
Sempatti
VKC
Panchampatti
Daily news reports at MSSRF VRCs
Daily postings of weather, news, and market prices in the Sempatti village knowledge center. Illiterate villagers require an interpreter to gain access to this information.
A Tamil market application
A Community-Based Speech App
[Figure labels: Community Rec, Traditional Rec]
Current Status

In use at Sempatti VRC
Current work (ICSI, VST, Berkeley) could extend this to cellphones with VST API
Prototype application
  • planned for Coimbatore, Pondicherry, and Madurai
Invaluable data collection tool!!
  • Lots of scope for bootstrapping/using ML algorithms for fast identification, clustering word mentions, etc.
Contact Madelaine Plauché ([email protected])
Primary health care and developing regions
Question: Can technology help provide access to basic health care information to rural populations in the developing world?
Approach

Find someone who knows something about this area!
  • Partner with a trusted NGO/non-profit that
      • is respected as a primary health care information provider
      • has a significant presence in the developing world
      • is interested in experimenting with technology
      • has contacts who can field test the output
Who is working on the project

Hesperian Press
  • Tawnia Litwin
  • Digital Advisory Committee (DAC)
From John Canny’s group (search and digitization)
  • Divya Ramachandran
  • Simon Tan (UG)
From ICSI (semantic database, field-testing, speech)
  • Joyojeet Pal
  • Madelaine Plauché (currently in South Africa)
Google
  • Sponsor and general resource provider (multilingual search)
Hesperian Digital Library Project




  • Simple PDFs that can be viewed and downloaded easily through an on-line library; in addition to the English & Spanish editions we publish ourselves, we will make available PDFs of editions in at least 35 different languages (these editions are published by our partners around the world);
  • Hyper-linked editions (in English and Spanish) on-line and on CDs that enable easy cross-referencing onscreen within books and between books;
  • Digitized audio recording to facilitate use in radio broadcast and by the illiterate and visually impaired, to be available on-line and on CDs;
  • Text-free interactive slide shows, available on-line and on CDs, using voice commands to access audio-enabled illustrations.
Approach

Work in layers of increasing complexity
1. Digitize existing materials as needed
   1. Encoding issues, audio/video format etc.
   2. Hyperlink (cross-references, similar remedies/situations etc.)
2. Provide search functionality
   1. Initial keyword search (monolingual)
   2. Move toward multi-lingual search
3. Provide a semantic database
   1. Initially select a set of “semantic doors-in”
   2. Use this to structure and index/retrieve information (multi-lingual) and across specific handbooks/materials
   3. Use this for context sensitive indexing and retrieval
4. Provide programmatic access to the material
   1. Procedures and protocols for recognizing/treatments etc.
   2. Speech and cell phone access to the information
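The "initial keyword search (monolingual)" layer could be prototyped with a small inverted index. The sketch below is illustrative only: the page identifiers and sample passages are invented, the tokenizer handles plain English letters only, and nothing here is taken from the project's actual implementation.

```python
import re
from collections import defaultdict

# Hypothetical sample passages; identifiers and text are made up for illustration.
PAGES = {
    "book-a/p1": "Diarrhea can cause dehydration. Give a rehydration drink.",
    "book-a/p2": "Fever may be a sign of malaria.",
    "book-b/p1": "Clean water and latrines help prevent diarrhea.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages):
    """Map each word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the sorted page ids that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


INDEX = build_index(PAGES)
print(search(INDEX, "diarrhea"))
print(search(INDEX, "rehydration drink"))
```

Moving toward multi-lingual search would at least require a tokenizer that handles other scripts and one index per language; that step is not sketched here.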
Semantic “doors-in” to the material

Initial Set from Tawnia Litwin (Hesperian Press)

















  • injections
  • diarrhea
  • fever
  • vomiting
  • blood pressure
  • HIV/AIDS
  • malaria
  • tuberculosis
  • worms
  • dehydration
  • antibiotics
  • diabetes
  • nutrition/malnutrition
  • family planning
  • cleanliness/sanitation
  • latrines
  • water
Semantic Index
Disease
  • Background Conditions
  • Symptoms
  • Cause
      • Infectious
          • Organism
          • Method of Spreading
          • Prevention of Spreading
  • Response
      • Individual
          • Nutrition
          • Hygiene
          • Medication
      • Social/Background
          • Eradication Techniques
          • Community Prevention/Policy
Example: Dehydration and Diarrhea
Disease: Dehydration and Diarrhea
  • Background Conditions: Malnutrition
  • Symptoms: Dehydration Signs (Spanish)
  • Cause: Causes of Diarrhea
      • Infectious
          • Organism: Bacteria
          • Modes of Spreading: Water, Contact, Feces.
  • Response
      • Individual: Taking care of a person with Diarrhea
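One way to picture a semantic index entry is as a nested record whose slots follow the Semantic Index slide, filled with the values from the Dehydration and Diarrhea example. This is only a sketch: the class and field names (`DiseaseEntry`, `Cause`, `Response`, `index_keys`) are hypothetical, and the project's database is described as RDF rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Cause:
    description: str
    infectious: bool = False
    organisms: List[str] = field(default_factory=list)
    spreading_modes: List[str] = field(default_factory=list)
    spreading_prevention: List[str] = field(default_factory=list)


@dataclass
class Response:
    individual: List[str] = field(default_factory=list)  # nutrition, hygiene, medication
    social: List[str] = field(default_factory=list)      # eradication, community policy


@dataclass
class DiseaseEntry:
    name: str
    background_conditions: List[str]
    symptoms: List[str]
    cause: Cause
    response: Response


# Values copied from the example slide; the slot names are assumptions.
ENTRY = DiseaseEntry(
    name="Dehydration and Diarrhea",
    background_conditions=["Malnutrition"],
    symptoms=["Dehydration Signs"],
    cause=Cause(
        description="Causes of Diarrhea",
        infectious=True,
        organisms=["Bacteria"],
        spreading_modes=["Water", "Contact", "Feces"],
    ),
    response=Response(individual=["Taking care of a person with Diarrhea"]),
)


def index_keys(entry: DiseaseEntry) -> List[Tuple[str, str]]:
    """Flatten an entry into (slot, value) pairs usable as retrieval keys."""
    pairs = [("disease", entry.name)]
    pairs += [("background_condition", v) for v in entry.background_conditions]
    pairs += [("symptom", v) for v in entry.symptoms]
    pairs.append(("cause", entry.cause.description))
    pairs += [("organism", v) for v in entry.cause.organisms]
    pairs += [("spreading_mode", v) for v in entry.cause.spreading_modes]
    pairs += [("individual_response", v) for v in entry.response.individual]
    pairs += [("social_response", v) for v in entry.response.social]
    return pairs


for slot, value in index_keys(ENTRY):
    print(slot, "->", value)
```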
Contextual cross-linking across sources and languages
GENERAL INFORMATION
WOMEN WITH TB
MEN WITH TB
Language 2
Language 2
Some problems call for new technology
Much of the health care information is procedural information
  • How to take temperature
  • Steps in treating a disease
How do you acquire and use information about procedures and processes across languages?
Dealing with Diarrhea
[Figure: process model diagram. Recoverable labels:]
Stages: Symptoms, Diagnose, Treat, Control
Alternative sub-events (Fever): Administer Cotrimoxazole, Administer Chloramphenicol, Administer Ampicillin, or Seek Medical Attention
Repeat-until sub-events (Dehydrated): Rehydrate (Fluid, Water) until rehydrated
Concurrent sub-events
Outcomes: Cured, Seek Medical Attention
Formalized pathway model schema
[Figure: schema diagram. Recoverable labels:]
FRAME: Actor, Theme, Instrument, Patient
PATH EVENT, COMPOSITE PATH EVENT (links: hasFrame, hasParameter, ISA)
RELATION(E1,E2): Subevent, Enable/Disable, Suspend/Resume, Abort/Terminate, Cancel/Stop, Mutually Exclusive, Coordinate/Synch
CONSTRUAL: Phase (enable, start, finish, ongoing, cancel), Manner (scales, rate, path), Zoom (expand, collapse)
PARAMETER: Inputs, Outputs, Preconditions, Effects, Resources, Grounding (Time, Location)
CONSTRUCT: Sequence, Concurrent/Conc. Sync, Choose/Alternative, Iterate/RepeatUntil(while), If-then-Else/Conditional
Key elements
  • preconditions, resources, effects, sub-events, grounding
  • Event-event relations (enable, suspend, disable, resume,…)
  • Event frames connect event models to language
  • Domain independent
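A few of the control constructs named in the schema (Sequence, Choose/Alternative, RepeatUntil, plus preconditions and effects on atomic events) can be sketched as a toy interpreter. Everything below is an assumption made for illustration: the class names, the state variables `fluid_deficit` and `ampicillin_doses`, and the ordering of steps. It is not the project's OWL-S model, it runs concurrent sub-events one after another, and it is not medical guidance; the step names are simply borrowed from the "Dealing with Diarrhea" slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]  # hypothetical state variables


def always(state: State) -> bool:
    return True


def noop(state: State) -> None:
    pass


@dataclass
class Action:
    """Atomic event: runs only if its precondition holds, then applies its effect."""
    name: str
    precondition: Callable[[State], bool]
    effect: Callable[[State], None]

    def run(self, state: State, trace: List[str]) -> bool:
        if not self.precondition(state):
            return False
        self.effect(state)
        trace.append(self.name)
        return True


@dataclass
class Sequence:
    """Runs steps in order and stops at the first step that fails."""
    steps: list

    def run(self, state, trace):
        return all(step.run(state, trace) for step in self.steps)


@dataclass
class Choose:
    """Tries alternatives in order and stops at the first one that succeeds."""
    alternatives: list

    def run(self, state, trace):
        return any(alt.run(state, trace) for alt in self.alternatives)


@dataclass
class RepeatUntil:
    """Repeats the body until the condition holds (checked before each pass)."""
    body: object
    until: Callable[[State], bool]
    max_iterations: int = 10

    def run(self, state, trace):
        for _ in range(self.max_iterations):
            if self.until(state):
                return True
            if not self.body.run(state, trace):
                return False
        return self.until(state)


def rehydrate_effect(state: State) -> None:
    state["fluid_deficit"] -= 1


PATHWAY = Sequence([
    Action("Diagnose", always, noop),
    RepeatUntil(Action("Rehydrate", always, rehydrate_effect),
                until=lambda s: s["fluid_deficit"] <= 0),
    Choose([
        Action("Administer Ampicillin", lambda s: s["ampicillin_doses"] > 0, noop),
        Action("Seek Medical Attention", always, noop),
    ]),
])


def run_pathway(state: State) -> List[str]:
    """Execute the toy pathway and return the names of the events that ran."""
    trace: List[str] = []
    PATHWAY.run(state, trace)
    return trace


print(run_pathway({"fluid_deficit": 2, "ampicillin_doses": 0}))
```

Because the constructs only manipulate event names and state, the same structure could in principle be paired with text in any language, which is the point the slides make about event frames.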
Some technical issues
Semantic Database is in RDF
  • The procedural aspects use OWL-S
Specific “doors-in” index transitions of process models
Input:
  • How do I treat Diarrhea?
  • What is a rehydration drink?
Output:
  • Each transition links to a snippet of a document
  • The process model is language independent
Spanish Output Display
English Output Display
Rehydration drink
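The idea that a door-in indexes transitions of a language-independent process model, and that each transition links to a document snippet in a given language, can be sketched with plain subject-predicate-object tuples. This is a stand-in for the RDF store mentioned in the slides, not a use of any RDF library; the predicate names, identifiers, and snippet references are all hypothetical placeholders.

```python
# Stand-in triples (subject, predicate, object). All names are hypothetical.
TRIPLES = [
    # Language-independent part: a door-in points at process model transitions.
    ("door:diarrhea", "hasTransition", "t:diagnose"),
    ("door:diarrhea", "hasTransition", "t:rehydrate"),
    # Language-specific part: each transition links to a snippet per language.
    ("t:diagnose", "snippet@en", "en-doc#dehydration-signs"),
    ("t:diagnose", "snippet@es", "es-doc#dehydration-signs"),
    ("t:rehydrate", "snippet@en", "en-doc#rehydration-drink"),
    ("t:rehydrate", "snippet@es", "es-doc#rehydration-drink"),
]


def objects(subject, predicate):
    """Return every object stored for a (subject, predicate) pair, in order."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def render(door, lang):
    """List (transition, snippet reference) pairs for one door-in and language."""
    result = []
    for transition in objects(door, "hasTransition"):
        for snippet in objects(transition, "snippet@" + lang):
            result.append((transition, snippet))
    return result


print(render("door:diarrhea", "en"))
print(render("door:diarrhea", "es"))
```

Both calls walk the same transitions; only the snippet references differ, which mirrors the separate Spanish and English output displays.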
Current Status

Working on initial doors-in
  • English and Spanish
  • Iterative design/Hesperian evaluation
First evaluation (March) with three (out of 12) doors-in
Extension to Tamil (same doors-in)
  • Tamil, Telugu
Extension to African Languages
  • Madelaine Plauché
NLP research questions

Extracting doors-in relevant information
  • From specific Hesperian books (OLPC is converting them to HTML, University of Iowa is building XML for the material).
  • Involves entity and relation extraction
  • Building and populating a semantic database
Can we obtain procedural information from textual sources?
  • Syntactic and semantic patterns
  • Distributional information
  • This would be extremely useful even in specific domains (agriculture, health care, legal protocols, etc.)
  • Modeling procedural information
Longer term question: Can we use (S)MT to support Hesperian’s translation efforts?
  • Initial work with Microsoft Research.
HCI Research
Presentation of Information
  • Topic/Concept navigation
  • Audio/Video Synch
  • Animation
  • Cell-Phone based presentation
Input
  • Speech
  • Multi-Modal input (ICSI/VST phrase translation system)
Initial use cases in the field
Namita Jacob, Saulina Arnold
  • Voluntary health workers in Tamil Nadu
  • Translated “Helping Children Who Are Blind”
Aravind Eye Hospital
  • Provides both paid and free health care services
  • Telemedicine set up with TIER
Evaluation
New Hesperian digital library site
  • First site to go up by early 2009
Field testing in village kiosks and in schools (Namita Jacob)
  • First test, Summer 2008
  • Selected “doors-in” (with WTIND and WWHND)
  • English and Tamil
Aravind Kiosks
Aravind Teacher Training.
Evaluation Metrics
Really hard to measure
  • TIER group and Hesperian have a lot of experience
Outcome compared to baseline for specific outbreaks
  • Digitized information versus books
  • Attitude change
  • Behavior change
Lessons learnt
Focus on need and applications
Immerse yourself in the project on the ground
  • Strong community involvement
  • Local ownership critical
  • Too many tech solutions are lab based, not tested and suboptimal.
      • Kerala kiosk experiments
Simple technologies effective but sometimes really advanced technologies are appropriate
Talk Outline
Why?
How: examples
What next?
  • Another Project
So what?
SMT and Internet Content
At most around 10 million pages in Indian languages
Around 20 billion English language pages
SMT can help
  • Translate English pages into Indian languages
  • Wikipedia, Science, Education pages...
Everyone is building search engines in Indian languages
  • But no web content
So What?
Well, it’s a start!
  • Fulfills immediate need
      • Tamil market
      • Hesperian materials being digitized
  • Reuse
      • Where there is no doctor is used in 100 countries
      • Small scale speech rec. with illiterates can be applied in Africa
      • Other African language efforts following similar paths
Links into related efforts
Related Berkeley work at TIER
Kiosks / Livelihood
  • Cellphones for pricing in rural Rwandan coffee markets
  • Computers and livelihood development in urban slums in Brazil
  • E-literacy / Entrepreneurship in rural Kerala
Education
  • Studies of social impacts of Computer Aided Learning in rural areas
  • Observations of shared computer usage among children in resource strapped areas
Telemedicine
  • Long-distance diagnosis using 802.11b
Teaching
  • ‘Technology and Development’ graduate class design (see reader/syllabus)
Conference
  • First peer-reviewed IEEE/ACM conference in series
Summary: this is challenging... but doable