Kovid Kapoor - 08005037
Aashimi Bhatia – 08D04008
Ravinder Singh – 08005018
Shaunak Chhaparia – 07005019
 Examples of ancient languages which were lost
Motivation : Why should we bother about such languages?
The Manual Process of Decipherment
Motivation for a Computational Model
A Statistical Method for Decipherment
Conclusions
 A language is said to be “lost” when modern
scholars cannot reconstruct text written in it.
Slightly different from a “dead” language – a language which people can translate to/from, but no one uses it anymore in everyday life.
 Generally happens when one language gets
replaced by another.
For example, Native American languages were replaced by English, Spanish, etc.
 Egyptian Hieroglyphs
A formal writing system used by ancient Egyptians, consisting of logographic and alphabetic symbols.
Finally deciphered in the early 19th century, following the lucky discovery of the Rosetta Stone.
 Ugaritic Language
 Tablets with engravings found in the lost city of Ugarit,
Syria.
 Researchers recognized that it is related to Hebrew, and
could identify some parallel words.
 Indus Script
Written in and around present-day Pakistan, around 2500 BC
 Over 4000 samples of the text have been found.
 Still not deciphered successfully!
 What makes it difficult to decipher?
http://en.wikipedia.org/wiki/File:Indus_seal_impression.jpg
 Historical knowledge expansion
 Very helpful in learning about the history of the place
where the language was written.
Alternative sources of information : coins, drawings, buried tombs.
These sources are not as precise as reading the literature of the region, which gives a clearer picture.
 Learning about the past explains the present
 A lot of the culture of a place is derived from ancient
cultures.
 Boosts our understanding of our own culture.
 From a linguistic point of view
We can figure out how certain languages developed over time.
The origins of some words can be explained.
 Similar to a cryptographic decryption process
 Frequency analysis based techniques used
 First step : identify the writing system
Logographic, alphabetic or syllabic?
 Usually determined by the number of distinct symbols.
 Identify if there is a closely related known
language
Hope for finding bitexts : translations of a text in the lost language into a known language, like Latin, Hebrew, etc.
http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
The earliest attempt was made by Horapollo in the 5th century.
However, his explanations were mostly wrong!
They proved to be an impediment to the process for 1000 years!
Arab historians were able to partially decipher the script in the 9th and 10th centuries.
Major Breakthrough : the discovery of the Rosetta Stone by Napoleon's troops.
http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
The stone bears a decree issued by the king in three scripts : hieroglyphic, Demotic, and ancient Greek!
Finally deciphered in 1822 by Jean-François Champollion.
Note that even with the availability of a bitext, full decipherment took over 20 more years!
http://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Rosetta_Stone_BW.jpeg/200px-Rosetta_Stone_BW.jpeg
 The inscribed words consisted of only 30 distinct
symbols.
Very likely to be alphabetic.
The location where the tablets were found suggested that the language was closely related to the Semitic languages.
Some words in Ugaritic had the same origin as words in Hebrew.
For example, the Ugaritic word for “king” is the same as the Hebrew word.
http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
Lucky discovery : Hans Bauer guessed that the writing on an axe that had been found contained the word “axe”!
This led to the revision of some earlier hypotheses, and resulted in the decipherment of the entire script!
http://knp.prs.heacademy.ac.uk/images/cuneiformrevealed/scripts/ugaritic.jpg
A very time-consuming exercise; years, even centuries, have been taken for a successful decipherment.
Even when some basic information about the language is known, such as the syntactic structure or a closely related language, a long time is required to produce character and word mappings.
 Once some knowledge about the language has
been learnt, is it possible to use a program to
produce word mappings?
 Can the knowledge of a closely related language be
used to decipher a lost language?
If possible, this would save a lot of effort and time.
 Successful archaeological decipherment has turned out
to require a synthesis of logic and intuition…that
computers do not (and presumably cannot) possess.
– Andrew Robinson
 Notice that manual efforts have some guiding
principles
A common starting point is to compare letter and word frequencies with a known language (a small sketch of this follows below).
 Morphological analysis plays a crucial role as well
 Highly frequent morpheme correspondences can be
particularly revealing.
 The model tries to capture these letter/word level
mappings and morpheme correspondences.
http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
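As a small illustration of the frequency-comparison principle above, here is a minimal Python sketch; the tiny corpora and the idea of pairing symbols by frequency rank are purely illustrative assumptions, not the method of any particular decipherment.

```python
from collections import Counter

def char_frequencies(words):
    """Relative frequency of each symbol across a list of word tokens."""
    counts = Counter(ch for word in words for ch in word)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

# Hypothetical corpora: transcribed lost-language words and known-language words.
lost_words = ["12321", "231", "1331"]
known_words = ["anna", "nap", "pan"]

lost_freq = char_frequencies(lost_words)
known_freq = char_frequencies(known_words)

# Rank the symbols of both languages by frequency;
# symbols with similar ranks are candidate correspondences to investigate.
lost_ranked = sorted(lost_freq, key=lost_freq.get, reverse=True)
known_ranked = sorted(known_freq, key=known_freq.get, reverse=True)
for u, h in zip(lost_ranked, known_ranked):
    print(f"{u} ({lost_freq[u]:.2f})  <->  {h} ({known_freq[h]:.2f})")
```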
 We are given a corpus in the lost language, and a non-
parallel corpus in a related language from the same
family.
 Our primary goals :
Finding the mapping between the alphabets of the lost and the known language.
Translating words in the lost language into corresponding cognates in the known language.
http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
 We make several assumptions in this model :
 That the writing system is alphabetic in nature
Can be easily verified by counting the number of distinct symbols in the recovered texts.
 That the corpus has been transcribed into an
electronic format
 Means that each character is uniquely identified.
 About the morphology of the language :
 Each word consists of a stem, prefix and suffix, where
the latter two may be omitted
 Holds true for a large variety of human languages
The morpheme inventories and their frequencies in the known language are given.
 In essence, the input consists of two parts :
 A list of unanalyzed words in a lost language
A morphologically analyzed lexicon in a known related language
 Consider the following example, consisting of words in
a lost language closely related to English, but written
using numerals.
15234 --- asked
1525 --- asks
4352 --- desk
Notice the pair of endings, -34 and -5, occurring after the same initial sequence 152. These might correspond to -ed and -s respectively.
Thus, 3=e, 4=d and 5=s.
 Now, we can say that 435=des, and using our
knowledge of English, we can suppose that this word is
very likely to be desk.
 As this example illustrates, we proceed by discovering
both character- and morpheme-level mappings.
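A minimal sketch of this intuition on the toy corpus above: split every word at every position and look for "stems" that occur with more than one ending. The variable names and the decision to print every recurring stem are illustrative only.

```python
from collections import defaultdict

# Toy lost-language corpus from the example above.
lost_words = ["15234", "1525", "4352"]

# Collect, for every candidate stem, the set of endings it occurs with.
endings_by_stem = defaultdict(set)
for w in lost_words:
    for i in range(1, len(w)):
        endings_by_stem[w[:i]].add(w[i:])

# A stem that recurs with several short endings suggests a morpheme boundary.
# Here the stem "152" occurs with the endings "34" and "5", which we can
# hypothesize correspond to the frequent English suffixes "-ed" and "-s".
for stem, endings in sorted(endings_by_stem.items()):
    if len(endings) > 1:
        print(stem, sorted(endings))
```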
 Another intuition the model should capture is the
sparsity of the mapping.
A correct mapping will preserve phonetic relations between the two related languages.
 Each character in the unknown language will map to a
small number of characters in the related language.
We assume that each morpheme in the lost language is probabilistically generated jointly with a latent counterpart in the known language.
 The challenge: Each level of correspondence can
completely describe the observed data. So using a
mechanism based on one leaves no room for the other.
 The solution: Using a Dirichlet Process to model
probabilities (explained further).
http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
 There are four basic layers in the generative process
 Structural Sparsity
 Character-edit Distribution
 Morpheme-pair Distributions
 Word Generation
http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
We need to control the sparsity of the edit-operation probabilities, encoding the linguistic intuition that the character-level mapping should be sparse.
The set of edit operations includes character substitutions, insertions and deletions. We assign a variable λe corresponding to every edit operation e.
The set of character correspondences whose variable is set to 1, { (u,h) : λ(u,h) = 1 }, conveys the set of phonetically valid correspondences.
 We define a joint prior over these variables to encourage
sparse character mappings.
This prior can be viewed as a distribution over binary matrices and is defined to encourage every row and column to sum to low integer values (typically 1).
For a given matrix, define a count c(u) as the number of corresponding letters that u has in that matrix. Formally, c(u) = ∑_h λ(u,h).
We now define a function f_i = max(0, |{u : c(u) = i}| − b_i). For any i other than 1, f_i should be as low as possible.
Now the probability of this matrix is given by P(λ) = (1/Z) exp(∑_i w_i f_i).
Here Z is the normalization factor and w is the weight vector.
w_i is either zero or negative, to ensure that the probability is high when the values of f_i are low.
The values of b_i and w_i can be adjusted depending on the number of characters in the lost language and the related language.
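Below is a minimal sketch of this prior, assuming the form P(λ) ∝ exp(∑_i w_i f_i) given above; the particular b_i and w_i values and the toy mappings are illustrative assumptions.

```python
from collections import defaultdict

def log_prior(edges, b, w, default_w=-10.0):
    """Unnormalized log-probability of a binary character-mapping matrix.

    edges : set of (u, h) pairs whose indicator lambda_(u,h) is 1
    b, w  : dicts giving the budget b_i and the (non-positive) weight w_i
    """
    # c(u): number of known-language letters that lost-language letter u maps to.
    c = defaultdict(int)
    for u, h in edges:
        c[u] += 1
    # Count how many letters have exactly i counterparts, then f_i = max(0, count - b_i).
    letters_with_i = defaultdict(int)
    for cu in c.values():
        letters_with_i[cu] += 1
    return sum(w.get(i, default_w) * max(0, n - b.get(i, 0))
               for i, n in letters_with_i.items())

# Illustrative settings: one counterpart per letter is unpenalized,
# two or more counterparts per letter are discouraged.
b = {1: 30, 2: 0}
w = {1: 0.0, 2: -5.0}

sparse = {("u1", "h1"), ("u2", "h2")}
dense = {("u1", "h1"), ("u1", "h2"), ("u2", "h2"), ("u2", "h3")}
print(log_prior(sparse, b, w), ">", log_prior(dense, b, w))  # sparse mapping scores higher
```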
 We now draw a base distribution G0 over character edit
sequences.
 The probability of a given edit sequence P(e) depends on
the value of the indicator variable of individual edit
operations λe, and a function depending on the number of
insertions and deletions in the sequence, q(#ins(e), #del(e)).
The factor depending on the number of insertions and deletions is set according to the average word lengths of the lost language and the related language.
Example: the average Ugaritic word is 2 letters longer than the average Hebrew word. Therefore, we set q to disallow any deletions and to allow 1 insertion per sequence, with probability 0.4.
The part depending on the λe's makes the distribution spike at 0 if the indicator value is 0 and keeps it unconstrained otherwise (a spike-and-slab prior).
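A minimal sketch of such a base distribution, under the stated q (no deletions, at most one insertion with probability 0.4) and with a uniform substitution probability whenever the corresponding λ is 1; the uniform 1/30 factor and the toy edit sequences are assumptions for illustration, not the paper's exact parameterization.

```python
def edit_sequence_prob(edits, lam, p_sub=1.0 / 30, p_ins=0.4):
    """Unnormalized probability of a character edit sequence under a G0-style base distribution.

    edits : list of operations ("sub", u, h), ("ins", h) or ("del", u)
    lam   : dict mapping (u, h) -> 0/1 indicator for substitution operations
    """
    n_ins = sum(1 for e in edits if e[0] == "ins")
    n_del = sum(1 for e in edits if e[0] == "del")
    # q(#ins, #del): forbid deletions, allow at most one insertion per sequence.
    if n_del > 0 or n_ins > 1:
        return 0.0
    prob = p_ins if n_ins == 1 else (1.0 - p_ins)
    for e in edits:
        if e[0] == "sub":
            _, u, h = e
            # Spike-and-slab behaviour: probability collapses to 0 when lambda_(u,h) = 0.
            if lam.get((u, h), 0) == 0:
                return 0.0
            prob *= p_sub
    return prob

lam = {("u1", "h1"): 1, ("u2", "h2"): 1}
print(edit_sequence_prob([("sub", "u1", "h1"), ("sub", "u2", "h2"), ("ins", "h3")], lam))
print(edit_sequence_prob([("sub", "u1", "h2")], lam))  # disallowed substitution -> 0.0
```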
The base distribution G0, along with a fixed parameter α, defines a Dirichlet process, which provides a distribution over morpheme-pair distributions.
 The resulting distributions are likely to be skewed in favor
of a few frequently occurring morpheme-pairs, while
remaining sensitive to character-level probabilities of the
base distribution.
Our model distinguishes between the 3 kinds of morphemes: prefixes, stems and suffixes. We therefore use different values of α for each.
Also, since the suffix and prefix depend on the part of speech of the stem, we draw a single distribution Gstm for stems, but maintain separate distributions Gsuf|stm and Gpre|stm for each possible stem part-of-speech.
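A minimal sketch of the Dirichlet process in its Chinese-restaurant-process form: previously generated morpheme pairs are reused in proportion to how often they have occurred, while new pairs are drawn from the base distribution with probability proportional to α. The toy base sampler below is a stand-in, not the actual character-edit model.

```python
import random
from collections import Counter

def dp_draw(counts, alpha, base_sampler):
    """Draw one morpheme pair from a Dirichlet process (Chinese-restaurant view)."""
    total = sum(counts.values())
    if random.random() < alpha / (alpha + total):
        pair = base_sampler()  # new pair from the base distribution G0
    else:
        # Reuse an existing pair with probability proportional to its count.
        r = random.uniform(0, total)
        acc = 0.0
        for pair, n in counts.items():
            acc += n
            if r <= acc:
                break
    counts[pair] += 1
    return pair

# Stand-in base distribution over (lost, known) morpheme pairs.
def toy_base():
    return random.choice([("152", "ask"), ("435", "desk"), ("34", "ed"), ("5", "s")])

counts = Counter()
draws = [dp_draw(counts, alpha=1.0, base_sampler=toy_base) for _ in range(50)]
print(Counter(draws).most_common())  # skewed toward a few frequently reused pairs
```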
 Once the morpheme-pair distributions have been drawn,
actual word pairs may now be generated.
 Based on some prior, we first decide if a word in the lost
language has a cognate in the known language.
If it does, then a cognate word pair (u, h) is produced.
Otherwise, a lone word u is generated.
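A minimal sketch of this last step, assuming hypothetical draw functions for the prefix, stem and suffix pair distributions and a fixed cognate prior; in the real model these draws would come from Gpre|stm, Gstm and Gsuf|stm.

```python
import random

def generate_word(p_cognate, draw_pre, draw_stm, draw_suf):
    """Generate a cognate word pair (u, h) or a lone lost-language word (u, None)."""
    # Draw (lost, known) morpheme pairs; prefix and suffix may be empty.
    u_pre, h_pre = draw_pre()
    u_stm, h_stm = draw_stm()
    u_suf, h_suf = draw_suf()
    u = u_pre + u_stm + u_suf
    if random.random() < p_cognate:
        return u, h_pre + h_stm + h_suf  # cognate pair (u, h)
    return u, None                       # lone word u with no known-language cognate

# Hypothetical morpheme-pair draws, reusing the toy numeral example.
draw_pre = lambda: ("", "")
draw_stm = lambda: ("152", "ask")
draw_suf = lambda: random.choice([("34", "ed"), ("5", "s")])

print(generate_word(0.8, draw_pre, draw_stm, draw_suf))
```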
 This model captures both character and lexical level
correspondences, while utilizing morphological knowledge
of the known language.
 An additional feature of this multi-layered model structure
is that each distribution over morpheme pairs is derived
from the single character-level base distribution G0.
 As a result, any character-level mappings learned from one
correspondence will be propagated to other morpheme
distributions.
 Also, the character-level mappings obey sparsity
constraints
 Applied on Ugaritic language
 Undeciphered corpus contains 7,386 unique word
types.
The Hebrew Bible was used as the known-language corpus; Hebrew is closely related to ancient Ugaritic.
Morphological and POS annotations are assumed to be available for the Hebrew lexicon.
http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
 The method identifies Hebrew cognates for 2,155
words, covering almost 1/3rd of the Ugaritic vocabulary.
The baseline method correctly maps 22 out of 30 characters to their Hebrew counterparts, and translates only 29% of all the cognates.
This method correctly translates 60.4% of all cognates.
It yields correct mappings for 29 out of 30 characters.
 Even with character mappings, many words can be
correctly translated only by examining their context.
 The model currently fails to take the contextual
information into account.
http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
 We saw how language decipherment is an
extremely complex task.
Years of effort are required for the successful decipherment of each lost language.
Success depends on the amount of corpus available in the unknown language.
But availability alone does not make the task easy.
The statistical model has shown promise.
It can be developed further and applied to more languages.
Wikipedia article on Decipherment of Hieroglyphs
http://en.wikipedia.org/wiki/Decipherment_of_hieroglyphic_writing
Lost Languages: The Enigma of the World's Undeciphered Scripts by Andrew Robinson (2009)
http://entertainment.timesonline.co.uk/tol/arts_and_entertainment/books/non-fiction/article5859173.ece
A Statistical Model for Lost Language Decipherment, Benjamin Snyder, Regina Barzilay, and Kevin Knight, ACL (2010)
http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
A staff talk from the Straight Dope Science Advisory Board – How come we can't decipher the Indus script? (2005)
http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
Wade Davis on Endangered Cultures (2008)
http://www.ted.com/talks/wade_davis_on_endangered_cultures.html