Kovid Kapoor - 08005037
Aashimi Bhatia – 08D04008
Ravinder Singh – 08005018
Shaunak Chhaparia – 07005019
 Examples of ancient languages which were lost
 Motivation : Why should we bother about such languages?
The Manual Process of Decipherment
Motivation for a Computational Model
 A Statistical Method for Decipherment
 A language is said to be “lost” when modern
scholars cannot reconstruct text written in it.
 Slightly different from a “dead” language – a language which people can still translate to/from, but no one uses it anymore in everyday life.
 Generally happens when one language gets
replaced by another.
 For example, native American languages were replaced by English, Spanish, etc.
 Egyptian Hieroglyphs
 A formal writing system used by ancient Egyptians, combining logographic and alphabetic symbols.
 Finally deciphered in the early 19th century, following the lucky discovery of the Rosetta Stone.
 Ugaritic Language
 Tablets with engravings were found in the lost city of Ugarit, in modern-day Syria.
 Researchers recognized that it is related to Hebrew, and
could identify some parallel words.
 Indus Script
 Used in and around present-day Pakistan, around 2500 BC.
 Over 4000 samples of the text have been found.
 Still not deciphered successfully!
 What makes it difficult to decipher?
 Historical knowledge expansion
 Very helpful in learning about the history of the place
where the language was written.
 Alternate sources of information : coins, drawings,
buried tombs.
 These sources are not as precise as reading the literature of the region, which gives a clearer picture.
 Learning about the past explains the present
 A lot of the culture of a place is derived from its ancient past.
 Boosts our understanding of our own culture.
 From a linguistic point of view
 We can figure out how certain languages developed over time.
 The origins of some words can be explained.
 Similar to a cryptographic decryption process
 Frequency analysis based techniques used
 First step : identify the writing system
 Logographic, alphabetic, or syllabic?
 Usually determined by the number of distinct symbols (see the sketch below).
 Identify if there is a closely related known language.
 Hope for finding bitexts : translations of a text in the lost language into a known language, like Latin, Hebrew, etc.
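As a rough illustration of the symbol-counting heuristic above (a minimal sketch, not part of any decipherment described here), the following Python snippet classifies a transcribed corpus by its number of distinct symbols. The threshold values are illustrative assumptions only.

# Hypothetical sketch: guess the writing system from the number of distinct
# symbols in a transcribed corpus. Threshold values are rough assumptions.
def guess_writing_system(corpus_words):
    symbols = {ch for word in corpus_words for ch in word}
    n = len(symbols)
    if n < 40:         # alphabets typically use a few dozen signs
        return "alphabetic"
    if n < 400:        # syllabaries typically use tens to a few hundred signs
        return "syllabary"
    return "logographic"  # logographic scripts use hundreds to thousands of signs

print(guess_writing_system(["15234", "1525", "4352"]))  # -> "alphabetic"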
 Earliest attempt made by Horapollo in the 5th century.
 However, explanations were mostly wrong!
 Proved to be an impediment to the process for 1000 years.
 Arab historians were able to partly decipher the script in the 9th and 10th centuries.
 Major Breakthrough : Discovery of the Rosetta Stone by Napoleon’s troops.
 The stone bears a decree issued by the king in three scripts : hieroglyphic, Demotic, and ancient Greek!
 Finally deciphered in 1822 by Jean-François Champollion.
 Note that even with the availability of a bitext, full decipherment took 20 more years!
 The inscribed words consisted of only 30 distinct symbols.
 Very likely to be alphabetic.
 The location where the tablets were found suggested that the language is closely related to the Semitic languages.
 Some words in Ugaritic had the same origin as words
in Hebrew
 For example, the Ugaritic word for king is the same as the Hebrew word.
 Lucky discovery : Hans Bauer guessed that the writing on an axe that was found was the word for “axe”!
 Led to revision of some earlier hypotheses, and resulted in the decipherment of the entire script!
 A very time-consuming exercise; years, even centuries, have been taken for successful decipherment.
 Even when some basic information about the language is known, such as its syntactic structure or a closely related language, a long time is required to produce character and word mappings.
 Once some knowledge about the language has
been learnt, is it possible to use a program to
produce word mappings?
 Can the knowledge of a closely related language be
used to decipher a lost language?
 If possible, this would save a lot of effort and time.
 Successful archaeological decipherment has turned out
to require a synthesis of logic and intuition…that
computers do not (and presumably cannot) possess.
– Andrew Robinson
 Notice that manual efforts have some guiding principles.
 A common starting point is to compare letter and word
frequencies with a known language
 Morphological analysis plays a crucial role as well
 Highly frequent morpheme correspondences can be
particularly revealing.
 The model tries to capture these letter/word level
mappings and morpheme correspondences.
 We are given a corpus in the lost language, and a non-parallel corpus in a related language from the same language family.
 Our primary goals :
 Finding the mapping between the alphabets of the lost
and known language.
 Translate words in the lost language into corresponding cognates in the known language.
 We make several assumptions in this model :
 That the writing system is alphabetic in nature
 Can be easily verified by counting the number of
symbols in the found record.
 That the corpus has been transcribed into an
electronic format
 Means that each character is uniquely identified.
 About the morphology of the language :
 Each word consists of a stem, prefix and suffix, where
the latter two may be omitted
 Holds true for a large variety of human languages
 The morpheme inventories and their frequencies in the known language are given.
 In essence, the input consists of two parts :
 A list of unanalyzed words in a lost language
 A morphologically analyzed lexicon in a known related language.
 Consider the following example, consisting of words in
a lost language closely related to English, but written
using numerals.
 15234 -- asked
 1525 -- asks
 4352 -- desk
 Notice the pair of endings, -34 and -5, following the same initial sequence 152. These might correspond to -ed and -s respectively.
 Thus, 3=e, 4=d and 5=s
 Now, we can say that 435=des, and using our
knowledge of English, we can suppose that this word is
very likely to be desk.
 As this example illustrates, we proceed by discovering
both character- and morpheme-level mappings.
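A minimal sketch of this intuition (an illustration only, not the paper's algorithm): group words that share a long initial sequence and collect the differing endings as candidate suffix correspondences. The function name and the minimum stem length are assumptions.

# Illustrative sketch: find pairs of words with a shared initial sequence and
# report the differing endings (candidate suffixes, e.g. -34 vs -5).
from itertools import combinations

def candidate_suffix_pairs(words, min_stem_len=3):
    pairs = []
    for w1, w2 in combinations(words, 2):
        stem = ""                        # longest common prefix of the two words
        for a, b in zip(w1, w2):
            if a != b:
                break
            stem += a
        if len(stem) >= min_stem_len:
            pairs.append((stem, w1[len(stem):], w2[len(stem):]))
    return pairs

print(candidate_suffix_pairs(["15234", "1525", "4352"]))
# [('152', '34', '5')] -> shared stem 152 with endings -34 and -5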
 Another intuition the model should capture is the
sparsity of the mapping.
 The correct mapping will preserve phonetic relations between the two related languages.
 Each character in the unknown language will map to a
small number of characters in the related language.
 We assume that each morpheme in the lost language is probabilistically generated jointly with a latent counterpart in the known language.
 The challenge: Each level of correspondence can
completely describe the observed data. So using a
mechanism based on one leaves no room for the other.
 The solution: Using a Dirichlet Process to model
probabilities (explained further).
 There are four basic layers in the generative process
 Structural Sparsity
 Character-edit Distribution
 Morpheme-pair Distributions
 Word Generation
Model Structure (cont…)
 We need a control on the sparsity of the edit-operation
probabilities, encoding the linguistic intuition that
character-level mapping should be sparse.
 The set of edit operations include character substitutions,
insertions and deletions. We assign a variable λe
corresponding to every edit operation e.
 The set of character correspondences whose variables are set to 1, { (u,h) : λ(u,h) = 1 }, conveys the set of phonetically valid character mappings.
 We define a joint prior over these variables to encourage
sparse character mappings.
 This prior can be viewed as a distribution over binary matrices and is defined to encourage every row and column to sum to low integer values (typically 1).
 For a given matrix, define a count c(u) which is the number
of corresponding letters that u has in that matrix.
Formally, c(u) = ∑h λ(u,h)
 We now define a function fi = max(0, |{u : c(u) = i}| - bi). For any i other than 1, fi should be as low as possible.
 Now the probability of this matrix is given by P(λ) = (1/Z) exp( ∑i wi fi )
 Here Z is the normalization factor and the wi are weights.
 Each wi is either zero or negative, to ensure that the probability is high when the fi are low.
 The values of bi and wi can be adjusted depending on the number of characters in the lost language and the related language.
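A minimal sketch of how this prior could be scored (the data structures and helper names are assumptions, not the paper's code): compute c(u) for each lost-language letter u, form the counts fi, and return the unnormalized probability exp(∑i wi fi) for a given binary mapping matrix.

# Assumed sketch of the structural sparsity prior:
# P(lambda) = (1/Z) * exp(sum_i w_i * f_i), with
# c(u)  = number of known-language letters that lost letter u maps to,
# f_i   = max(0, |{u : c(u) = i}| - b_i).
import math
from collections import Counter

def unnormalized_prior(lam, b, w):
    """lam: dict from lost letter u to the set of known letters h with lambda(u,h) = 1.
    b, w: dicts from count value i to budget b_i and (non-positive) weight w_i."""
    c = {u: len(hs) for u, hs in lam.items()}   # c(u)
    counts = Counter(c.values())                # |{u : c(u) = i}|
    score = 0.0
    for i, n_u in counts.items():
        f_i = max(0, n_u - b.get(i, 0))
        score += w.get(i, 0.0) * f_i            # w_i <= 0 penalizes excess mappings
    return math.exp(score)                      # unnormalized; Z omitted

# Toy example: a one-to-one mapping is favoured over a one-to-many mapping.
b = {1: 30}              # allow up to 30 letters with exactly one counterpart
w = {0: -5.0, 2: -5.0}   # penalize unmapped letters and double mappings
sparse = {"a": {"x"}, "b": {"y"}}
dense = {"a": {"x", "y"}, "b": {"y", "z"}}
print(unnormalized_prior(sparse, b, w) > unnormalized_prior(dense, b, w))  # True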
 We now draw a base distribution G0 over character edit sequences.
 The probability of a given edit sequence, P(e), depends on the values of the indicator variables λe of its individual edit operations, and on a function of the number of insertions and deletions in the sequence, q(#ins(e), #del(e)).
 The factor q(#ins, #del) is chosen based on the average word lengths of the lost language and the related language.
Example: the average Ugaritic word is 2 letters longer than the average Hebrew word.
Therefore, we set q so as to disallow any deletions and to allow at most 1 insertion per sequence, with probability 0.4.
 The factor depending on the λe's makes the distribution spike at 0 if any λe is 0, and leaves it unconstrained otherwise (a spike-and-slab prior).
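An illustrative sketch of the base weight of an edit sequence under these assumptions (the function names, the edit-tuple encoding, and the 0.6/0.4 split for zero/one insertions are assumptions built on the Ugaritic/Hebrew example above):

# Sketch: an edit sequence gets non-zero weight only if every edit operation in
# it is switched on (lambda_e = 1), scaled by q(#ins, #del).
def q(n_ins, n_del):
    # Mirrors the example above: no deletions, at most one insertion (prob. 0.4).
    if n_del > 0 or n_ins > 1:
        return 0.0
    return 0.4 if n_ins == 1 else 0.6

def base_weight(edit_seq, lam):
    """edit_seq: list of ('sub', u, h), ('ins', h) or ('del', u) operations.
    lam: set of edit operations that the sparsity layer has switched on."""
    if any(op not in lam for op in edit_seq):
        return 0.0                               # spike at zero if any lambda_e = 0
    n_ins = sum(op[0] == "ins" for op in edit_seq)
    n_del = sum(op[0] == "del" for op in edit_seq)
    return q(n_ins, n_del)                       # unnormalized; Z omitted

lam = {("sub", "1", "a"), ("sub", "5", "s"), ("ins", "k")}
print(base_weight([("sub", "1", "a"), ("ins", "k")], lam))  # 0.4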
 The base distribution G0, along with a fixed parameter α, defines a Dirichlet process, which provides a distribution over morpheme-pair distributions.
 The resulting distributions are likely to be skewed in favor
of a few frequently occurring morpheme-pairs, while
remaining sensitive to character-level probabilities of the
base distribution.
 Our model distinguishes between the 3 kinds of morphemes: prefixes, stems, and suffixes. We therefore use different values of α for each.
 Also, since the suffix and prefix depend on the part of speech of the stem, we draw a single distribution Gstm for stems, but maintain separate distributions Gsuf|stm and Gpre|stm for each possible stem part-of-speech.
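A minimal sketch of the Dirichlet-process behaviour described above, written as a Chinese-restaurant-process style sampler (an assumed illustration, not the paper's inference procedure): frequent morpheme pairs are re-used, and new pairs fall back on the character-level base distribution G0.

# Sketch of a draw from G ~ DP(alpha, G0) over morpheme pairs.
import random

def draw_morpheme_pair(counts, alpha, sample_from_G0):
    """counts: dict of previously generated (lost, known) morpheme pairs -> count."""
    total = sum(counts.values())
    if random.random() < alpha / (alpha + total):
        pair = sample_from_G0()        # new pair, scored by character-level edits
    else:
        # reuse an existing pair with probability proportional to its count
        r = random.uniform(0, total)
        acc = 0.0
        for pair, n in counts.items():
            acc += n
            if r <= acc:
                break
    counts[pair] = counts.get(pair, 0) + 1
    return pair

counts = {("152", "ask"): 3}
print(draw_morpheme_pair(counts, alpha=1.0, sample_from_G0=lambda: ("435", "desk")))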
 Once the morpheme-pair distributions have been drawn,
actual word pairs may now be generated.
 Based on some prior, we first decide if a word in the lost
language has a cognate in the known language.
 If it does, then a cognate word pair (u, h) is produced by drawing prefix, stem, and suffix morpheme pairs from the distributions above and concatenating them.
 Otherwise, a lone word u is generated.
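A sketch of this generative step under the assumptions stated on this slide (all sampler names and arguments are hypothetical placeholders): first decide whether the lost word has a cognate; if so, draw stem, prefix, and suffix pairs and concatenate them; otherwise emit a lone lost-language word.

# Sketch of word generation; helper samplers are placeholders.
import random

def generate_word(p_cognate, draw_stem_pair, draw_prefix_pair,
                  draw_suffix_pair, draw_lone_word):
    if random.random() < p_cognate:
        # cognate pair: stem first, then prefix/suffix conditioned on the stem's POS
        (u_stm, h_stm), pos = draw_stem_pair()
        u_pre, h_pre = draw_prefix_pair(pos)
        u_suf, h_suf = draw_suffix_pair(pos)
        return (u_pre + u_stm + u_suf, h_pre + h_stm + h_suf)
    return (draw_lone_word(), None)    # lost-language word with no cognate

print(generate_word(
    p_cognate=0.8,
    draw_stem_pair=lambda: (("152", "ask"), "VERB"),
    draw_prefix_pair=lambda pos: ("", ""),
    draw_suffix_pair=lambda pos: ("5", "s"),
    draw_lone_word=lambda: "9999",
))  # e.g. ('1525', 'asks') or ('9999', None)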
 This model captures both character and lexical level
correspondences, while utilizing morphological knowledge
of the known language.
 An additional feature of this multi-layered model structure
is that each distribution over morpheme pairs is derived
from the single character-level base distribution G0.
 As a result, any character-level mappings learned from one correspondence will be propagated to other morpheme pairs.
 Also, the character-level mappings obey the sparsity constraints.
 Applied to the Ugaritic language
 The undeciphered corpus contains 7,386 unique word types.
 The Hebrew Bible was used as the known-language corpus; Hebrew is closely related to ancient Ugaritic.
 Morphological and POS annotations are assumed to be available for the Hebrew lexicon.
 The method identifies Hebrew cognates for 2,155
words, covering almost 1/3rd of the Ugaritic vocabulary.
 The baseline method correctly maps 22 out of 30
characters to their Hebrew counterparts, and
translates only 29% of all the cognates
 This method correctly translates 60.4% of all cognates.
 This method yields the correct mapping for 29 out of 30 characters.
 Even with character mappings, many words can be
correctly translated only by examining their context.
 The model currently fails to take the contextual
information into account.
 We saw how language decipherment is an
extremely complex task.
 Years of effort are required for the successful decipherment of each lost language.
 Success depends on the amount of corpus available in the unknown language.
 But availability does not make it easy.
 The statistical model has shown promise.
 It can be developed further and used for more lost languages.
 Wikipedia article on Decipherment of Hieroglyphs
 Lost Languages: The Enigma of the World's Undeciphered
Scripts by Andrew Robinson (2009)
 A Statistical Model for Lost Language Decipherment, by Benjamin Snyder, Regina Barzilay, and Kevin Knight, ACL 2010
 A staff talk from Straight Dope Science Advisory Board – How
come we can’t decipher the Indus Script? (2005)
 Wade Davis on Endangered Cultures (2008)

Lost Language Decipherment