Advanced Toolbox
June 29-July 2, 2010
Albert Bickford
SIL International
[email protected]
© 2010 J. Albert Bickford.
May be freely copied for non-profit (educational,
scientific, or humanitarian) use.
10/3/2015 5:43 AM
1
Course goals
• Help experienced users of Toolbox to use it more effectively
• Respond to needs of the class
• Share with each other what works
• Develop ability to help others
• Prerequisites: Previous familiarity with Toolbox (at least basic
dictionary and text glossing)
Outline of topics
• June 29 (Day 1)
– Introduction and getting acquainted
– Toolbox vs. FLEx
– Backup
– Interlinear text
• June 30 – July 2
– Topics to be decided, based on class desires
Logistics
• Temporary course website:
http://www.und.edu/dept/linguistics/private/toolbox.htm
• See “workremote.doc” on website for detailed
contact information and software instructions
• Skype: audio, screen-sharing
• Bomgar Network Streaming: remote control
Logistics
• Local coordinator: TBA
• Work together and help each other
• Consultation with me by appointment
Getting acquainted
• Please fill in “questionnaire.doc” from the
website and email to me.
• Discussion:
– What types of projects have you used Toolbox for?
– What are the main things you want to learn about
it?
• If I go too fast, stop me and ask for further
explanation.
Toolbox and FLEx
• Both developed within SIL
– Toolbox started (as Shoebox) in the early 1990s
– FieldWorks FLEx started ca. 2005; intended to
bring together the power of LinguaLinks and the
ease of use of Toolbox
• Toolbox development continued in parallel
with FLEx, to fill the gap until FLEx was mature
enough to replace it
• We’ve reached that transition point
Toolbox and FLEx
• FieldWorks aims to provide features that would
have been difficult to patch onto Toolbox
– Full support for non-Roman writing systems and Unicode
– Collaboration among teams of workers, not just one user
per project
– Integrated suite of tools for field research, with easy data sharing between tools: dictionary, glossed texts, parser,
grammar, discourse analysis, etc.
Toolbox and FLEx
• Toolbox is nearing the end of its life-cycle
– Usage will likely taper off over the next few years
– Most new users should start with FLEx (InField
recommendation)
– Established users will need to decide:
• Continue with Toolbox (but plan for FLEx)
• Change to FLEx now (convert data or start over)
– Ongoing limited use of Toolbox for an indefinite
period for some tasks (new technology never fully
replaces the old)
Backup
• Make sure you have a spare copy of your data
before experimenting on it in this workshop!
• Easiest: Just make a copy of the whole folder
where the data is stored.
– How to find out where it is stored
– Store the spare copy in a safe place
Backup
• Backup strategy
– How often? Answer: How much work are you
willing to lose?
– Best: Make copies on an external device,
preferably stored away from your computer
– Keep it simple and easy to recover; avoid
compression and proprietary backup formats
– Automate it, e.g. Cobian backup
(www.cobian.com)
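The "copy the whole folder" advice can be scripted in a few lines. This is only a minimal sketch, not a replacement for a tool such as Cobian; the function name `backup_folder` and the timestamped folder naming are illustrative assumptions.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_folder(data_dir, backup_root):
    """Copy data_dir, uncompressed, into a new timestamped folder under backup_root."""
    data_dir = Path(data_dir)
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    target = Path(backup_root) / f"{data_dir.name}_{stamp}"
    # copytree refuses to overwrite an existing folder, so an older
    # backup is never silently replaced.
    shutil.copytree(data_dir, target)
    return target
```

A call might look like `backup_folder(r"C:\My Toolbox Project", r"E:\Backups")`, where both paths are hypothetical. The copy is a plain folder, so recovering a file needs nothing more than a file manager.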
Interlinear text
• Many different approaches
– Distribution of three types of lexical information:
dictionary, wordform inventory, morpheme glossary
– Automated vs. manual parsing
– Different transcription systems (practical vs. technical)
– Different encodings (legacy vs. Unicode)
– Glosses/translations at how many levels: sentence/clause,
word, morpheme
– Glosses/translations in multiple languages
– Other fields: grammar, notes, etc.
Interlinear text (demo)
• Standard Toolbox (e.g. Start New Project):
– Dictionary and morpheme glossary in one file
– Automated parsing
– Single transcription system (Unicode)
– 4+2 lines:
• Aligning: text, morphological parse, morpheme gloss,
part-of-speech (word class)
• Free: free translation, notes
Interlinear text (demo)
• Alternate approach (SIL-Mexico)
– 3 lexical files: dictionary, wordform inventory,
morpheme glossary
– Manual parsing
– Practical and technical
– Glosses at word and morpheme level, plus free
translation, and in both English and Spanish
Interlinear text: database
structure
• Structure of an interlinear text
– Metadata
– Units
• ID (reference) line, usually \ref
• One or more “bundles” of aligning lines (lines wrap
automatically)
• Freeform fields: free translations, notes, etc.
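To make the unit structure concrete, here is a rough sketch that sorts the lines of one unit into the three parts listed above. The sample words are invented, and the marker sets (\tx, \mb, \ge, \ps as aligning lines; \ft, \nt as freeform fields) are assumptions about a typical setup, not something Toolbox dictates.

```python
ALIGNING = {"tx", "mb", "ge", "ps"}  # assumed aligning-line markers
FREEFORM = {"ft", "nt"}              # assumed freeform markers

# Invented sample unit in standard format (one backslash marker per line).
SAMPLE = r"""\ref Demo 001
\tx katani lom
\mb kata -ni lom
\ge word -PL big
\ft big words
"""


def parse_unit(text):
    """Split one interlinear unit into its ID line, aligning lines, and freeform fields."""
    unit = {"ref": None, "aligning": [], "freeform": []}
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue  # ignore blank or continuation lines in this sketch
        marker, _, value = line[1:].partition(" ")
        value = value.strip()
        if marker == "ref":
            unit["ref"] = value
        elif marker in ALIGNING:
            unit["aligning"].append((marker, value))
        elif marker in FREEFORM:
            unit["freeform"].append((marker, value))
    return unit
```

Calling `parse_unit(SAMPLE)` groups the \tx, \mb and \ge lines into one bundle and keeps the free translation separate.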
Interlinear text: database
structure
• Interlinear text model: Relationship of aligning lines
to each other and to the lexical databases
• Where all this is controlled (Database Properties,
Interlinear tab)
– (demo, using standard setup)
Interlinearizing: manual parsing
• Most people shouldn’t try to get Toolbox to
use automatic parsing!
– The Toolbox parser is not very robust and for most
languages is more trouble than it is worth; few
people succeed.
– Automatic parsing provides no permanent way to
store the correct parse.
– If you use manual parsing, things will import more
smoothly into FLEx.
Interlinearizing: manual parsing
• Instead use a wordform inventory (a.k.a. a parsing
database)
– Reliable parsing at the expense of some busy work (mostly
at first)
– Permanent, organized record of the parse of each
wordform
– Word glosses (for non-technical audience)
– Easier to use technical as well as practical orthography in
the project
– Representation of extended senses and idiomatic phrases
– Stem and grammatical categories for each wordform (for
searching)
– Etc.
Interlinear text: manual parsing
• Every word is listed in the wordform inventory
with its parse.
• We parse all words manually (by entering
them in the wordform inventory) rather than
attempting to get Toolbox to parse them
automatically.
• (Demo: Std, then Mex)
--- End Day 1 ---
June 30 (Day 2) plan
• Feedback so far
• More on Toolbox and FLEx
• Interlinear setup
Feedback
• Your experience so far
– Successes
– Problems
– Questions
– Requests
• Individual consulting: make an appointment
Good reasons to stay with
Toolbox (rather than FLEx)
• Established project, comfortable with Toolbox,
production mode, don’t need extra capabilities
• Older computers (if FLEx runs too slowly)
• Don’t have time or resources to convert to FLEx now
(most people need help from conversion specialists,
plus relearning time)
• Most colleagues still use Toolbox (mutual help)
• Specialized database that doesn’t fit FLEx (e.g.
comparative dictionary)
Interlinear text: standard setup
• Demo of actual steps in setting up a standard
Toolbox project
Implementing manual parsing
• Step 1: Make a new database type (Project,
Database Types)
– Call it something like “wordform inventory”
– Include at least the following markers:
• \wf = wordform
• \mb = morpheme break
• \dt = datestamp
– You may also want to add field for word glosses,
but they can be added later.
Implementing manual parsing
• Step 2: Make a new database, using the
“wordform inventory” database type
– Call it “wordform.db” or something similar
– Set up the template for the wordform database
• In a blank record, make sure you have all of the markers
that you want to appear in every record
• Use Database, Template to save that set of markers to
the template.
• In the \wf field of that record, call that “#pattern for
template” and leave it in the database in case you need
to change the template later
Implementing manual parsing
• Step 3: Make a new database type for texts that will use
manual parsing
– In Project, Database Types, copy the existing Text type to a new type;
call it e.g. “TextWithManualParse”
– Under Interlinear settings for the new type, find the Parse mapping
from \tx to \mb. Click on “Lexicons”. Remove the dictionary from the
list of databases to search. Instead, search for the parse in the
wordform inventory. “Markers to find” should be \wf; “Marker to
output” should be \mb. Choose OK.
– Under “If parse fails”, tell it to “insert into lexicon” and “Output failure
mark”. Do not output original word or root guess.
– Disable word formulas (bottom of the box).
– Leave other settings alone.
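What the Step 3 settings ask Toolbox to do amounts to a plain lookup. The sketch below imitates that behaviour outside Toolbox, assuming the wordform inventory has been read into a dictionary from \wf to \mb. The `***` failure mark and the "missing" list (standing in for "insert into lexicon") are assumptions for illustration, and the words are invented.

```python
FAILURE_MARK = "***"  # assumed failure mark


def parse_line(tx_line, inventory):
    """Build the mb line for a tx line by looking up each word in the inventory.

    Returns the parse line and the list of words that had no entry.
    """
    parses, missing = [], []
    for word in tx_line.split():
        if word in inventory:
            parses.append(inventory[word])
        else:
            # No root guess and no copy of the original word: only the mark.
            parses.append(FAILURE_MARK)
            missing.append(word)
    return " ".join(parses), missing
```

With `inventory = {"katani": "kata -ni"}`, the line `katani lom` yields the parse `kata -ni ***` and reports `lom` as still needing an entry.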
Implementing manual parsing
• Step 4: Make a new text database
– Use File, New to make a new text database.
• Call it “TextsManualParse.itx” or something similar
• Copy text into it (minimum: the \ref and \tx fields) and
(re)gloss as normal
Interlinear text: semi-automatic
parsing
• It is possible to combine the manual and automatic
parsing. There are two ways to do so (which can be
used separately or in combination):
– Approach 1: If parse fails, output original word. That way,
no monomorphemic words need to be added to the
wordform inventory.
– Approach 2: In the parse process, choose SH2-style parse.
List the wordform inventory as the Parse database, and the
main dictionary as the Lexicon. Toolbox will only parse the
word if you don’t have a parse in the wordform inventory.
If it parses something wrong, all you have to do is manually
override the parsing by inserting an entry in the wordform
inventory.
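The order of precedence when both approaches are combined can be sketched as follows. `toy_auto_parse` is a made-up stand-in that strips one known suffix; it is not Toolbox's parsing algorithm, and the words are invented.

```python
def toy_auto_parse(word, suffixes=("ni",)):
    """Hypothetical stand-in for an automatic parser; returns None on failure."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return f"{word[:-len(suffix)]} -{suffix}"
    return None


def parse_word(word, inventory, auto_parse=toy_auto_parse):
    """Wordform inventory first, then the automatic parser, then the word itself."""
    if word in inventory:
        return inventory[word]   # a manual entry always overrides the parser
    guess = auto_parse(word)
    if guess is not None:
        return guess             # Approach 2: parse only when no manual entry exists
    return word                  # Approach 1: output the original word
```

If the parser wrongly splits `katani`, adding `{"katani": "katani"}` to the inventory is enough to override it.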
Interlinear text: semi-automatic
parsing
• Some warnings about this:
– You lose the advantage of easy import to FLEx
later
– You can’t do word glossing (because you won’t
have any place to store the word gloss for
wordforms that are automatically parsed)
– I have no experience doing this, so I don’t know
what pitfalls await you. Be ready to experiment,
and be sure to back up your data before you try!
Interlinear text: Other fields in
wordform inventory
• Other fields to put in a wordform inventory
(and optionally copy to the texts)
– Word-level glosses
• Translation equivalents without technical terms
• Extended senses of words
• Contextualized
• Meanings of fixed expressions
– Citation form
– Notes
Interlinear text: Adding fields to
texts
• How to add to the text model (Interlinear settings)
– In database properties for the text file type, add a new
marker that can be used for the new aligning line
– Also, change the interlinear settings so that the new
marker is included. Be sure to position all word-level
annotations before all morpheme-level annotations. The
line that parses words into morphemes should be at the
transition point. (If you can’t get things in the right order,
close Toolbox and open the .typ file for the text files with a
text editor and move lines around. Carefully! Make a
backup first!)
Interlinear text: Adding fields to
texts
• Then, regloss the text.
– It may be necessary to delete all existing lines in order to
get the new lines in the right place.
– In the process, you’ll probably encounter a lot of “Lookup
failure” errors. Usually this means that there is in fact a
record for the word, but there is a field missing so there’s
nothing to copy back to the text.
Lexical databases
• How do we use the standard setup for a full
dictionary?
Lexical databases
• Types of information included in a lexical database
– Phonological, semantic, grammatical, sociolinguistic, anthropological, etc.
– For different audiences: Language communities, linguists, non-linguist outsiders
– Dictionary and text annotation (glossing)
– Sometimes the same information in multiple formats
• Can grow to include a hundred bits of information for each entry (# of
fields per record)
Lexical databases
• The Multi-Dictionary Formatter (MDF) system
– Set of fields and standard format markers for
producing dictionaries of different types
– Software for converting it to formatted output
(either within Toolbox or via Lexique Pro)
– Contains fields both for a published dictionary and
for morpheme-glossing
• Tip: don’t use it for wordforms/parsing—do that in a
separate database
Lexical databases
• Mature, widely-tested, applicable to a variety
of different situations and product types
• A de facto “standard”
– Toolbox and LexiquePro are preconfigured for it
– Imports easily into FLEx
Lexical databases, templates
• Suggested routine MDF fields to add to the template
for the dictionary database (see MDFields19a.txt for
further info)
– \lx, \ps
– \ge, \re, \xv, \xe
– \cf, \nt, \dt
• How to do it (demo)
– Add fields to a single typical record
– Save to the template
– Make a new record, call it “#pattern for template”
– Adjust as needed whenever needed, then save it to the
template
Lexical databases, templates,
mass editing
• Changes to the template only affect new
records
– To fix the old records, you have to edit them
– Mass editing in an external editor (demo)
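One kind of mass edit is adding a field that the new template includes to older records that lack it. The sketch below does this on the text of a database file, assuming records start with \lx and the file has already been read into a string. The function name is illustrative. Work on a backup copy, with Toolbox closed.

```python
import re


def add_missing_field(db_text, marker, record_marker="lx"):
    """Append an empty field to every record that does not already contain it."""
    # Split just before each line that starts a new record; any file header
    # ahead of the first record stays in its own chunk.
    chunks = re.split(rf"(?m)^(?=\\{re.escape(record_marker)}\b)", db_text)
    fixed = []
    for chunk in chunks:
        is_record = chunk.startswith(f"\\{record_marker}")
        has_field = re.search(rf"(?m)^\\{re.escape(marker)}\b", chunk) is not None
        if is_record and not has_field:
            chunk = chunk.rstrip("\n") + f"\n\\{marker}\n\n"
        fixed.append(chunk)
    return "".join(fixed)
```

Running it twice gives the same result as running it once, so it is safe to repeat after adding more records.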
--- End Day 2 ---
July 1 (Day 3) plan
• Feedback so far
• Special characters and Unicode
• Learning more about MDF
• LexiquePro
• Audio and video
• Wordlist and concordance
• Jumps
Feedback
• Your experience so far
– Successes
– Problems
– Questions
– Requests
• Individual consulting: make an appointment
• What important topics haven’t we covered
yet?
Unicode and language
encodings
• What do you need to know?
– General understanding of how special characters work on a computer?
• Font, keyboard, codepoint, encoding…
• What is Unicode, UTF-8 vs. UTF-16, composed vs. decomposed…
– “Legacy” vs. Unicode?
• Advantages/disadvantages
• Converting from legacy to Unicode
• Using Toolbox for each
– What fonts and keyboards are available?
– Set up language encodings in Toolbox?
• Sort orders, punctuation, case pairs
• Installing fonts, keyboards, Unicode
– Cautions about using Unicode with Toolbox?
– Troubleshooting specific problems?
Learning more about MDF
• Marker properties in Toolbox: right-click on a
marker (demo)
• MDFields19a.txt: accessible in Toolbox (demo)
• Making Dictionaries (MDF_2000.pdf)
LexiquePro
• Viewer for Toolbox dictionary files
– Uses the same data files as Toolbox
– Has its own settings files, separate from Toolbox
• Good for distribution and creating formatted output
(.doc, .htm)
• Can also be used to edit the file
– Changes made in either program are visible in the other
– WARNING: Don’t use at the same time as Toolbox!
– Limited capabilities compared to Toolbox
LexiquePro
• Setting up LexiquePro with a Toolbox MDF
database (demo)
• Settings files added by LexiquePro
Graphics, Sound, Video
• Both Toolbox and LexiquePro can have links to
outside files
– \pc picture
– \sf Sound file
– \ff Video or other external file
• Tip: Keep them all in the same folder close to
your data (e.g. subfolder “sup”)
Wordlist and concordance
• (demo of wordlist and concordance)
• To work, you have to set up Text Corpora
– What files to be searched
– What fields in those files
– What fields to use for referencing (one word
selected from each field)
Jumps
• Right-click (Alt-J) on a word to jump to another
database
– Jumps from the selection if something is selected
– Otherwise takes a whole word
– Tip: All characters must be listed correctly in the sort
order!
• Jump paths have to be set up first in Database
Properties (demo, esp. with interlinear text)
• “Jump target”: open in an existing window (rather
than a new window)
Range sets, data properties,
consistency checking
• Toolbox allows you to do anything with your data—
that’s not always good
• Also provides tools that allow you to limit your
creativity when appropriate, i.e. to set rules and find
violations of them (demo)
– Range sets (Marker properties)
– Data properties (Marker properties)
– Data links (Database properties, Jump Path Properties)
– Consistency checking (Checks menu)
– Interlinear check (Checks menu)
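The idea behind a range set is "only these values are allowed in this field". A rough outside-Toolbox equivalent, useful for a quick audit of a file, might look like the sketch below. The function name and the sample part-of-speech labels are assumptions.

```python
def check_range_set(db_text, marker, allowed):
    """Return (line_number, value) for each field whose value is not in the range set."""
    problems = []
    prefix = f"\\{marker} "
    for number, line in enumerate(db_text.splitlines(), start=1):
        if line.startswith(prefix):
            value = line[len(prefix):].strip()
            if value not in allowed:
                problems.append((number, value))
    return problems
```

For example, with the allowed set `{"n", "v", "adj"}`, a record containing `\ps adj.` (with a stray period) would be reported along with its line number.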
--- End Day 3 ---
July 2 (Day 4) plan
• Feedback so far
• Cautions about Unicode
• Sort order setup and Find/Replace
• Linguistic issues
– Idioms
– Derivational morphology
– Names
– Plants and animals
– Morphophonemics
– Nonlinear morphology (incl. reduplication)
• Comparative dictionaries
Feedback
• Your experience so far
– Successes
– Problems
– Questions
– Requests
• Individual consulting: make an appointment
Cautions about Unicode
• Guide to help you decide when and how to
switch to Unicode:
http://scripts.sil.org/UTConvert2Unicode
• All or none (don’t mix Unicode with legacy
encodings)
• Don’t just check the Unicode box in Language
Encodings—you need to convert the data files
and settings files first
• Composite characters vs. separate diacritics
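The last point matters because a composite character and a base letter plus a separate diacritic look the same on screen but are different data, so lookups, sorting and Find/Replace can miss one form when given the other. A small sketch using Python's standard `unicodedata` module shows the difference and how to convert consistently; the helper names are illustrative.

```python
import unicodedata

composed = "\u00e1"     # á as a single codepoint
decomposed = "a\u0301"  # a followed by COMBINING ACUTE ACCENT

# The two strings display alike but do not compare equal.
same_without_normalizing = composed == decomposed


def to_composed(text):
    """Convert to composite characters where Unicode defines them (NFC)."""
    return unicodedata.normalize("NFC", text)


def to_decomposed(text):
    """Convert to base letters plus separate diacritics (NFD)."""
    return unicodedata.normalize("NFD", text)
```

Whichever form is chosen, the same choice needs to apply to the data, the keyboard and the sort order.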
Sort order setup
• Editing a sort order (demo)
• Implications for Find/Replace and Jumps
Linguistic problems in text
glossing
• Multi-word idioms and names
– Join the words together with _ on the text line
– Give one word gloss to the whole combination
– Separate them with spaces on the morpheme
break line, and gloss each piece separately
Linguistic problems in text
glossing
• Derivational morphology
– Suggest breaking off only the most productive and regular
derivational morphology
– Treat less productive derivational morphology as if it were
part of the base of a verb, i.e., only break things down to
the stem, not all the way to the root
– Why?
• It makes texts much more readable
• The stem is the lexical unit that is relevant to the context, not its
internal structure
• Full details of a derived word’s structure can be given in the
lexicon
Linguistic problems in text
glossing
• Names
– One word or more than one?
– Is the name analyzable into morphemes?
– Is there an equivalent name in the glossing
language?
Linguistic problems in text
glossing
• Plants and animals
– In general, gloss with common names, e.g. genus
labels
– Aim for the appropriate level of specificity (e.g.
don’t use ‘bird’ for a specific type of bird)
– Specify the scientific name in a note, but only if
the identification is done by an expert
– Be careful of names in English that are used for
diverse organisms (e.g. badger, ironwood)
Linguistic problems in text
glossing
• Morphophonemics
– In the \mb field, suppress morphophonemic
variation but retain suppletion, e.g. use a
phonological underlying form
– If processes apply across word boundaries, 2
options for the \tx line:
• Don’t write the changes in \tx; explain them elsewhere
• Do write them, and include alternate forms in the
wordform inventory so all variants receive the same
word gloss and parse
Linguistic problems in text
glossing
• Nonlinear morphology
– Use abstract underlying forms, as if they were
linear
– Use < > or some other convention to flag the
nonlinearity
– Explain the facts in notes
Other uses for Toolbox
• Comparative dictionaries
• Specialized analytical research
• Ethnologue
• Address list, administrative database
Comparative dictionaries
• Organize either by cognate sets or a common
gloss
• Separate fields for each language
• (demo)
Feedback
• Please fill in the course evaluation sheets that
InField has provided.
• Please also send feedback on this workshop
(especially suggestions for the future) to me at:
[email protected]
--- End Day 4 ---
============================
Other topics
• After this point in the file are topics that may
be of interest to participants (left over from
other workshops that I have given) but which
we didn’t cover in class this time.
Interlinear text: Overview
Procedure
• Prepare text for glossing (import, break into
units, number the units)
• Pre-load words/morphemes into the lexical
database (if desired)
• Initial glossing
• Revise lexical database with new
words/morphemes/glosses as needed, and
regloss individual words
Interlinearizing: database
structure
• Special fields in MDF for glossing texts
– Glosses vs. definitions in lexicon
– \lx (lexeme) vs. \lc (citation form) fields
Interlinearizing: How to do it
• Either put one text per file or one per record
(multiple texts per file)
– If all in one file, use the existing shell “itx.db”
– If in separate files, start a new file with File, New
and choose “EOPASInterlinear” as the database
type.
• New files/records will contain blank metadata
markers and a bunch of shell units
Interlinearizing: How to do it
• Prepare text for glossing
– Type or paste text into the \tx marker(s). Two
approaches
• Type or paste one sentence or clause per unit. Add
markers for new units as needed.
• Paste the whole text into one marker, use Tools,
Break/Number Text to set up for glossing.
• Delete excess units at the end.
• Add free translations and notes (inserting
markers as needed) to each unit.
Interlinearizing: How to do it
• Vital step: all symbols must be in appropriate
language encoding
– If glossing isn’t working correctly (words being
split up or not being found), look for symbols that
aren’t yet listed in the sort order
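Hunting for unlisted symbols by eye is tedious. A small script can list every character in a text that is missing from the characters you have entered in the sort order. This is a simplified sketch: it treats the sort order as a flat string of single characters and ignores digraphs and punctuation settings, and the function name is illustrative.

```python
def unlisted_characters(text, sort_order_chars):
    """Return the non-space characters in text that are not in the sort order."""
    listed = set(sort_order_chars)
    return sorted({ch for ch in text if not ch.isspace() and ch not in listed})
```

For instance, checking the invented line `kata ŋom` against a plain a-z sort order reports `ŋ` as the character still to be added.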
Interlinearizing: How to do it
• ALT+I to add glosses.
• *** means the word isn’t in the lexical
database yet, or there isn’t a gloss yet.
• Entering new glosses in the database:
– Jump to add new entries to glossing database
– Return From Jump to retry CTRL+R
– Reglossing with ALT+I
Interlinearizing: How to do it
• Multiple glosses for one item
– List both glosses in lexicon, separated by semicolons
– If homophones, make two entries in lexicon
Interlinearizing: How to do it
• Revising
– Make changes first in the glossing databases, then
use ALT+I to copy to texts
– Revise by reading through the glossing databases
for consistency
– Revise by reading the texts, e.g. comparing glosses
and free translations
Interlinearizing: How to do it
• Verifying ALT+C
– How to do it
– Will work better if every word is listed in the
wordform database, even those that don’t divide
further into morphemes
Interlinear text
• EOPAS: EthnoER Online Presentation and
Annotation System
– See Schroeter and Thieberger 2006:
http://www.itee.uq.edu.au/~eresearch/papers/2006/EOPASpaperRS-NT.pdf
Learning more about
Toolbox
• Toolbox Help system
• Toolbox Self-Training.doc (accessible from
Start Menu on Windows)
• Toolbox website
http://www.sil.org/computing/toolbox/index.htm
• Discussion list
http://groups.google.com/group/ShoeboxToolbox-Field-Linguists-Toolbox
Adjusting settings
• Language encodings (demo)
– Appearance: font, size, style, color
– Sort orders
• Primary order
• Secondary order
• Ignore characters
– Case pairs and punctuation
Adjusting settings
• Marker properties (demo)
– Language encoding for a field
– Appearance of individual fields
– Name or description of a marker
– Caution: Be careful of changing the marker itself—
you can easily make things not work right by doing
so (and you may not discover that you’ve broken
things until weeks or months later, when you’ve
forgotten what you did).
Adjusting settings
• Margins and text-wrapping (demo)
– Text wrapping is only semi-automatic
– Set margin based on width of window (Database
menu)
– Reshape
• Automatically (Database, Auto Wrap)
• Single field (Database, Reshape or SHIFT+F5)
• Whole database (Database, Reshape Entire File)
– Suppress reshaping for certain markers (Marker
Properties, Data Properties, No Word Wrap)
Adjusting settings: be
patient…
• More advanced adjustments to be covered
later
– Adding new field markers to hold new types of
information
– Making a whole new database type (e.g. list of
inflected wordforms, comparative dictionary,
bibliography)
– Setting up a new project, especially if it requires
any customization (which most do)
Revising and refining data
• Find and replace
• Sorting
• Filtering
Formatted output
• File, Print: straight image of what you see on
screen (no reformatting except page breaks)
• Formatting interlinear text: no standard “off-the-shelf” solution,
requires custom setup and programming
Formatted output
• MDF output to Microsoft Word RTF
– Different types of output for different audiences
– Omitting records
– Omitting fields
Formatted output
• LexiquePro (http://www.lexiquepro.com/download.htm).
– Viewer with limited editing capability (demo)
– Best to close Toolbox before using it (only allowed to edit with one
program at a time).
– Can redistribute it bundled with your dictionary
– Export to Word RTF or to web page (HTML)
Formatted output
• XML: useful mainly for techies, but will make many things
possible for the rest of us
– Toolbox can export in XML format
– How XML differs from SIL standard format
– (demo with ZpChi data)
– Tools for manipulating XML: editors, stylesheets, and
XSLTransformations
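To give a feel for how XML differs from standard format, the sketch below turns one standard-format record into an XML element, with each backslash marker becoming a tag. The `entry` wrapper and the one-tag-per-marker layout are assumptions for illustration; this is not the schema that Toolbox's own XML export produces.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record_text):
    """Convert one standard-format record to a flat XML string."""
    entry = ET.Element("entry")
    for line in record_text.splitlines():
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            # Each field becomes a child element named after its marker.
            ET.SubElement(entry, marker).text = value.strip()
    return ET.tostring(entry, encoding="unicode")
```

In standard format a field runs until the next marker; in XML every field has an explicit opening and closing tag, which is what makes it easy for stylesheets and other tools to process.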
Linguistic problems in text
glossing
• Adjusting text units: (demo in Seri)
– Splitting
– Combining
– Renumbering
Linguistic problems in text
glossing
• Homophonous morphemes (demo in Seri)
– Include separate records in the lexical database
for homophones
– Toolbox will offer a choice when it glosses
– In a particular word, however, it is often clear
which morpheme is involved, so there should be no
need to choose
– This can be specified as part of the parse, as a
“forced gloss”
Integration with other
software
• ELAN: import/export
• Transcriber: import
• XML: export
• RTF (Word document): export
• Lexique Pro (viewer and formatter, plus simple
editing)
Structure behind the scenes
(settings)
• Database types
– Fields
• Marker
• Language used
• Other information
– Relationships between databases (“jumps”)
– Instructions for processing interlinear text
Structure behind the scenes
(settings)
• Language encodings (demo)
– Font and keyboard
– Sort order, digraphs, case equivalents
– Letters vs. punctuation
– Natural classes for phonological searching
Structure behind the scenes
(settings)
• Projects (workspace) (demo)
– Arrangement of windows on the screen
– What database is open in each window
• You can have more than one project for the
same set of database files
Special characters
• Whenever possible, use Unicode fonts, rather than custom fonts
designed for specific languages (see further discussion later).
– Unicode is a newer system that allows thousands of characters in a
single font. The computer world is transitioning away from custom
fonts for specific languages to this one common system that covers all
languages.
Special characters
• Some common Unicode fonts that have many Latin characters
for minority languages
– Charis SIL
– Doulos SIL
– On Windows Vista and later: Times New Roman, Arial, et al.
– Lucida Sans Unicode
– Arial Unicode MS
Special characters: Keyboarding
• Use Character Map utility (Start, Run, “charmap.exe”) for
keyboarding unless you have something better.
• Some people will be able to use one of the standard Windows
keyboards, such as “US International”.
• Others can have a custom keyboard designed using Microsoft’s
Keyboard Layout Creator or Tavultesoft Keyman.
Special characters: How to cope
when you have less than the
ideal
• Use practical orthography when possible, not IPA or other
phonetic transcription
– Reduces the number of special characters required
– Possibly modify it to make it more systematic (e.g. use k instead of
c/qu to make morpheme shapes more consistent)
– If you need IPA too, you will need to use Character Map or have a
custom keyboard designed for you.
Special characters: How to cope
when you have less than the
ideal
• Substitute characters or digraphs, e.g.
– Use @ for ə, S for ʃ
– Use :o for ö, 'u for ú, etc.
• In other words, make do until you can get someone to set you
up properly
– Once you have a good keyboarding system, you can do a search and
replace to “correct” your makeshift transcriptions to the correct ones.
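The later clean-up can be done in one pass with a replacement table. The sketch below uses the substitutions from this slide; the function name is illustrative. Apply it only to vernacular-language fields, since a blanket replacement would also change every ordinary capital S or @ in glosses and notes.

```python
REPLACEMENTS = {"@": "ə", "S": "ʃ", ":o": "ö", "'u": "ú"}


def fix_makeshift(text, replacements=REPLACEMENTS):
    """Replace makeshift transcriptions with the intended characters."""
    # Handle longer sequences first so digraphs are replaced as a whole.
    for old in sorted(replacements, key=len, reverse=True):
        text = text.replace(old, replacements[old])
    return text
```

For example, the invented string `S@n 'ut:o` becomes `ʃən útö`.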
Special characters: Using legacy
custom fonts (pre-Unicode)
• Viable option for now if you already have everything you need
– all characters available in the custom font
– functioning keyboarding system
• Possibly the best option if the language community still uses the
same system
• Eventually you (and the community) will need to switch to
Unicode
Special characters: Using legacy
custom fonts with Toolbox
• Need to modify the standard Toolbox project setup. All
language encodings must be adjusted:
– Not Unicode
– Use your custom fonts and keyboards
• Don't mix custom encodings with Unicode in the same project!
– All encodings should be set up for Unicode, or none of them should
be.
Special characters
• “Special character”: anything other than what
is normally printed on the keys of an English
keyboard
• It’s an ethnocentric (linguocentric) definition,
but it reflects the way computer technology
developed and the problems that non-English
characters can cause.
Special characters
• Every character (special or ordinary) is
represented internally as a number, called its
“codepoint”.
– Abstract name: LATIN CAPITAL LETTER A WITH ACUTE (Á)
– Codepoint (hexadecimal): 00C1
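The relationship between a character, its codepoint and its abstract name can be checked directly in Python, whose standard `unicodedata` module carries the Unicode character names. This is offered only as a way to explore codepoints on your own.

```python
import unicodedata

ch = "\u00c1"                       # the character Á
codepoint_hex = f"{ord(ch):04X}"    # its codepoint as four hex digits
name = unicodedata.name(ch)         # its abstract Unicode name

print(codepoint_hex)  # 00C1
print(name)           # LATIN CAPITAL LETTER A WITH ACUTE
```

The decimal value of the same codepoint is 193, the number used in the Windows ALT+0193 keystroke.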
Special characters
• A font contains an image (the visible letter)
used to display/print each codepoint.
– Example: codepoint 00C1 (LATIN CAPITAL LETTER A WITH ACUTE) displayed in Times New Roman: Á
Special characters
• Different fonts provide different images for a
given character, but they are all recognizably
the same character.
– Example: codepoint 00C1 (LATIN CAPITAL LETTER A WITH ACUTE) displayed in Arial, Times New Roman, and Comic Sans MS: Á Á Á
Special characters
• An electronic “keyboard” provides a way of
typing the character.
– Example: on the US International keyboard, the keystrokes ',A produce codepoint 00C1 (LATIN CAPITAL LETTER A WITH ACUTE), displayed as Á in Arial, Times New Roman, or Comic Sans MS
Special characters
• There can be more than one way to type the
same character, depending on the keyboard
used.
– Ways to type codepoint 00C1 (LATIN CAPITAL LETTER A WITH ACUTE):
• BU Keyboard: A,CTRL+'
• US International: ',A
• Windows built-in: ALT+0193
• Character map: pick from a chart with the mouse
– Displayed as Á in Arial, Times New Roman, or Comic Sans MS
Special characters
– Keyboard: maps keystrokes to codepoint
– Font: maps codepoint to image
– There are (potentially) many options for each
– Example (LATIN CAPITAL LETTER A WITH ACUTE): keystrokes (A,CTRL+' on BU Keyboard; ',A on US International; ALT+0193 by the Windows built-in method; or Character map) → codepoint 00C1 → image Á (Arial, Times New Roman, Comic Sans MS)
Special characters
• Most important issue: what codepoint
represents each character
– Secondary: how it is typed, what font is used
– These can be changed without disturbing the
data, as long as they are designed with the same
characters and codepoints in mind
Special characters
• Encoding: a system for representing a set of
characters with codepoints
– If you change encodings, you must also change
keyboards and fonts
• Wrong font: data is displayed incorrectly
• Wrong keyboard: what you type comes out wrong
– When we talk of custom fonts and keyboards,
what is really significant is the encoding that
underlies them, not the font or keyboard itself.
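A short Python sketch (not from the original slides) of why the encoding matters: the same stored number stands for different characters under different encodings. Python's codec names `cp1252` (Windows ANSI, Western European) and `cp1251` (Windows Cyrillic) are used here as examples.

```python
stored = bytes([0xC1])                   # one stored value: C1 hexadecimal

as_western = stored.decode("cp1252")     # read as Windows ANSI (Western European)
as_cyrillic = stored.decode("cp1251")    # read as Windows Cyrillic

print(as_western, as_cyrillic)
```

Nothing in the stored data has changed between the two readings; only the assumed encoding differs, which is what happens when a document is viewed with a font built for a different encoding.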
Special characters
• Common encodings
– Windows ANSI
• about 220 characters used in major Western European languages
• standard in all Windows fonts from the start (ca. 1990)
– Standard encodings for particular languages (ISO standards)
• Cyrillic
• Japanese
• Arabic
Special characters
• Custom encodings for specific languages (“custom” or “legacy” fonts):
– about 220 custom characters
– often based on Windows ANSI with some substitutions (a given
codepoint represents a custom character rather than what it would
normally represent in Windows ANSI)
Special characters
• Unicode
– Over 100,000 characters already, with more to come; a little over a
million codepoints are possible
– Strong support by the entire computer industry
– Intended to handle all the world’s languages in one common system,
without conflicts
– A genuine Unicode font might not have the character you need for a
particular codepoint, but it will never have the wrong character
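The "little over a million" figure can be checked with a few lines of Python. This sketch relies on the fact that the Unicode codespace is divided into 17 planes of 65,536 codepoints each; the constant names are illustrative.

```python
import sys

PLANES = 17                      # the Unicode codespace has 17 planes
CODEPOINTS_PER_PLANE = 0x10000   # 65,536 codepoints in each plane

total_codepoints = PLANES * CODEPOINTS_PER_PLANE
highest_codepoint = sys.maxunicode   # highest codepoint Python accepts

print(total_codepoints, hex(highest_codepoint))
```

Not every one of these codepoints can hold a character (some are reserved for technical purposes), so the number of possible characters is slightly smaller than the total.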
Special characters
• Quick guided tour of Unicode
• (demo using Insert Symbol in PowerPoint)
Special characters
• Unicode is the only viable long-term choice
– Large inventory of characters—practically anything
a linguist would ever want
– Everyone worldwide can use the same system (no
problems with data getting garbled by using the
wrong fonts)
– Custom/legacy fonts may cease to work with
future software
To make Unicode work
• Unicode-capable operating system and
software
– Windows 2000, XP, Vista
– Recent versions of Mac OS X and Linux
– Toolbox (NB: not Shoebox) and FLEx
– Most newer commercial software and much
shareware/freeware
– See partial list at
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=UnicodeSupport
To make Unicode work
• Unicode fonts that contain the characters you
need
– Arial Unicode MS (comes with Microsoft Office; some
versions lack characters added to Unicode in the
last few years)
– Lucida Sans Unicode (standard in Windows)
– Doulos SIL and Charis SIL (http://scripts.sil.org)
– standard fonts in Windows Vista
– Other sources, see
http://www.sil.org/computing/fonts/Lang/archives.html
To make Unicode work
• A way to input the characters you need (see
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=InputResources)
– Character map utility (built into Windows) and Insert Symbol (built into
Microsoft Office)
– Standard Windows keyboards, e.g. US International
• It is helpful to use them together with the Windows On-Screen Keyboard (Start, Programs,
Accessories, Accessibility), which will show you how to type each character
– Custom Windows keyboards, made with Microsoft’s Keyboard Layout Creator
(SIL has one for IPA characters)
– Tavultesoft Keyman (http://www.tavultesoft.com)
Legacy encodings vs. Unicode
• Some people have older custom-encoded data that should be
converted to Unicode
– To decide whether and when to change from legacy encodings to
Unicode, see advice at
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=UTConvert2Unicode
– For advice on how to proceed, see
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=Conversion
Legacy encodings vs. Unicode
• When copying from a document that uses custom fonts into one that uses
Unicode (or vice versa), special characters may get garbled.
– If the problem is infrequent, just fix them manually.
– If the problem happens frequently, have a techie friend get the Unicode
conversion tools available at http://scripts.sil.org and set them up for you to use
when you need to convert data.
– Better yet: convert all your data at once and leave custom fonts in the past.