Synonyms & Taxonomies
Thesaurus Design for Information Architects
an ACIA Seminar
by Peter Morville & Samantha Bailey
1
Introductions
Peter Morville ([email protected])
• CEO, Argus Associates
• Co-author, Information Architecture
for the World Wide Web
• Director, ACIA
• LIS background
• Fortune 500 consulting
2
Introductions
Samantha Bailey ([email protected])
• VP of Operations, Argus Associates
• LIS background
• Fortune 500 consulting
• VC experience
3
Seminar Outline
I. Thesauri in Context
II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics
Instructional Methods
Exercises, Quizzes, Discussions, Breaks
4
Our Approach
Assumptions
• Understanding of IA Basics
• Interest in Thesauri and the Web
Philosophy
• Reality is Important
• Technology has Limitations
• Success takes Time
• Tension can be Healthy
5
Thesauri in Context
What is IA?
The art and science of structuring and
organizing information systems to
help people achieve their goals.
6
Thesauri in Context
An Ecological Approach
Business Context
Content
Users
Books: Information Ecologies by Bonnie Nardi and
Information Ecology by Thomas Davenport
7
Thesauri in Context
IA From Top to Bottom
Top-Down
portal
strategy
hierarchy
primary path
Bottom-Up
sub-site
objects
metadata
multiple paths
portal
Object X
Name:
Product Category:
Topic:
Stale Date:
Author:
Security:
local subsites
(HR, Engineering, R&D…)
8
Thesauri in Context
Where Does IA Fit?
http://www.jjg.net/ia/elements.pdf
The Elements of
User Experience
Jesse James Garrett
9
Thesauri in Context
What is Vocabulary Control?
Controlled Vocabulary
A list of preferred and variant terms.
A subset of natural language.
Preferred | Variants | Authority
AZ | Ariz, Arizona, 85XXX | US Postal Service
IBM | Intl Bus Machines, Big Blue | NY Stock Exchange
Nyctalopia | Night blindness, Moon blindness | National Library of Medicine
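The mapping from variants to preferred terms can be sketched as a simple lookup. This is a minimal illustration in Python, not a production design: the helper names `build_entry_index` and `to_preferred` are hypothetical, matching is assumed to be case-insensitive, and the term data is taken from the examples listed above.

```python
# Hypothetical sketch: map variant terms to their preferred term.
CONTROLLED_VOCABULARY = {
    "AZ": ["Ariz", "Arizona"],
    "IBM": ["Intl Bus Machines", "Big Blue"],
    "Nyctalopia": ["Night blindness", "Moon blindness"],
}


def build_entry_index(vocabulary):
    """Return a dict from lowercased entry term to preferred term."""
    index = {}
    for preferred, variants in vocabulary.items():
        index[preferred.lower()] = preferred
        for variant in variants:
            index[variant.lower()] = preferred
    return index


ENTRY_INDEX = build_entry_index(CONTROLLED_VOCABULARY)


def to_preferred(term):
    """Look up a user's term; return None when it is not in the vocabulary."""
    return ENTRY_INDEX.get(term.strip().lower())
```

A term outside the vocabulary returns None, which is the point at which a real system might fall back to full text search.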
10
Thesauri in Context
Why Control Vocabulary?
Language is Ambiguous
• Synonyms, homonyms, antonyms,
contronyms, etc.
In the Oxford English Dictionary:
• “Round” takes 7 ½ pages or 15,000
words to define.
• “Set” has 58 uses as a noun, 126 as a
verb, 10 as an adjective.
The Mother Tongue:
English & How It Got That Way
by Bill Bryson
11
Thesauri in Context
Why Control Vocabulary?
Users
Communication Chasm
Documents and Applications
Example: Personal Digital Assistant
Synonyms: Handheld Computer
"Alternate" Spellings: Persenal Digitel Asistent
Abbreviations / Acronyms: PDA
Broader Terms: Wireless, Computers
Narrower Terms: PalmPilot, PocketPC
Related Terms: WindowsCE, Cell Phones
So Your Users Don’t Have To!
12
Thesauri in Context
Semantic Relationships
Types
1. Equivalence
2. Hierarchical
3. Associative
Example (preferred term: Vermont)
1. Equivalence: Vt (Variant), Green Mountain State (Variant)
2. Hierarchical: United States (Broader), Burlington (Narrower)
3. Associative: Skiing (Related), Maple Syrup (Related)
13
Thesauri in Context
Levels of Control
Simple to Complex
(Vocabularies) Synonym Rings, Authority Files, Classification Schemes, Thesauri
(Relationships) Equivalence, Hierarchical, Associative
14
Thesauri in Context
What is a Thesaurus?
Traditional Use
• Dictionary of synonyms (Roget’s)
• From one word to many words
Information Retrieval Context
• A controlled vocabulary in which
equivalence, hierarchical, and
associative relationships are identified
for purposes of improved retrieval
• Many words to one concept
15
Thesauri in Context
Terminology
Preferred Terms (UF subject headings, descriptors)
SN Scope Notes
UF Used For
BT Broader Term
NT Narrower Term
RT Related Terms (“See Also”)
Variant Terms (UF non-preferred, entry terms)
USE (“See”)
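One way to picture these tags is as fields on a term record. The sketch below is an assumption for illustration only: the `TermRecord` class and its method names are hypothetical, and the Vermont data is borrowed from the semantic relationships example earlier in the seminar.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    """Hypothetical record for one preferred term and its relationships."""
    preferred: str
    scope_note: str = ""
    used_for: list = field(default_factory=list)   # UF: variant terms
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    related: list = field(default_factory=list)    # RT

    def display(self):
        """Render the record in a tagged, one-relationship-per-line layout."""
        lines = [self.preferred]
        if self.scope_note:
            lines.append(f"  SN {self.scope_note}")
        for tag, terms in (("UF", self.used_for), ("BT", self.broader),
                           ("NT", self.narrower), ("RT", self.related)):
            for term in terms:
                lines.append(f"  {tag} {term}")
        return "\n".join(lines)

    def use_references(self):
        """Generate the reciprocal USE entries for each variant term."""
        return [f"{variant} USE {self.preferred}" for variant in self.used_for]


vermont = TermRecord(
    preferred="Vermont",
    used_for=["Vt", "Green Mountain State"],
    broader=["United States"],
    narrower=["Burlington"],
    related=["Skiing", "Maple Syrup"],
)
```

Calling `vermont.display()` lists the UF, BT, NT, and RT lines under the preferred term, and `vermont.use_references()` yields the matching USE entries for the variants.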
16
Thesauri in Context
Types of Thesauri
Type | Used in Indexing | Used in Searching
Classic Thesaurus | Yes | Yes
Indexing Thesaurus | Yes | No
Searching Thesaurus | No | Yes
Natural Language | No | No
17
Thesauri in Context
Visibility
Classic Use
• Both indexers and searchers explicitly
map natural language terms onto
controlled vocabularies
Web Environment
• Able to choose level of visibility
(implicit use, thesaural browsers)
• Opportunity to educate users
(terminology, associative learning)
18
Thesauri in Context
Niche Applications (hypothetical example)
Product Catalog:
multiple views enabled by thesaurus
Technical Support Database:
entry vocabulary maps
problems to solutions
Searching Thesaurus:
implicit term explosion
manages synonyms
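Implicit term explosion in a searching thesaurus can be sketched as query expansion over synonym rings. The Python below is a minimal sketch under stated assumptions: the function names are hypothetical, matching is on the whole query string, and the ring reuses the Personal Digital Assistant example from earlier.

```python
# Hypothetical synonym rings: each set holds terms treated as equivalent.
SYNONYM_RINGS = [
    {"personal digital assistant", "handheld computer", "pda"},
]


def expand_query(query, rings=SYNONYM_RINGS):
    """Return the query plus every term that shares a synonym ring with it."""
    normalized = query.strip().lower()
    expanded = {normalized}
    for ring in rings:
        if normalized in ring:
            expanded |= ring
    return sorted(expanded)


def to_boolean_or(terms):
    """Join the expanded terms into a quoted Boolean OR expression."""
    return " OR ".join(f'"{term}"' for term in terms)
```

The user types one term and the system searches on all of them, so the thesaurus stays invisible.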
19
Thesauri in Context
Thesaurus Standards
Mono-Lingual Thesauri
• ISO 2788 (1974, 1985, 1986, International)
• BS 5723 (1987, British)
• AFNOR NFZ 47-100 (1981, French)
• DIN 1463 (1987-1993, German)
• ANSI/NISO Z39.19 (1994, United States)
Multi-Lingual Thesauri
• ISO 5964 (1985, International)
20
Thesauri in Context
ANSI/NISO Standard
Z39.19-1993
Guidelines for the Construction, Format, and
Management of Monolingual Thesauri.
84 pp. ISBN: 1-880124-04-1 Price: $49.00
http://www.niso.org/stantech.html
Reasons to Follow Standard
• Significant thinking behind guidelines
• Technology integration
• Cross-database compatibility
21
Thesauri in Context
Oracle’s Perspective
“The phrase…thesaurus standard is somewhat
misleading. The computing industry considers a
‘standard’ to be a specification of behavior or
interface. These standards do not specify
anything. If you are looking for a thesaurus
function interface, or a standard thesaurus file
format, you won't find it here. Instead, these are
guidelines for thesaurus compilers -- compiler
being an actual human, not a program.
What Oracle has done is taken the ideas in these
guidelines and in ANSI Z39.19…and used them
as the basis for a specification of our own
creation…So, Oracle supports ISO-2788
relationships or ISO-2788 compliant thesauri.”
22
Thesauri in Context
A World in Transition
“The majority of basic problems of thesaurus
construction had already been solved by 1967.”
(Krooks and Lancaster, 1993)
Traditional Thesauri | Web Thesauri
Print | Online
Academic / Library | Business
Expert / Repeat Users | Novice / Infrequent Users
Visible | Invisible
Accepted Value | Unknown Value
23
Section Break
I. Thesauri in Context

II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics
24
Value of Thesauri
IA Metrics
• Cost of finding (time, clicks, frustration,
precision).
• Cost of not finding (success, recall,
frustration, alternatives).
• Cost of development (time, budget, staff,
frustration).
• Value of learning (related products, services,
projects, people).
25
Value of Thesauri
KM Metrics
• Revenue Generation (% revenues spent on
KM, new revenue generation)
• Opportunity Cost (staff time, customers lost)
• Knowledge Efficiency (faster product
development, # mistakes made twice)
• Data Quality (% knowledge on intranet, % email
with attachments)
• Intranet Usage (# hits, # contributions)
• Individual Behavior (# citations)
• Technical Performance (uptime, search
response time)
Working Council for Chief Information Officers
Basic Principles of Information Architecture
(http://www.cio.executiveboard.com)
26
Value of Thesauri
Web Site Statistics
Wasted expense: most sites will waste between
$1.5M and $2.1M on redesigns next year.
Forfeited revenue: poorly architected retailing
sites are underselling by as much as 50%.
Lost customers: the sites we tested are driving
away up to 40% of repeat traffic.
Eroded brand: people who have a bad
experience, typically tell 10 others.
Forrester Research
Why Most Web Sites Fail (Sept 98)
27
Value of Thesauri
Intranet Statistics
Employees spend 35% of productive time
searching for information online.
Working Council for Chief Information Officers
Basic Principles of Information Architecture
(http://www.cio.executiveboard.com)
Managers spend 17% of their time
(6 weeks a year) searching for information.
Information Ecology
Thomas Davenport and Lawrence Prusak
(http://argus-acia.com/content/review001.html)
28
Value of Thesauri
Intranet Statistics
Sun Microsystems’ usability experts
calculated that 21,000 employees were
wasting an average of six minutes per day
due to inconsistent intranet navigation
structures. When lost time was multiplied by
staff salaries, the estimated productivity loss
exceeded $10 million per year.
Jakob Nielsen
Web Design and Development
September 1997
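The arithmetic behind an estimate like this can be worked through. The sketch below uses the employee count, minutes lost, and dollar figure from the quote; the number of working days per year is an assumption (the source does not give it), and the result is only the hourly cost implied by those inputs.

```python
EMPLOYEES = 21_000                 # from the quoted study
MINUTES_LOST_PER_DAY = 6           # from the quoted study
ANNUAL_LOSS_DOLLARS = 10_000_000   # lower bound from the quoted study
WORKING_DAYS_PER_YEAR = 240        # assumption, not stated in the source

# Total staff hours lost per day and per year.
hours_lost_per_day = EMPLOYEES * MINUTES_LOST_PER_DAY / 60
hours_lost_per_year = hours_lost_per_day * WORKING_DAYS_PER_YEAR

# Hourly staff cost that would be needed to reach the quoted annual loss.
implied_hourly_cost = ANNUAL_LOSS_DOLLARS / hours_lost_per_year
```

Under the 240-day assumption, six minutes per person comes to 2,100 hours per day and 504,000 hours per year, so a $10 million loss implies a staff cost of roughly $20 per hour.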
29
Value of Thesauri
Intranet Statistics
After spending two years and $3 million on
development and usability testing, Bay Networks
expects to see $10 million in productivity gains
and a 10 percent cycle-time reduction for new
product development as a result of its new
information architecture.
Working Council for Chief Information Officers
Basic Principles of Information Architecture
(http://www.cio.executiveboard.com)
30
Value of Thesauri
Intranet Statistics
40% of corporate users can’t find the information
they need on their intranet.
Prior to intranet reengineering in 1997, Ford
conducted a survey of its 100,000+ user base.
Employees stated they could only find 15% of the
information they needed to do their jobs.
Under-investment in (unstructured) information.
80% spending on 20% (structured) data.
Working Council for Chief Information Officers
Basic Principles of Information Architecture
(http://www.cio.executiveboard.com)
31
Value of Thesauri
Searching Problems
“Most of the complaints we get are due
to the way users search – they use the
wrong keywords.”
- a manufacturing company
“We have problems with the way
customers enter queries. Capitalizations
and misspellings give us headaches.”
- a software company
Forrester Research
Must Search Stink? (June 2000)
32
Value of Thesauri
Searching Statistics
“Search will become the center piece of
navigation.”
90% of firms rate search as very or
extremely important.
52% don’t measure search
effectiveness.
Forrester Research
Must Search Stink? (June 2000)
33
Value of Thesauri
CV Statistics
Researchers at Bell Labs found the probability
that two people would choose the same word to
describe an object to be less than 20%.
Furnas, Landauer, et al.,
Bell Labs (1987)
30% of corporations systematically utilize
metadata to classify information, while only one
to three percent of companies populate those
metadata tags using controlled vocabularies.
71% don’t account for misspellings or synonyms.
Forrester Research
Building an Intranet Portal (Jan 1999)
34
Value of Thesauri
CV Statistics
Principle of unlimited aliasing: by leveraging
synonyms, recall went from 20% to 80%
(in a small collection).
The Trouble with Computers
Research study at Bellcore (Furnas et al. 1987)
“The findings indicate that a hypertext index with
multiple access points for each concept…led to
greater effectiveness and efficiency of retrieval
on almost all measures.”
A Usability Assessment of Online Indexing Structures
By Carol A. Hert, Elin K. Jacob, and Patrick Dawson
Journal of the American Society for Information Science
(September 2000)
35
Value of Thesauri
Complementary Approaches
Basic
• Navigation Design (Browsing)
• Full Text Indexing (Searching)
Advanced
• Collaborative Filtering
• Lexical Databases
• Automated Hierarchy-Generation
36
Value of Thesauri
Navigation Design
Relationships
• Global & Local (hierarchical)
• Contextual (associative)
Content is here,
with contextual
navigation
embedded or
separate.
Where am I?
What's nearby?
Local Navigation
Global Navigation
What's related to
what's here?
37
Value of Thesauri
Full Text Indexing
Strengths
• Enables high precision (exact phrase)
• Enables high recall (word occurrence)
Weaknesses
• Often results in low precision (“aboutness”)
• Often results in low recall (synonyms)
Complementary Use
• Provide users with option (search CV, full text)
• Intelligent next step (no hits on CV > full text)
• Full text search within CV search zones
38
Value of Thesauri
Collaborative Filtering
SN. Approaches that leverage knowledge about
preferences or behaviors of people or
organizations to facilitate information retrieval.
Popularity / Importance
• Direct Hit (analysis of searcher behavior)
• Amazon (cross-title purchasing habits)
• Google (citation indexing)
Considerations
• Favors established materials
• Lacks benefits of vocabulary control
• User-centric (ignores content, context)
39
Value of Thesauri
Lexical Databases
Scope Notes
• Broad term banks or semantic networks
that specify lexical variants and term
relationships.
• General-interest, off-the-shelf thesauri.
Examples
• Roget’s Thesaurus
• WordNet
• Plumb Design Visual Thesaurus
40
Value of Thesauri
Lexical Databases
Number of Terms (General, Niche)
Importance of Context (Bug in Software, Espionage)
Source | # of Terms | # of Meanings | Notes
WordNet | 50,000 | 70,000 |
Oxford English Dictionary | 615,000 | 2.4M | > 20,000 New Terms Per Year
Named Insect Species | 1.4M | | Drosophila UF Fruit Fly
Square D Products | 300,000 | | Electrical Distribution
41
Value of Thesauri
Hierarchy-Generation Software
An Intimidating Vocabulary
• Multivariate regression models, probabilistic
Bayesian models, neural networks, symbolic rule
learning, computational semiotics, and support
vector machines
General Techniques
• Clustering (similarity, word co-occurrence)
• Vector Space (extract “meaning” from terms,
teach by example)
42
Value of Thesauri
Hierarchy-Generation Software
Examples
• Autonomy (http://www.autonomy.com/)
• Semio (http://www.semio.com/)
• Cartia (http://www.cartia.com/)
Hyperbole
Autonomy claims their software eliminates "the
need for any manual labor in the process."
43
Value of Thesauri
Hierarchy-Generation Software
Considerations
• No business context
• No consideration of users
• No planning for future
• Mixed category schemes
• Hidden costs (integration, rule design, training)
Business Context: X, Content: ✓, Users: X
Trends
• Niche use (e.g., news, web search results)
• Integration with manual classification schemes
44
Section Break
I. Thesauri in Context
II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics


45
Methodology
Overview
Phases: Strategy, Design, Build
Activities: Process, Deliverables, Consulting
Special emphasis: Process during Strategy, Deliverables during Design, Consulting during Build
46
Methodology
Strategy x Process
Information Architect’s Toolbox *
Business Context: strategy meetings, opinion leader interviews, technology assessment
Content & Applications: content inventory, content analysis, metadata evaluation
Users: log analysis, observation / usability testing, interviews / affinity modeling
Existing IA: heuristic evaluation, classification scheme analysis, benchmarking
* select right mix for project; this is a partial list of tools
47
Methodology
Design x Deliverables
Information Architect’s Toolbox *
Organization & Labeling: metadata specifications, controlled vocabularies, thesaurus
Navigation (Embedded): primary taxonomy, classification schemes, blueprints and wireframes
Navigation (Supplemental): search system, sitemap / indexes, personalization / customization
Synthesis: design / authoring guidelines, content management policies, functional specifications
* select right mix for project; this is a partial list of tools
48
Methodology
Consulting x Build
Information Architect’s Toolbox *
Metadata Application: object-level indexing guides, support indexers, support thesaurus managers
Point of Production: support designers / developers, usability testing input / analysis, fix problems
Post Launch: metrics, evaluation, improvement
* select right mix for project; this is a partial list of tools
49
Methodology
Thesaurus Construction
Strategy
1. Define Thesaurus Strategy
2. Develop Project Plan
Design
3. Gather Candidate Terms / Variants
4. Select Preferred Terms
5. Develop Facet Hierarchies
6. Identify ‘See Also’ Links
7. Write Design / Functional Specifications
8. Build / Buy Software Applications
Build
9. Launch Indexing Operation
10. Refine Controlled Vocabularies
50
Methodology
Strategy Questions
• Does vocabulary control make sense?
• Where and for what purposes?
• How will it align with business goals?
• How will it support users’ goals?
• How will it impact content management?
• Will we buy, borrow, or build?
51
Section Break
I. Thesauri in Context
II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics



52
Metadata
Definition
Information about information
Purposes
1. Document surrogate (abstract)
2. Provides context (date, publisher)
3. Facilitates retrieval (subject)
53
Metadata
Ways to Leverage
User Interface
• Generate browsable indexes
(site-wide, sub-site, specialized authority files)
• Enable field-specific searching
(filters, zones, sorting)
• Support personalization
(map profile to tags)
Behind the Scenes
• Enable efficient content management
• Support decentralized tagging
54
Metadata
Types of Indexing
Type | Manual | Automated
Full Text | x | complete text minus stop words
Keyword (Natural Language) | humans assign “relevant” words and phrases | software assigns “relevant” words and phrases
Controlled Vocabulary | humans map variants to preferred terms | software maps variants to preferred terms
55
Metadata
Full Text Indexing
56
Metadata
Keyword Indexing
<HTML><HEAD>
<TITLE>STARTREK.COM:The Official Star Trek Web
Site!</TITLE>
<META NAME='description'
CONTENT='STARTREK.COM:The Official Star Trek
Web Site! The starting point for all Star Trek
information on the web.'>
<META NAME='keywords' CONTENT='star trek,
enterprise, james kirk, mister spock, seven of nine,
doctor mccoy, captain sulu, borg, klingon, romulan,
ferengi, human, starfleet command, delta quadrant,
alpha quadrant, gamma quadrant, excelsior,
paramount, voyager, deep space nine, captain sisko,
jean luc picard, kathryn janeway, starfleet academy,
united federation of planets'>
<META NAME='author' CONTENT='Paramount Digital
Entertainment'>
57
Metadata
CV Indexing
Partners/Competitors
UI / LRID | Accepted Term | Variant Terms
PC0004 | Bell Atlantic | BellAtlantic; Bell Atlantic / North; NYNEX; Nynex
PC0091 | NLG | National Leisure Group
PC0076 | VH1 | Video Hits 1; VH-1
58
Metadata
Indexing Guidelines
Considerations
• Specificity: rule of specific entry
• Exhaustivity: number of terms per document
• Aboutness: strive for consistent interpretation
• Consistency: can be more important than quality
• Quality: balance against speed and consistency
59
Metadata
Comparative Analysis
Full Text (extraction)
• High specificity enables precision (sometimes)
• Exhaustivity allows for high recall (sometimes)
Keyword (assignment or extraction)
• Relatively low level of investment
• Selection of more relevant words / phrases may
increase recall and precision (sometimes)
Controlled Vocabulary (assignment)
• Synonym management increases recall
• Disambiguation increases precision
(value increases with size, Medline > 6M documents)
• Enables hierarchical and “see also” browsing
60
Metadata
Cost Analysis
Searching Costs: # users, usage volume, user value, success value, size, complexity
Thesaurus Costs: complexity, vocabulary stability, technology
Indexing Costs: content volume, # fields, time per field, rate of growth / churn
61
Metadata
Automated Indexing
Primary Benefit
• Save money (cost of manually classifying 1
journal article = $1.70)
Approaches
• Term Extraction: extraction of “important”
words and phrases (proximity, stemming)
• Latent Semantic Indexing: vector space
approach (extracts meaning, training required)
Desired Features
• Assign terms from controlled vocabularies
• Integrate with thesauri, database tools, etc.
• Handle multi-lingual collections
62
Metadata
Automated Indexing
Software Categories & Labels
Search Engines, Data Mining, Text Extraction,
Knowledge Management, Automatic
Classification, Meta-Tagging
Leading Products
Metacode’s Metatagger (http://www.metacode.com/)
Mohomine (http://www.mohomine.com/)
Oingo (http://www.oingo.com/)
InXight Categorizer (http://www.inxight.com/)
Semio Taxonomy (http://www.semio.com/)
Inktomi / Ultraseek CCE (http://www.inktomi.com/)
63
Metadata
Selecting a Strategy
Factors to Consider | Manual | Automated
Cost (per document) | High | Low
Speed | Slow | Fast
Consistency | Variable | High
Quality | Variable | Variable
Multimedia-Capable | Yes | No
Intelligent (understand text and guidelines) | Yes | No
64
Section Break
I. Thesauri in Context
II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics




65
Vocabulary Control
Getting Started
Types
1. Equivalence
2. Hierarchical
3. Associative
Example (preferred term: Vermont)
1. Equivalence: Vt (Variant), Green Mountain State (Variant)
2. Hierarchical: United States (Broader), Burlington (Narrower)
3. Associative: Skiing (Related), Maple Syrup (Related)
66
Vocabulary Control
Identify Terms
Published Reference Materials
Thesauri, classification schemes, encyclopedias,
dictionaries, glossaries, indexes
Content
Representative sample of web site / intranet
Users
Search log analysis, surveys, interviews
Experts
Authors, subject experts
67
Vocabulary Control
Organize Terms
1. Define preferred terms
2. Link synonyms and variants
3. Group preferred terms by subject
4. Identify broader and narrower terms
5. Identify related terms
Note: steps 3-5 are tentative designations and
part of an iterative process.
68
Vocabulary Control
Form of Preferred Terms
Grammatical Form (noun, adjective, verb)
Spelling (defined authority, house style)
Singular & Plural Form (count nouns)
Abbreviations & Acronyms (popular use)
Considerations
• Stemming helps (but not for mouse/mice)
• Global guidelines / term-specific decisions
• Rules simplify decision-making
• Consistency enhances usability
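The stemming point can be illustrated with a deliberately naive plural stripper. This is a sketch under stated assumptions, not a real stemming algorithm: `naive_stem` and `normalize` are hypothetical names, the rules cover only simple English plurals, and the irregular form is handled by an explicit variant entry instead.

```python
def naive_stem(word):
    """Strip simple English plural endings; irregular plurals pass through."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


# Irregular forms are handled as variant terms in the vocabulary,
# because no suffix rule connects them.
IRREGULAR_VARIANTS = {"mice": "mouse"}


def normalize(word):
    """Prefer an explicit variant mapping, then fall back to stemming."""
    word = word.lower()
    return IRREGULAR_VARIANTS.get(word, naive_stem(word))
```

The stemmer conflates "computers" with "computer" but leaves "mice" untouched, which is why the singular and plural decision still needs a term-specific entry.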
69
Vocabulary Control
Selection of Preferred Terms
ANSI/NISO Z39.19-1993
3.0 “Literary warrant (occurrence of terms in
documents) is the guiding principle for selection
of the preferred (term).”
5.2.2 “Preferred terms should be selected to
serve the needs of the majority of users.”
70
Vocabulary Control
Definition of Terms
The meaning of the term must be
deliberately restricted.
Qualifiers (manage homographs)
Cells (biology) / Cells (electric)
Scope Notes (restrict meaning)
Hamburger. SN: includes burgers made with
beef. Otherwise use “Turkey Burger” or
“Veggie Burger”
Definition (clarify and educate)
Trend towards integration of glossaries
71
Vocabulary Control
Variant Terms
Variant terms provide the users with entry
points into the vocabulary.
Synonyms (same meaning)
cats USE felines, helicopters USE whirlybirds
Lexical Variants (different word forms)
paediatrics USE pediatrics, BK USE Burger King
Quasi-Synonyms (treated as equivalent)
generic posting: beagle USE dog
antonyms/continuum: wetness USE dryness
72
Vocabulary Control
Recall and Precision
Recall Devices: Word Stemming, Variants (OR), Generic Posting, Relationships
Precision Devices: Specificity, Coordination (AND), Compound Terms, Term Definition, Proximity
Costs: Time to Find, Failure to Find, Development
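Recall is the share of relevant documents that were retrieved, and precision is the share of retrieved documents that are relevant. The sketch below shows how adding variants with OR raises recall on a toy collection; the documents, their words, and the relevance judgments are all invented for illustration.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Toy collection (hypothetical): document id -> indexed words.
DOCUMENTS = {
    1: {"pda", "review"},
    2: {"handheld", "computer", "prices"},
    3: {"palmpilot", "tips"},
    4: {"pda", "accessories"},
    5: {"desktop", "computer", "prices"},
}
RELEVANT = {1, 2, 3, 4}  # assumed judgments: documents about PDAs


def search_any(words):
    """Boolean OR search: ids of documents containing any of the words."""
    return {doc_id for doc_id, text in DOCUMENTS.items() if text & set(words)}


single_term = recall_and_precision(search_any(["pda"]), RELEVANT)
with_variants = recall_and_precision(
    search_any(["pda", "handheld", "palmpilot"]), RELEVANT)
too_broad = recall_and_precision(search_any(["pda", "computer"]), RELEVANT)
```

In this toy case the single term finds half of the relevant documents, the variant set finds all of them, and the overly broad term "computer" pulls in an irrelevant document and lowers precision.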
73
Vocabulary Control
Term Specificity
Assuming a good entry vocabulary,
increased term specificity allows for
improved precision without hurting recall
(but costs grow fast).
Vocabulary A: United States
Vocabulary B: United States, California, San Diego
74
Vocabulary Control
Compound Terms
ANSI/NISO Z39.19.
“Each descriptor…should represent a
single concept.”
ISO 2788.
“It is a general rule that…compound
terms should be factored (split) into
simple elements.”
75
Vocabulary Control
Compound Terms
Article: “Software for Information Architecture”
One Term (high precision)
• Information Architecture Software
Two Terms
• Information Architecture
• Software
Three Terms (high recall)
• Architecture
• Information
• Software
76
Section Break
I. Thesauri in Context
II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics





77
Structure & Relationships
Types
• Bottom-up (semantic, term to term)
• Top-down (shape, classification)
Semantic Relationships (reciprocity)
• Equivalence
• Hierarchical
• Associative
78
Structure & Relationships
Semantic Relationships
Preferred: Inhabited Places
Equivalence: Human Settlements (Synonym), Settlements (Variant)
Hierarchical: Cultural Landscapes (Broader), Ghost Towns (Narrower)
Associative: Housing (Related), Dwellings (Related)
79
Structure & Relationships
Semantic Relationships
Equivalence
• Use/Used For (USE/UF)
• Leads from variants to preferred
e.g., prams: USE baby carriages
A=B
80
Structure & Relationships
Semantic Relationships
Hierarchical
• Broader Term/Narrower Term (BT/NT)
Types
• Generic (class/species, inheritance)
Vertebrata NT Amphibia
• Whole-Part (associative unless exclusive)
Ear NT Vestibular Apparatus
• Instance (proper name)
Seas NT Mediterranean Sea
A
B
81
Structure & Relationships
Semantic Relationships
Associative
• Related Term (RT, See Also)
• Non-hierarchical and non-equivalent
• Relation should be “strongly implied”
e.g., hammers RT nails
A
B
82
Structure & Relationships
Associative Relationships
Examples
Field of Study and Object of Study
• Forestry RT Forests
Process and its Agent
• Temperature Control RT Thermostat
Concepts and their Properties
• Poisons RT Toxicity
Action and Product of Action
• Weaving RT Cloth
Concepts Linked by Causal Dependence
• Bereavement RT Death
83
Structure & Relationships
Classification Schemes
SN
Hierarchical arrangement of terms.
In navigation context, use Hierarchy.
UF
Categorization
Taxonomy
Ontology
RT
Hierarchy
84
Structure & Relationships
Pre- & Post-Coordination
Enumerative Classification Schemes
• Pre-coordinate (more compound terms)
• All terms are enumerated (listed) in their
entirety in the scheme.
Library of Congress Classification Scheme
Synthetic Classification Schemes
• Post-coordinate (more uni-terms)
• New terms can be created by combining
terms during a search (AND).
Art & Architecture Thesaurus
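Post-coordination can be sketched as intersecting the postings of simple terms at search time. The example below is a minimal sketch: the function names and the three indexed documents are hypothetical, and the uni-terms reuse the "Software for Information Architecture" example from the Compound Terms slide.

```python
from collections import defaultdict

# Hypothetical index: document id -> uni-terms assigned at indexing time.
INDEXED = {
    "doc1": {"Information", "Architecture", "Software"},
    "doc2": {"Architecture", "Software"},
    "doc3": {"Information", "Software"},
}


def build_postings(indexed):
    """Invert the index: term -> set of document ids tagged with it."""
    postings = defaultdict(set)
    for doc_id, terms in indexed.items():
        for term in terms:
            postings[term].add(doc_id)
    return postings


def search_all(postings, *terms):
    """Post-coordinate search: intersect each term's postings (Boolean AND)."""
    results = [postings.get(term, set()) for term in terms]
    return set.intersection(*results) if results else set()


POSTINGS = build_postings(INDEXED)
```

No compound term was ever enumerated in the scheme; the combination is built by the searcher, which is the trade-off the synthetic approach makes.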
85
Structure & Relationships
Pre- & Post-Coordination
• In the highly enumerative LC Classification,
“Groundwater - - Pollution” and “Soil pollution” are
dispersed at indexing (high precision, low recall).
• Keyword searching improves recall, hurts precision
(a synthetic band-aid, potential false drop on
“soil purification standards”).
86
Structure & Relationships
Polyhierarchy
Strict Hierarchies
• Each term appears in only
one place in the hierarchy.
• Essential for placement
of physical objects.
Polyhierarchies
• Terms cross-listed
in multiple categories.
• Accepts complex
nature of reality.
87
Structure & Relationships
Polyhierarchy
Medical Subject Headings (MeSH)
• Compound terms needed to manage 6 million documents in Medline.
• High level of pre-coordination forces polyhierarchy.
• Terms may have more than one BT.
Example: Diseases NT Virus Diseases, NT Respiratory Tract Diseases;
Viral Pneumonia BT Virus Diseases, BT Respiratory Tract Diseases
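A polyhierarchy can be sketched as a mapping in which a term may list more than one broader term. The Python below is an illustration only: the function names are hypothetical, and the placement of "Diseases" above both parents is an assumption read from the slide's diagram.

```python
# Each term maps to its broader terms (BT); more than one is allowed.
BROADER_TERMS = {
    "Viral Pneumonia": ["Virus Diseases", "Respiratory Tract Diseases"],
    "Virus Diseases": ["Diseases"],                # assumed from the diagram
    "Respiratory Tract Diseases": ["Diseases"],    # assumed from the diagram
    "Diseases": [],
}


def all_broader_terms(term, broader=BROADER_TERMS):
    """Collect every ancestor of a term by following all BT links."""
    seen = []
    queue = list(broader.get(term, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(broader.get(current, []))
    return seen


def narrower_terms(term, broader=BROADER_TERMS):
    """Derive NT links as the reciprocal of the stored BT links."""
    return sorted(child for child, parents in broader.items() if term in parents)
```

Because NT is derived from BT, a document indexed under Viral Pneumonia can be reached by browsing down from either parent.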
88
Structure & Relationships
Faceted Classification
Overview
• Invented by S.R. Ranganathan (1930s)
• Handle complex subjects (reality)
• One principle of division at a time
• Multiple “pure” taxonomies
• UF analytico-synthetic scheme, fielded database
Facets
• Fundamental facets: personality, matter, energy,
space, time
• Common facets: subject (about), geography (in),
author (by whom)
Art & Architecture Thesaurus, ASIS Thesaurus
89
Structure & Relationships
Facets, Coordination, Specificity
Partial List of Potential Combinations
Facets
Entities: Apples, Pears, Peaches
Processes: Canning, Freezing, Drying
Forms: Canned, Frozen, Fresh
Single terms: Apples, Pears, Peaches, Canning, Freezing, Drying, Canned, Frozen, Fresh
Process and entity: Canning of Apples, Canning of Pears, Canning of Peaches, Freezing of Apples, Freezing of Pears, Freezing of Peaches, Drying of Apples, Drying of Pears, Drying of Peaches
Form and entity: Canned Apples, Canned Pears, Canned Peaches, Frozen Apples, Frozen Pears, Frozen Peaches, Fresh Apples, Fresh Pears, Fresh Peaches
Process, form, and entity: Freezing of Canned Apples, Canning of Dried Pears, Drying of Fresh Peaches
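The growth in combinations can be counted directly. The sketch below uses the three facets from the slide and Python's standard `itertools.product`; the variable names are illustrative, and the phrasing patterns ("X of Y") simply follow the examples listed above.

```python
from itertools import product

ENTITIES = ["Apples", "Pears", "Peaches"]
PROCESSES = ["Canning", "Freezing", "Drying"]
FORMS = ["Canned", "Frozen", "Fresh"]

# Two-facet combinations, phrased as in the slide's examples.
process_entity = [f"{p} of {e}" for p, e in product(PROCESSES, ENTITIES)]
form_entity = [f"{f} {e}" for f, e in product(FORMS, ENTITIES)]

# Three-facet combinations.
three_way = [f"{p} of {f} {e}" for p, f, e in product(PROCESSES, FORMS, ENTITIES)]

# A synthetic scheme stores only the uni-terms; an enumerative scheme
# would have to list every combination it wants to support.
uni_terms = len(ENTITIES) + len(PROCESSES) + len(FORMS)
enumerated = uni_terms + len(process_entity) + len(form_entity) + len(three_way)
```

Nine uni-terms support 54 possible headings here, and each added facet value widens the gap further.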
90
Structure & Relationships
Yahoo
Characteristics
• Single Facet (a topical hierarchy)
• Fairly Enumerative (search on “Boston” finds
45 categories including: Boston Celtics, Boston
Tea Party, Anonymous Account of the Boston
Massacre)
• Polyhierarchical (Computer Science@ listed
under Computers & Internet and Science)
Observations
• Huge number of categories and levels (unwieldy)
• Fits user expectations (where do I find this?)
91
Structure & Relationships
ASIS Thesaurus
Characteristics
• Faceted (16 facets including document types,
fields and disciplines, organizations, qualities)
• Fairly Synthetic (large percentage of one or two
word single-concept descriptors)
• Polyhierarchical (machine aided indexing
BT computer applications, BT indexing)
Observations
• Faceted approach allows small number of terms
to be combined in large number of unexpected
ways (e.g., ambiguity and informatics)
• Presentation is not accessible to typical user
92
Structure & Relationships
A Unification Theory
Taxonomy | Thesaurus
single facet, enumerative | faceted, synthetic
fits user expectations (where did they put this?) | fits content complexity (how can I describe this?)
use for top few levels (familiar gateway to site) | populate the hierarchy (combinations, see also)
early user tests (best primary hierarchy) | ongoing user tests (leverage power, flexibility)
application of human expertise | human-software hybrid (facet-specific solutions)
Hypothesis: This hybrid information architecture will
become a common model for web sites and intranets
over the next several years.
93
Section Break
I. Thesauri in Context
II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics






94
Thesaurus Management
What’s Involved?
• Software, workflow, quality control
• Vocabularies evolve over time
• Impacts authors, indexers, users
Vocabulary Maintenance Tasks
• Add, delete, enhance, normalize terms
• Overall evaluation
95
Thesaurus Management
Software: What to Look For
• Traditional database functionality
• Compliant with standards (ANSI, ISO)
• Relationship control (reciprocity, validation,
orphan identification)
• Term status (proposed, provisional, accepted)
• Flexible output (alphabetical, hierarchical)
• Integration with related tools and tasks
(indexing, searching, browsing)
Willpower’s List of Thesaurus Software
http://www.willpower.demon.co.uk/thessoft.htm
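Relationship control is the kind of check that software can do mechanically. The sketch below is an assumption about how reciprocity validation and orphan identification might be expressed, not a description of any listed product; the function names and the sample relations are hypothetical.

```python
# Each relationship tag and the tag its reciprocal entry must carry.
RECIPROCALS = {"BT": "NT", "NT": "BT", "RT": "RT", "USE": "UF", "UF": "USE"}


def missing_reciprocals(relations):
    """Given a set of (source, tag, target) tuples, return the
    reciprocal tuples that are absent from the set."""
    missing = set()
    for source, tag, target in relations:
        inverse = (target, RECIPROCALS[tag], source)
        if inverse not in relations:
            missing.add(inverse)
    return missing


def orphans(terms, relations):
    """Return terms that take part in no relationship at all."""
    linked = {s for s, _, _ in relations} | {t for _, _, t in relations}
    return sorted(set(terms) - linked)


relations = {
    ("Vermont", "BT", "United States"),
    ("United States", "NT", "Vermont"),
    ("Vermont", "RT", "Skiing"),          # reciprocal RT is missing
    ("Vt", "USE", "Vermont"),
    ("Vermont", "UF", "Vt"),
}
terms = ["Vermont", "United States", "Skiing", "Vt", "Maple Syrup"]
```

Here the validator reports that Skiing lacks its RT back to Vermont, and that Maple Syrup has been entered without any links.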
96
Thesaurus Management
Software: What You’ll Find
Thesaurus Management Software
• Standards-compliant, sophisticated
• Poor integration (library-centric)
• Examples: Lexico, MultiTes
Database Management Software
• Strong integration
• Less thesaurus-specific functionality
• Examples: Oracle (interMedia),
Sybase (English Wizard)
97
Thesaurus Management Software
What You’ll Find
Search Engines
• Watch for casual use of “thesaurus”
• Look for integration with browsing.
Ultraseek
Thesaurus Expansion for Queries: Administrators may
put sets of synonyms in the thesaurus.txt file…When a
query matches one of the terms in that file, the
synonyms will automatically appear, so the user has the
option to add it to the query.
Verity
Verity's core search products include the following
advanced knowledge retrieval capabilities: advanced
query expansion and disambiguation tools, including
linguistic stemming and thesaurus expansion.
98
Section Break
I. Thesauri in Context
II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics







99
Case Study
Call Center Intranet
Introduction
• KM application
• 6,000 users (customer care associates)
• 8,000 documents (hierarchy, search)
• 6 month project (10/97 to 4/98)
• $500K of $10M redesign
Goals
• Reduce training time / time to find
• Increase use / customer satisfaction
100
Case Study: Call Center Intranet
Process Overview
Strategy
• Background, vocabulary, meetings, observation
• 4 weeks x 2.5 PM + 1 IA
Design
• Bottom-up focus (doc types, fields, templates)
• 4 weeks x 2 PM + 2 IA
• 4 weeks x 1 IA (during implementation)
Implementation
• Indexing / develop controlled vocabularies
• Specifications (authors, indexers, developers)
• 16 weeks x 4 indexers + 1 IA + 2 PM
+ 1 subject expert
101
Case Study: Call Center Intranet
Controlled Vocabularies
Primary Vocabularies
• Partners/Competitors (122)
• Plans/Promotions (173)
• Products/Services (151 / 184 variants)
• Geographic Codes (51)
Secondary Vocabularies
• Adjustment Codes (36)
• Corporate Terminology (70)
• Time Codes (12)
102
Case Study: Call Center Intranet
Primary Vocabularies
Partners/Competitors
UI / LRID | Accepted Term | Variant Terms
PC0004 | Bell Atlantic | BellAtlantic; Bell Atlantic / North; NYNEX; Nynex
PC0091 | NLG | National Leisure Group
PC0076 | VH1 | Video Hits 1; VH-1
103
Case Study: Call Center Intranet
Primary Vocabularies
Products/Services
UI / LRID | Accepted Term | Variant Terms
PS0135 | Access Dialing | 10-288; 10-322; dial around
PS0006 | Air Miles | AirMiles
PS0151 | XYZ Direct | USADirect; XYZ USA Direct; XYZDirect card
104
Case Study: Call Center Intranet
Primary Vocabularies
Geographic Codes
CT | Connecticut
DE | Delaware
DC | District of Columbia; Dist. of Columbia; Dist. Columbia
Note: Continental U.S. is equivalent to the lower 48 states.
105
Case Study: Call Center Intranet
Secondary Vocabularies
Adjustment Codes
DAK | Denies All Knowledge | -
MOS | Monthly Service Charge | Mnthly. Service Charge; Mnthly. Svc. Charge; Monthly Svc. Charge
WNO | Wrong Number | -
WTN | Working Telephone Number | Working Tele. Number
106
Case Study: Call Center Intranet
Secondary Vocabularies
Corporate Terminology
Billed Telephone Number (BTN) | Billed Tele. Number
Cross Boundary Account | Foreign Account
Fraud | -
Multi Level Marketing | Multi-Level Marketing; MultiLevel Marketing; MLM
World Wide Web | WWW; WorldWideWeb
107
Case Study: Call Center Intranet
Blueprints
Customer Care
Search Interface
Browse by Topics
Browse by Partners & Competitors
Express Links (Top 10)
Advanced Search
Browse by Products & Services
Browse by Geography
Express Links
Browse by Plans & Promotions
Browse by What's New
108
Case Study: Call Center Intranet
Wireframes: Content
109
Case Study: Call Center Intranet
Wireframes: Browsable Index
Provides ability to view all documents tagged
with same preferred term. Ability to combine
fields for powerful search/browse.
110
Case Study: Call Center Intranet
Deliverables Overview
• Blueprints and Wireframes
• Controlled Vocabularies
• Authoring & Indexing Guidelines
• Indexed Documents (4,000)
• Functional Specifications
• Documentation & Training
111
Section Break
I. Thesauri in Context
II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics








112
Related Topics
Multi-Lingual Thesauri
Concepts
• Source / Target Language
• Degrees of Equivalence
• Localization, not Globalization
Facts (from The Mother Tongue by Bill Bryson)
• There are now more students of English in China
than there are people in the United States
• The French can’t distinguish house and home
• Finnish has 15 case forms (noun variants)
• The Eskimos have 50 words for types of snow
but no word that just means snow
• A blizzard in England is a flurry in Nebraska
113
Related Topics
The List Goes On…
Thesauri AND
• Business Strategy
• Content Management
• Markup Languages
• Notation
• XML
114
Seminar Review
I. Thesauri in Context
II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics









115
How To Learn More
Argus Center for Information Architecture
Web Site
http://argus-acia.com
Email Newsletter
Strange Connections, Events, Interviews
Thesaurus Resources & Examples
http://argus-acia.com/seminars/
user name and password both = “lajolla”
116
Contact Us
Argus Associates, Inc.
912 North Main Street
Ann Arbor, Michigan 48104
(734) 913-0010
Sales
[email protected]
Employment
http://argus-inc.com/recruiting/
Web Sites
http://argus-inc.com/
http://argus-acia.com/
117