Thesaurus Construction and Use
University of California, Berkeley
School of Information
IS 245: Organization of Information In
Collections
IS 257 – Fall 2007
2007.04.04 - SLIDE 1
Lecture Overview
• Review
– Facetted Classification
• Traditional vs. Facetted Classification
• Designing Facetted Classifications
• Today
– Thesaurus design
– Steps in Thesaurus development
– Indexing
Hierarchical Classification
• Literature
– English, French, Spanish, ...
• Prose, Poetry, Drama, ... (under each language)
– 16th, 17th, 18th, 19th (under each genre)
Slide author: Marti Hearst
Labeled Categories for Hierarchical
Classification
• LITERATURE
– 100 English Literature
• 110 English Prose
– English Prose 16th Century
– English Prose 17th Century
– English Prose 18th Century
– ...
• 111 English Poetry
– 121 English Poetry 16th Century
– 122 English Poetry 17th Century
– ...
• 112 English Drama
– 130 English Drama 16th Century
– …
– 200 French Literature
Slide author: Marti Hearst
Facetted Categories
• Mutually exclusive
– Non-overlapping, distinct categories
• Relational
– Relations between facets, subfacets, and foci
(elements) are not restricted to hierarchical
generalization-specialization relations
• Composable
– Combined using grammars of order and
relation to form compound descriptions
Facetted Classification Along With Labeled
Categories
• A Language
– a English
– b French
– c Spanish
• B Genre
– a Prose
– b Poetry
– c Drama
• C Period
– a 16th Century
– b 17th Century
– c 18th Century
– d 19th Century
• Aa English Literature
• AaBa English Prose
• AaBaCa English Prose 16th Century
• AbBbCd French Poetry 19th Century
• BcCd Drama 19th Century
Slide author: Marti Hearst
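The notation above can be read mechanically: each upper-case letter names a facet and the lower-case letter after it names a focus within that facet. A minimal Python sketch, assuming the three facet tables from this slide. The function name `expand` and the error handling are illustrative choices, not part of the scheme. The word "Literature" in "Aa English Literature" is implied by the collection and is not generated here.

```python
import re

# Facet tables copied from the slide: facet letter -> (facet name, foci).
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def expand(code: str) -> str:
    """Turn a compound code such as 'AaBaCa' into a readable label."""
    pairs = re.findall(r"([A-Z])([a-z])", code)
    if "".join(facet + focus for facet, focus in pairs) != code:
        raise ValueError(f"malformed code: {code!r}")
    # A KeyError here means the facet or focus letter is not in the tables.
    return " ".join(FACETS[facet][1][focus] for facet, focus in pairs)


print(expand("AaBaCa"))  # English Prose 16th Century
print(expand("AbBbCd"))  # French Poetry 19th Century
```

Because the code is just a sequence of facet/focus pairs, partial descriptions such as a genre plus a period need no extra entries in the tables.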
Ranganathan
• PMEST Facets
– P(ersonality)
• WHO: The most important types or names of things for the
particular discipline
– M(atter)
• WHAT: Constituent materials
– E(nergy)
• HOW: Action or activity terms
– S(pace)
• WHERE: Where things occur
– T(ime)
• WHEN: When things occur
“Classical” CRG/BC2 Facet Analysis
• Entity
• Kind
• Part
• Property
• Material
• Process
• Operation
• Patient
• Product
• By-Product
• Agent
• Space
• Time
“Classical” Facet Analysis
• What is being done?
– Entity
– Kind
– Product
– By-Product
• What are its parts?
– Part
• How is this achieved?
– Process
• By what means?
– Operation
• By whom?
– Agent
– Patient
• What are its properties?
– Property
– Material
• Where?
– Space
• When?
– Time
“Classical” Facet Analysis
• Nouns
– Entity
– Kind
– Part
– Patient
– Product
– By-Product
– Agent
• Intransitive Verb
– Process
• Transitive Verb
– Operation
• Adverb
– Space
– Time
• Adjectives
– Property
– Material
Semantic and Syntactic Relationships
• Semantic relationships
– Is-A (thing/kind, genus/species)
• Mammals
– Primates
» Humans
– Has-Parts
• Human
– Head
» Eyes
• Syntactic relationships
– Compounds
• Wheat + harvesting = “wheat harvesting”
• Object + operation = operation on object
Facetted Classification
• Clearly distinguishes between semantic
relationships and syntactic relationships
– Semantic relationships
• Within a facet
• Containment relations
– Syntactic relationships
• Across facets
• Combinatoric relations
• Have a “syntax” for syntactic combination
of semantic terms
Power of Facet Combinations
• The syntactic relations of facetted
classifications enable a small controlled
vocabulary to produce
– Many, many structured descriptions
– Complex, but formally structured descriptions
using nested compound descriptions
– Descriptions for things we do not have words
for
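To put a rough number on this, the sketch below counts the descriptions available from the Language/Genre/Period example used earlier in the deck. It assumes every facet is optional and that any combination of foci is allowed, which is a simplification. Real schemes constrain order and combination.

```python
from itertools import product

# Foci per facet, from the Language/Genre/Period example.
facets = {
    "Language": ["English", "French", "Spanish"],
    "Genre": ["Prose", "Poetry", "Drama"],
    "Period": ["16th Century", "17th Century", "18th Century", "19th Century"],
}

vocabulary_size = sum(len(foci) for foci in facets.values())

# Let each facet be left out (None), then drop the all-empty description.
options = [[None] + foci for foci in facets.values()]
descriptions = [
    " ".join(part for part in combo if part is not None)
    for combo in product(*options)
    if any(part is not None for part in combo)
]

print(vocabulary_size)    # 10
print(len(descriptions))  # 79
```

Ten controlled terms yield 79 distinct descriptions (4 x 4 x 5 - 1), and adding one focus to any facet multiplies rather than adds to that total.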
Today
• More on thesaurus standards and
examples
Types of Indexing Languages
• Uncontrolled keyword indexing
• Indexing languages
– Controlled, but not structured
• Thesauri
– Controlled and structured
• Classification systems
– Controlled, structured, and coded
• Facetted classification systems
Thesauri
• A Thesaurus is a collection of selected
vocabulary (preferred terms or descriptors)
with links among synonymous, equivalent,
broader, narrower and other related terms
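One way to picture this definition is as a record per preferred term with the usual link types: UF (used for), BT (broader term), NT (narrower term) and RT (related term). The sketch below is an assumed minimal data structure, not the layout of any particular thesaurus. The class and method names are hypothetical, and the entry term "People" is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """One preferred term and its links to other terms."""
    term: str
    used_for: set[str] = field(default_factory=set)  # UF: synonyms, entry terms
    broader: set[str] = field(default_factory=set)   # BT
    narrower: set[str] = field(default_factory=set)  # NT
    related: set[str] = field(default_factory=set)   # RT


class Thesaurus:
    def __init__(self) -> None:
        self.descriptors: dict[str, Descriptor] = {}

    def get(self, term: str) -> Descriptor:
        """Fetch a descriptor, creating an empty one on first use."""
        if term not in self.descriptors:
            self.descriptors[term] = Descriptor(term)
        return self.descriptors[term]

    def add_broader(self, term: str, broader: str) -> None:
        """Record BT together with its reciprocal NT."""
        self.get(term).broader.add(broader)
        self.get(broader).narrower.add(term)

    def add_related(self, a: str, b: str) -> None:
        """RT links are symmetric."""
        self.get(a).related.add(b)
        self.get(b).related.add(a)

    def add_used_for(self, term: str, entry_term: str) -> None:
        self.get(term).used_for.add(entry_term)


t = Thesaurus()
t.add_broader("Primates", "Mammals")
t.add_broader("Humans", "Primates")
t.add_used_for("Humans", "People")  # illustrative entry term
print(sorted(t.get("Primates").narrower))  # ['Humans']
```

Writing BT and NT in a single step keeps the reciprocal links consistent, which is one of the checks a thesaurus editor would otherwise do by hand.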
Thesaurus Standards
• National and International Standards for
Thesauri
– ANSI/NISO Z39.19-1994 -- American National
Standard Guidelines for the Construction, Format and
Management of Monolingual Thesauri
– ANSI/NISO Draft Standard Z39.4-199x -- American
National Standard Guidelines for Indexes in
Information Retrieval
– ISO 2788 -- Documentation -- Guidelines for the
establishment and development of monolingual
thesauri
– ISO 5964 -- Documentation -- Guidelines for the
establishment and development of multilingual
thesauri
Thesaurus Examples
• Examples
– Non-Facetted
• The ERIC Thesaurus of Descriptors
– Semi-Facetted
• The Medical Subject Headings (MESH) of the
National Library of Medicine
– Facetted
• The Art and Architecture Thesaurus
ERIC Thesaurus – Entry
ERIC Thesaurus – Alphabetic
ERIC Thesaurus – KWIC Index
ERIC Thesaurus – Hierarchies
ERIC Thesaurus – Groups
ERIC Thesaurus – Online
http://www.ericfacility.net/extra/pub/thessearch.cfm
MESH – Entry
MESH – Alphabetic
MESH – Tree Structures
MESH – KWOC Index
MESH – Online
http://www.nlm.nih.gov/mesh/meshhome.html
AAT – Facets
AAT – Hierarchies (print)
AAT – Hierarchies (online)
http://www.getty.edu/research/tools/vocabulary/aat/
AAT – Entry (online)
Lecture Overview
• Thesaurus Design and Development
– Controlled Vocabularies for topical description
– Thesaurus Design
– Steps In Thesaurus Development (intro)
Why Develop a Thesaurus?
• To provide a conceptual structure or
“space” for a body of information
– To make it possible to adequately describe
the topical content of information resources at
an appropriate level of generality or specificity
– To provide enhanced search capabilities and
to improve the effectiveness of searching (i.e.,
to retrieve most of the relevant material
without too much irrelevant material)
Why Develop a Thesaurus?
• To provide vocabulary (or terminological)
control
– When there are several possible terms
designating a single concept, the thesaurus
should lead the indexer or searcher to the
appropriate concept, regardless of the terms
they start with
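Vocabulary control can be pictured as a lead-in table that maps every entry term to one preferred term. The sketch below is a minimal illustration under that assumption. The table contents and the name `preferred_term` are made up for the example.

```python
from typing import Optional

# Hypothetical lead-in vocabulary: entry term -> preferred term (a USE reference).
USE = {
    "cars": "Automobiles",
    "motor cars": "Automobiles",
    "autos": "Automobiles",
    "automobiles": "Automobiles",
}


def preferred_term(query: str) -> Optional[str]:
    """Return the preferred term for whatever term the user started with."""
    key = " ".join(query.lower().split())  # normalise case and spacing
    return USE.get(key)


print(preferred_term("Motor  Cars"))  # Automobiles
print(preferred_term("bicycles"))     # None
```

Indexers and searchers both go through the same table, so a document indexed under one synonym is still found by a search that starts from another.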
Preliminary Considerations
• What is used now?
– Continue using an existing thesaurus?
– Ad hoc modification of existing thesaurus?
– Develop a new well-structured thesaurus?
• What is the scope and complexity of the
subject field?
• What kind of retrieval objects or data will
be dealt with?
• How exhaustive and specific is the desired
description of objects?
Preliminary Considerations
• The scope and complexity of the field will
provide some indication of the scope and
complexity of the thesaurus
– It is better to plan for a larger and more
comprehensive system than a smaller system
that rapidly will become inadequate as the
database grows
• Development of a good thesaurus requires
a major intellectual effort as well as clerical
operations like data entry and production
of sorted lists
Development of a Thesaurus
• Term Selection.
• Merging and Development of Concept
Classes.
• Definition of Broad Subject Fields and
Subfields.
• Development of Classificatory structure
• Review, Testing, Application, Revision.
1. Term Selection
• Select sources for the collection of terms.
– Prearranged Sources
– Open-ended Sources
• Assign codes to each source.
• Selection of terms
– For part of prearranged and for all open-ended sources
• Enter terms into database with all information.
1.1 Kinds of Sources
• Prearranged Sources
– Existing descriptor lists, classification schemes, and
thesauri. This includes universal schemes like DDC or
LCSH.
– Nomenclatures of single disciplines
– Treatises on the terminology of a field
– Encyclopedias, lexica, dictionaries and glossaries.
– Tables of contents of textbooks and handbooks
– Indexes of journals or abstracting journals
– Indexes of other publications in the field
1.1 Kinds of Sources
• Open-ended sources
– Lists of search requests or interest profiles
– Description of projects/activities to be served by the
information retrieval system.
– Discussion with specialists in the field
– Sample of documents in the field
• Ask users why and how these documents relate to the field.
• Have documents indexed by experts in the field
– Lists of titles of documents in the field
– Abstracts and reviews of documents
– Your own knowledge
Selection of sources
• Prearranged sources require less effort in
gathering the material, and may already indicate
some relationships among terms and concepts.
• Open-ended sources can reflect current
terminology and may provide more complete
coverage.
• Choose a set of sources that are current, as
complete as possible, and considered
authoritative.
Selection of Sources
• Each selected source is assigned an ID for
tracking its use in the development of the
thesaurus.
– Useful when making decisions about which
terms to prefer
– Useful for backtracking when questions arise
(where did this come from?)
Selection of Terms
• Terms can be transferred directly from
prearranged sources to the recording
medium (cards or database)
– Decide whether to include only selected terms and
references, or to take the whole source
Selection of Terms
• In open-ended sources you read through
the source and pick out terms (i.e., words
and phrases) that might be useful in
retrieval or as references to other terms.
• Alternatively, use keyword and phrase
extraction software to create lists of terms
and select from those.
• Transfer selected terms to the recording
medium (cards or database).
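As a rough idea of what keyword and phrase extraction software does, the sketch below counts single words and two-word phrases that neither start nor end with a stopword. It is a deliberately simple stand-in, not a description of any specific tool. The stop list, the function name `candidate_terms` and the sample text are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny stop list for illustration; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}


def candidate_terms(text: str, max_words: int = 2) -> Counter:
    """Count words and short phrases that do not start or end with a stopword."""
    counts: Counter = Counter()
    # Split on punctuation first so phrases never cross a sentence boundary.
    for run in re.split(r"[^\w\s-]+", text.lower()):
        words = run.split()
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts


sample = ("Wheat harvesting depends on weather. "
          "Mechanised wheat harvesting reduces the cost of harvesting.")
for term, n in candidate_terms(sample).most_common(3):
    print(n, term)
```

The output is only a candidate list. Choosing which of these words and phrases are useful for retrieval remains a manual decision.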
2. Merging and Development of Concept
Classes
• Sort Term DB into alphabetical order.
• First Round: Merge information for identical terms,
possibly pulling info from additional sources.
• Second Round: Merge synonyms or terms in the same
concept class.
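The two merging rounds can be sketched on a toy term database. The example assumes each record is a (term, source code) pair, as produced by term selection. The terms, the source codes "S1" to "S3" and the synonym decision are all made up for illustration.

```python
from collections import defaultdict

# (term, source code) pairs as recorded during term selection (illustrative data).
records = [
    ("Automobiles", "S1"),
    ("cars", "S2"),
    ("automobiles", "S3"),
    ("Bicycles", "S1"),
    ("Cars", "S3"),
]

# First round: sort alphabetically and merge identical terms (ignoring case),
# keeping every source code so each term can be traced back to its sources.
merged: dict[str, set[str]] = defaultdict(set)
for term, source in sorted(records, key=lambda r: r[0].lower()):
    merged[term.lower()].add(source)

# Second round: merge terms judged to belong to the same concept class.
# That judgement is intellectual work; here it is simply supplied by hand.
same_concept = {"cars": "automobiles"}
classes: dict[str, dict[str, set[str]]] = defaultdict(dict)
for term, sources in merged.items():
    classes[same_concept.get(term, term)][term] = sources

for concept, members in classes.items():
    print(concept, {t: sorted(s) for t, s in members.items()})
```

Keeping the source codes through both rounds is what later makes it possible to answer "where did this come from?" and to see which sources favour which term.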
3. Definition of Broad Subject Fields and
Subfields
• Define Broad Subject fields and sort terms into these
broad fields
• Define subfields within each broad field and sort terms
into these subfields.
• Work out the detailed structure
– Select Preferred Terms
– Merge information for terms in the same concept class
• Repeat these steps
– for each subfield within a broad field
– and for each broad field
– until all terms have been consolidated and preferred
terms selected
4. Development of Classificatory Structure
• Produce preliminary version of classified index and
update the working database.
• Improve classificatory structure
• Reality check: produce a version of the classified index
and distribute it to users/experts.
5. Final Stages
• Review
• Testing
• Application
• Revision
Review
• Discuss classified index with
users/experts.
– Select descriptors and checklist descriptors.
• Assign Notational Symbols
• Produce Main Thesaurus & Indexes
Review (cont.)
• Check cross references and insert where
needed
• Produce Test Version
• Test by Indexing
• Modify as needed
• Produce Production Version.
Testing a Thesaurus
• Assign descriptors to a sample set of NEW
documents (use enough to get an idea of
any gaps in the thesaurus).
• Test retrieval using sample questions and
see how effectively the thesaurus maps
them to the appropriate descriptors
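A simple way to quantify the first test is to count how many concepts from the sample documents already have a descriptor and to list the ones that do not. The sketch below assumes the concepts have been identified by an indexer. The function name `indexing_gaps`, the lead-in table and the sample documents are hypothetical.

```python
def indexing_gaps(documents: dict[str, list[str]], lead_in: dict[str, str]):
    """Return (coverage, gaps): share of concepts with a descriptor, and the misses."""
    total = 0
    gaps: dict[str, list[str]] = {}
    for doc_id, concepts in documents.items():
        for concept in concepts:
            total += 1
            if concept.lower() not in lead_in:
                gaps.setdefault(concept.lower(), []).append(doc_id)
    covered = total - sum(len(ids) for ids in gaps.values())
    return (covered / total if total else 0.0), gaps


# Illustrative data: entry term -> preferred term, and concepts per new document.
lead_in = {"cars": "Automobiles", "automobiles": "Automobiles"}
docs = {
    "d1": ["cars", "road safety"],
    "d2": ["Automobiles", "road safety"],
}
coverage, gaps = indexing_gaps(docs, lead_in)
print(coverage)  # 0.5
print(gaps)      # {'road safety': ['d1', 'd2']}
```

Concepts that recur across several documents in the gap list are the strongest candidates for new descriptors or new lead-in terms.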
Flow of Work in Thesaurus Construction
• Select Sources
• Assign codes
• Select Terms
• Record Selected Terms
• Sort Terms
• Merge identical Terms
• Merge Terms in Same Concept class
• Define Broad Subject Fields
• Sort Terms into Broad Subject Fields
• Define Subfields within one Subject Field
• Work out detailed structure of the Subject Field
• Select Preferred Terms
• All Subfields of Broad Subject finished?
– No: continue with the next subfield
– Yes: go on
• All Broad Subjects finished?
– No: continue with the next broad subject field
– Yes: go on
• Improve Class Structure
• Print Classified Index and review
• Discuss with Experts and Users
• Select descriptors and checklist items
• Many Modifications?
– Yes: Revise as needed
– No: go on
• Assign Notation
• Produce Full Thesaurus and Check references
• Review and Test
Based on Soergel, pp 327-333
The Indexing Process
• Concept identification
• Term selection (via thesaurus)
• Term assignment
Application: The Indexing Process
(Manual)
Flowchart boxes and decision points, in approximate order:
• Start
• Examine Document and Identify Significant Concepts
• Consider First Concept
• Does Thesaurus contain term for Concept? (YES / NO)
• Consider Preferred Term
• Consider any associated terms in Thesaurus (NT, BT)
• Would Concept be better represented by one of these terms? (YES / NO)
• Prefer Alternative Term(s)
• Select Preferred Term
• Can Concept be expressed by combining terms? (YES / NO)
• Consider Each of These Terms
• Establish Term Denoting Concept
• Is Term suitable as Preferred Term? (YES / NO)
• Admit New Term Into Thesaurus
• Select Alternative term to represent Concept
• Is There Another Concept? (YES / NO)
• Assign Terms to Document
• End
Adapted from ISO 5963, p.5
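The main decisions of the manual process can be sketched as a loop over the concepts found in a document. This is one simplified reading of the flowchart, not a transcription of ISO 5963. The branch that consults broader and narrower terms is left out, unmatched concepts are only collected as candidates instead of being admitted, and the function name and sample data are hypothetical.

```python
def index_document(concepts, lead_in):
    """Assign terms to one document; return (assigned, candidates).

    concepts -- concepts an indexer identified in the document
    lead_in  -- entry term -> preferred term
    """
    assigned, candidates = [], []
    for concept in concepts:
        key = concept.lower()
        if key in lead_in:                 # thesaurus contains a term for it
            assigned.append(lead_in[key])
            continue
        parts = [lead_in.get(word) for word in key.split()]
        if len(parts) > 1 and all(parts):  # express it by combining terms
            assigned.extend(parts)
            continue
        candidates.append(concept)         # candidate for a new term
    # Drop duplicates but keep first-seen order.
    return list(dict.fromkeys(assigned)), candidates


# Illustrative lead-in vocabulary and concepts.
lead_in = {"wheat": "Wheat", "harvesting": "Harvesting", "corn": "Maize"}
assigned, candidates = index_document(
    ["Wheat harvesting", "corn", "Crop rotation"], lead_in
)
print(assigned)    # ['Wheat', 'Harvesting', 'Maize']
print(candidates)  # ['Crop rotation']
```

Collecting candidates instead of adding them automatically reflects the point in the flowchart where a person decides whether a new term is suitable for the thesaurus.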
Thesaurus Revision and Updates
• There will always be new concepts,
products, or expressions that need to be
added to the thesaurus.
– Set a regular schedule of reviews and
revisions.
– Collect complaints, problems, etc. and fold
into revision of the thesaurus
References
• Soergel, D. Indexing Languages and Thesauri:
Construction and Maintenance. Los Angeles: Melville
Publishing Co., 1974.
• Foskett, A.C. The Subject Approach to Information.
London: Clive Bingley, 1982.
• Standards:
– ANSI/NISO Z39.19-1994 -- American National Standard
Guidelines for the Construction, Format and Management of
Monolingual Thesauri
– ANSI/NISO Draft Standard Z39.4-199x -- American National
Standard Guidelines for Indexes in Information Retrieval
– ISO 2788 -- Documentation -- Guidelines for the establishment
and development of monolingual thesauri
– ISO 5964 -- Documentation -- Guidelines for the establishment
and development of multilingual thesauri