A New Unsupervised Approach to Automatic Topical
Indexing of Scientific Documents According to Library
Controlled Vocabularies
Arash Joorabchi & Abdulhussain E. Mahdi
Department of Electronic and Computer Engineering
University of Limerick, Ireland
ALISE 2013
Work supported by: OCLC/ALISE Library & Information Science Research Grant Program; Irish Research Council 'New Foundations' Scheme
Subject (Topical) Metadata in Libraries
• Un-controlled: unrestricted author- and/or reader-assigned keywords and keyphrases, such as:
  – Index Term-Uncontrolled (MARC-653)
• Controlled: restricted cataloguer-assigned classes and subject headings, such as:
  – DDC (MARC-082)
  – LCC (MARC-050)
  – LCSH/FAST (MARC-650)
The Case of Scientific Digital Libraries & Repositories
Archived materials include: journal articles, conference papers, technical reports, theses & dissertations, book chapters, etc.
• Un-controlled subject metadata:
  – Commonly available when enforced by editors, e.g., for published journal articles & conference proceedings, but rare in unedited publications.
  – Inconsistent.
• Controlled subject metadata:
  – Rare, due to the sheer volume of new materials published and the high cost of cataloguing.
  – Highly incomplete and inaccurate due to oversimplified classification rules, e.g., IF published by the Dept. of Computer Science THEN DDC: 004, LCSH: Computer science.
Automatic Subject Metadata Generation in Scientific Digital Libraries & Repositories
Aims to provide a fully/semi-automated alternative to manual classification.
1. Supervised (ML-based) approach:
  – Utilizes generic machine learning algorithms for text classification (e.g., NB, SVM, DT).
  – Challenged by the large scale and complexity of library classification schemes, e.g., deep hierarchy, skewed data distribution, data sparseness, and concept drift [Jun Wang '09].
2. Unsupervised (string matching-based) approach:
  – String-to-string matching between words in a term list extracted from library thesauri & classification schemes and words in the text to be classified.
  – Inferior performance compared to supervised methods [Golub et al. '06].
A New Unsupervised Concept-to-Concept Matching Approach - An Overview
[Workflow diagram: the paper/article's full text is mined for Wikipedia concepts, which are ranked to select its key concepts (MARC 653). The WorldCat database is then queried for MARC records sharing key concept(s) with the paper/article, and DDC (MARC 082) and FAST (MARC 650) subject metadata are inferred from those records for the paper/article's MARC record.]
Wikipedia as a Crowd-Sourced Controlled Vocabulary
• Extensive topic/concept coverage (> 4M English articles)
• Up-to-date (lags Twitter by only ~3h on major events [Osborne et al. '12])
• Rich knowledge source for NLP (semantic relatedness, word sense disambiguation)
• Detailed description of concepts
Example: a Wikipedia concept detected in a paper maps to its FAST counterpart via alternative labels and related terms, e.g., 653: {Wikipedia: HP 9000} → 650: {FAST: HP 9000 (Computer)}.
Wikipedia Concepts – Detection in Text
Wikification using WikipediaMiner – an open-source toolkit for mining Wikipedia [Milne, Witten '09]
Descriptor: String (computer science)
Other senses (disambiguation): String (theory), String (rope), String (music), …
Non-descriptors (alternative labels): character string, text string, binary string
Block Edit Models for Approximate String Matching
Abstract
In this paper we examine the concept of string block edit distance, where two strings A and B are compared by
extracting collections of substrings and placing them into correspondence. This model accounts for certain phenomena
encountered in important real-world applications, including pen computing and molecular biology. The basic problem
admits a family of variations depending on whether the strings must be matched in their entireties, and whether overlap
is permitted. We show that several variants are NP-complete, and give polynomial-time algorithms for solving…
Wikipedia Concepts – Ranking Features
1. Occurrence Frequency
2. First Occurrence
3. Last Occurrence
4. Occurrence Spread
5. Length
6. Lexical Diversity
7. Lexical Unity
8. Avg Link Probability
9. Max Link Probability
10. Generality
11. Speciality
12. Distinct Links Count
13. Links Out Ratio
14. Links In Ratio
15. Avg Disambiguation Confidence
16. Max Disambiguation Confidence
17. Link-Based Relatedness to Other Topics
18. Link-Based Relatedness to Context
19. Cat-Based Relatedness to Other Topics
20. Translations Count
Key Wikipedia Concepts – Ranking & Filtering

Unsupervised:
    Score(topic_j) = Σ_{i=1}^{|F|} f_ij
Pros:
  – Easy to implement & fast
  – Plug & play, i.e., no training needed
Cons (naïve assumptions):
  – Assumes all features carry the same weight
  – Assumes all features contribute to the importance probability of candidates linearly

Supervised (genetic algorithm):
    Score(topic_j) = Σ_{i=1}^{|F|} w_i · f_ij^{d_i}
1. Initial population: a set of ranking functions with random weight and degree parameter values within a preset range
2. Evaluate the fitness of each ranking function
3. (selection, crossover, mutation) -> new generation
4. Repeat steps 2 & 3 until the threshold is passed

Genetic algorithm (ECJ) settings:
  Species: Float; Population Size: 40; Genome Size: 40; Chunk Size: 2; Min Gene: 0.0; Max Gene: 2.0; Elites: 1; Crossover Type: two points; Selection Method: Tournament; Mutation Type: Reset; Mutation Probability: 0.05; Threads: 2
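The two ranking schemes above can be sketched as follows (illustrative Python, not the authors' implementation; the feature values and the GA-evolved weights and degrees are assumed to be given):

```python
# Sketch: scoring a candidate topic from its feature vector.
# The unsupervised variant treats all |F| features equally and linearly;
# the supervised variant applies a GA-tuned weight w_i and degree d_i
# per feature, with genes evolved in the preset range [0.0, 2.0].

def score_unsupervised(features):
    """Score(topic_j) = sum of f_ij over all features."""
    return sum(features)

def score_weighted(features, weights, degrees):
    """Score(topic_j) = sum of w_i * f_ij^d_i over all features."""
    return sum(w * (f ** d) for f, w, d in zip(features, weights, degrees))

features = [0.4, 0.9, 0.1]   # hypothetical normalized feature values
weights  = [1.7, 0.2, 1.1]   # hypothetical evolved weights
degrees  = [1.0, 2.0, 0.5]   # hypothetical evolved degrees

print(score_unsupervised(features))
print(score_weighted(features, weights, degrees))
```

In the GA run, the fitness of a candidate (weights, degrees) pair is the ranking quality it produces on the training documents.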
Key Wikipedia Concepts – Evaluation: Dataset & Measure
Wiki-20 dataset [Medelyan, Witten '08]:
  – 20 Computer Science-related papers/articles.
  – Each annotated by 15 Human Annotator (HA) teams independently.
  – HAs assigned an average of 5.7 topics per document,
  – with an average of 35.5 unique topics assigned per document.
Rolling's inter-indexer consistency (equivalent to F1):
    Inter-indexer consistency(A, B) = 2c / (a + b)
where a and b are the numbers of topics assigned by annotators A and B, and c is the number of topics they share.
[Venn diagram of overlapping topic sets of human annotators (HA1, HA2, HA3) and the machine annotator omitted.]
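Rolling's measure is straightforward to compute over two annotators' topic sets (a minimal sketch with made-up topic sets):

```python
# Sketch: Rolling's inter-indexer consistency between two annotators,
# equivalent to the F1 measure over their assigned topic sets.

def inter_indexer_consistency(topics_a, topics_b):
    a, b = set(topics_a), set(topics_b)
    c = len(a & b)                     # topics both annotators assigned
    return 2 * c / (len(a) + len(b))   # 2c / (a + b)

ha1 = {"machine learning", "parsing", "compilers"}
ha2 = {"machine learning", "compilers", "set theory", "logic"}
print(inter_indexer_consistency(ha1, ha2))  # 2*2 / (3+4) = 4/7
```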
Key Wikipedia Concepts – Evaluation: Results
Performance comparison with human annotators and rival machine annotators. nk is the number of keyphrases assigned per document; Min./Avg./Max. are the inter-indexer consistency values with human annotators (%).

Method                            Learning approach                           nk                        Min.   Avg.   Max.
TFIDF (baseline)                  n/a - unsupervised                          5                         5.7    8.3    14.7
KEA++ (KEA-5.0)                   Naïve Bayes                                 5                         15.5   22.6   27.3
Grineva et al.                    n/a - unsupervised                          5                         18.2   27.3   33.0
Maui                              Naïve Bayes (all 14 features)               5                         22.6   29.1   33.8
Maui                              Bagging decision trees (all 14 features)    5                         25.4   30.1   38.0
Human annotators (gold standard)  n/a - senior CS students                    varied, avg. 5.7 per doc  21.4   30.5   37.1
CKE                               n/a - unsupervised                          5                         22.7   30.6   38.3
Current work                      n/a - unsupervised                          5                         19.1   30.7   37.9
Maui                              Bagging decision trees (13 best features)   5                         23.6   31.6   37.9
Current work (LOOCV)              GA, threshold=800, unique bests method      5                         12.3   32.8   58.1
Current work (LOOCV)              GA, threshold=200, unique bests method      5                         13.9   32.9   56.7
Current work (LOOCV)              GA, threshold=400, unique bests method      5                         14.0   33.5   58.1

– Joorabchi, A. and Mahdi, A.E. Automatic Subject Metadata Generation for Scientific Documents Using Wikipedia and Genetic Algorithms. In Proceedings of the 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2012).
– Joorabchi, A. and Mahdi, A.E. Automatic Keyphrase Annotation of Scientific Documents Using Wikipedia and Genetic Algorithms. To appear in the Journal of Information Science.
Querying WorldCat Database
For each of the top 30 key concepts in the document, an SRU query is sent to the WorldCat database, returning up to 100 potentially related MARC records:

http://worldcat.org/webservices/catalog/search/sru?query=
    srw.kw = Doc_Key_Concept_Descriptor
    AND srw.ln exact eng      //Language
    AND srw.la all eng        //Language Code (Primary)
    AND srw.mt all bks        //Material Type
    AND srw.dt exact bks      //Document Type (Primary)
  &servicelevel = full
  &maximumRecords = 100
  &sortKeys = relevance,,0    //Descending order
  &wskey = [wskey]
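Assembling the request shown above can be sketched as follows (the CQL indexes and parameters are taken from the slide; the wskey placeholder must be replaced by a real WorldCat API key, and the quoting of multi-word descriptors is an assumption):

```python
# Sketch: building the WorldCat SRU request URL for one key-concept descriptor.
from urllib.parse import quote

BASE = "http://worldcat.org/webservices/catalog/search/sru"

def build_sru_url(descriptor, wskey="[wskey]"):
    # CQL query restricting results to English-language books.
    cql = (f'srw.kw = "{descriptor}"'
           ' AND srw.ln exact "eng"'    # language
           ' AND srw.la all "eng"'      # language code (primary)
           ' AND srw.mt all "bks"'      # material type
           ' AND srw.dt exact "bks"')   # document type (primary)
    return (f"{BASE}?query={quote(cql)}"
            "&servicelevel=full"
            "&maximumRecords=100"
            "&sortKeys=relevance,,0"    # descending relevance
            f"&wskey={wskey}")

print(build_sru_url("Natural language processing"))
```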
Refining Key Concepts Based on WorldCat Search Results
Notation: Doc_Key_Concepts = {doc_key_concepts_i}, |Doc_Key_Concepts| ≤ 30; for each concept, Marc_Recs_i = {marc_recs_i,j}, |Marc_Recs_i| ≤ 100, with total_matches_i hits in WorldCat.

For each doc_key_concepts_i ∈ Doc_Key_Concepts:
IF   log_e(total_matches_i + 1) ≥ InDoc_Score(doc_key_concepts_i)
       (concept too generic, e.g., "Logic" (72,353 matches): 13.7 > 10.3, vs. "Linear logic" (17 matches): 2.83 < 8.6)
  OR total_matches_i = 0
       (concept too obscure, e.g., "Logical conjunction")
  OR ( InDoc_Score(doc_key_concepts_i) < 0.8 × InDoc_Score(doc_key_concepts_1) AND |Refined_Doc_Key_Concepts| ≥ 10 )
  OR |Refined_Doc_Key_Concepts| ≥ 20
THEN discard doc_key_concepts_i
ELSE Refined_Doc_Key_Concepts := Refined_Doc_Key_Concepts ∪ {doc_key_concepts_i}
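A minimal sketch of this refinement rule (the grouping of the conditions follows one reading of the slide and is not guaranteed to match the authors' exact thresholds; the concept names and scores below are from the slide's example):

```python
# Sketch: filtering the document's key concepts by their WorldCat hit counts.
import math

def refine(doc_key_concepts, in_doc_score, total_matches):
    """doc_key_concepts: list ranked by descending InDoc_Score;
    in_doc_score / total_matches: dicts keyed by concept."""
    refined = []
    top_score = in_doc_score[doc_key_concepts[0]]
    for c in doc_key_concepts:
        too_generic = math.log(total_matches[c] + 1) >= in_doc_score[c]
        no_matches = total_matches[c] == 0
        weak_tail = in_doc_score[c] < 0.8 * top_score and len(refined) >= 10
        if too_generic or no_matches or weak_tail or len(refined) >= 20:
            continue  # discard
        refined.append(c)
    return refined
```

With the slide's example, "Logic" (72,353 matches) is dropped as too generic, "Logical conjunction" (0 matches) as too obscure, while "Linear logic" (17 matches) survives.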
MARC Records Parsing, Classification, Concept Detection
For each refined key concept (|Doc_Key_Concepts| ≤ 20) and each of its matching MARC records (|Marc_Recs_i| ≤ 100):
• Parsed MARC fields:
    001           Control Number
    245 ($a)      Title Statement (Title)
    505 ($a, $t)  Formatted Contents Note
    520 ($a, $b)  Summary, Etc.
    650 ($a)      Subject Added Entry-Topical Term
    653 ($a)      Index Term-Uncontrolled
• OCLC Classify* assigns DDC_i,j and FAST_i,j to each record.
• Wikipedia-Miner detects the record's Wikipedia concepts (Marc_Concepts_i,j) in the parsed fields.
*OCLC Classify finds the most popular DDC & FASTs for the work using the OCLC FRBR Work-Set algorithm.
Measuring Relatedness Between MARC Records and the Article/Paper
Notation: Doc_Key_Concepts = {doc_key_concepts_i}, |Doc_Key_Concepts| ≤ 20; Marc_Recs_i = {marc_recs_i,j}, |Marc_Recs_i| ≤ 100, with total_matches_i.

All_Unique_Marc_Recs = ∪_{i=1}^{|Doc_Key_Concepts|} Marc_Recs_i

Shared_Concepts_{i,j} = { x ∈ Marc_Concepts_{i,j} : x ∈ Doc_Key_Concepts }

Normalized_Freq(shared_concepts_k) = InMarc_Freq(shared_concepts_k) / |All_Unique_Marc_Recs|

Inverse_Marc_Freq(shared_concepts_k) = |All_Unique_Marc_Recs| / |{ marc_recs ∈ All_Unique_Marc_Recs : shared_concepts_k ∈ marc_recs }|

Relatedness(MC_{i,j}, DKC) = Σ_{k=1}^{|Shared_Concepts_{i,j}|} InDoc_Score(shared_concepts_k) × log₂(Normalized_Freq(shared_concepts_k) + 1) × log₂(Inverse_Marc_Freq(shared_concepts_k))
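The measure above is TF-IDF-like: a MARC record is related to the document in proportion to the in-document score of each shared concept, scaled up by how often the concept appears across the retrieved records and down by how ubiquitous it is. A sketch (not the authors' code; input structures are illustrative):

```python
# Sketch: relatedness of one MARC record to the document's key concepts.
import math

def relatedness(marc_concepts, doc_key_concepts, in_doc_score,
                in_marc_freq, n_unique_marc_recs):
    """marc_concepts: Wikipedia concepts detected in the MARC record;
    in_marc_freq[c]: number of retrieved MARC records containing concept c;
    n_unique_marc_recs: |All_Unique_Marc_Recs|."""
    shared = set(marc_concepts) & set(doc_key_concepts)
    total = 0.0
    for c in shared:
        norm_freq = in_marc_freq[c] / n_unique_marc_recs
        inverse_marc_freq = n_unique_marc_recs / in_marc_freq[c]
        total += (in_doc_score[c]
                  * math.log2(norm_freq + 1)
                  * math.log2(inverse_marc_freq))
    return total
```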
Weighting DDC Candidates

All_Unique_Marc_Recs = ∪_{i=1}^{|Doc_Key_Concepts|} Marc_Recs_i

Unique_DDCs = { x = DDC_{i,j} : marc_recs_{i,j} ∈ All_Unique_Marc_Recs, 1 ≤ i ≤ |Doc_Key_Concepts|, 1 ≤ j ≤ |Marc_Recs_i| }

For each unique_ddcs_k ∈ Unique_DDCs:

Freq(unique_ddcs_k) = Σ_{i=1}^{|Doc_Key_Concepts|} |{ j : unique_ddcs_k = DDC_{i,j}, 1 ≤ j ≤ |Marc_Recs_i| }|

Highest_ValidDDCs_Count_PerConcept = max{ x ∈ ℕ : x = |{ j : DDC_{i,j} ≠ 0 }|, doc_key_concepts_i ∈ Doc_Key_Concepts }

Normalized_Freq(unique_ddcs_k) = Freq(unique_ddcs_k) / ( |Doc_Key_Concepts| × Highest_ValidDDCs_Count_PerConcept )

Inverse_Concept_Freq(unique_ddcs_k) = |Doc_Key_Concepts| / |{ doc_key_concepts_i ∈ Doc_Key_Concepts : unique_ddcs_k ∈ { DDC_{i,j} : 1 ≤ j ≤ |Marc_Recs_i| } }|

Average_Relatedness(unique_ddcs_k) = ( Σ_{i=1}^{|Doc_Key_Concepts|} Σ_{j=1}^{|Marc_Recs_i|} Relatedness_{i,j} · ⟦unique_ddcs_k = DDC_{i,j}⟧ ) / Freq(unique_ddcs_k)

Inverse_Average_Total_Matches(unique_ddcs_k) = Freq(unique_ddcs_k) / Σ_{i : unique_ddcs_k ∈ {DDC_{i,j}}} total_matches_i

Weight(unique_ddcs_k) = log₂(Freq(unique_ddcs_k) + 1) × log₂(Normalized_Freq(unique_ddcs_k) + 1) × log₂(Inverse_Concept_Freq(unique_ddcs_k) + 1) × Average_Relatedness(unique_ddcs_k) × log₂(Inverse_Average_Total_Matches(unique_ddcs_k) + 1)
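The weighting combines frequency, concept coverage, relatedness, and specificity signals multiplicatively. A sketch, assuming the five factors have already been computed per candidate (the exact combination is a best-effort reading of the slide):

```python
# Sketch: combining the five per-candidate factors into a single weight.
# Higher frequency, broader concept coverage, stronger average relatedness,
# and more specific (fewer-hit) candidates all push the weight up.
import math

def weight_candidate(freq, normalized_freq, inverse_concept_freq,
                     avg_relatedness, inverse_avg_total_matches):
    return (math.log2(freq + 1)
            * math.log2(normalized_freq + 1)
            * math.log2(inverse_concept_freq + 1)
            * avg_relatedness
            * math.log2(inverse_avg_total_matches + 1))
```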
Weighting FAST Candidates
The same weighting scheme is applied to FAST candidates, with FAST_{i,j} headings in place of DDC_{i,j} classes: Unique_FASTs, Freq, Highest_ValidFASTs_Count_PerConcept, Normalized_Freq, Inverse_Concept_Freq, Average_Relatedness, and Inverse_Average_Total_Matches are defined analogously, and

Weight(unique_fasts_k) = log₂(Freq(unique_fasts_k) + 1) × log₂(Normalized_Freq(unique_fasts_k) + 1) × log₂(Inverse_Concept_Freq(unique_fasts_k) + 1) × Average_Relatedness(unique_fasts_k) × log₂(Inverse_Average_Total_Matches(unique_fasts_k) + 1)
DDCs Weight Aggregation & Outlier Detection
Sort the Unique_DDCs set by DDC depth in descending order.
For each DDC_i ∈ Unique_DDCs do:
  For each DDC_j ∈ Unique_DDCs do:
    IF subclass(DDC_i, DDC_j) THEN
      IF weight(DDC_i) > highest_DDC_weight / 10 THEN
        weight(DDC_i) = weight(DDC_i) + weight(DDC_j); discard DDC_j
      ELSE discard DDC_i
Example:
    006.312 : 10.99
  + 006.31  : 19.61  =  30.61
  + 006.3   : 12.78  =  43.39
Outlier detection*: a DDC_i is kept as a final candidate only if weight(DDC_i) > (upper inner fence = Q3 + 1.5 × IQR).
*Box-plot outliers: DDCs whose weights lie an abnormal distance above the others', i.e., mild and extreme outliers.
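The aggregation and outlier steps can be sketched as follows (illustrative Python; the string-prefix test stands in for the real subclass relation, which would also need to handle notation such as "--" subdivisions):

```python
# Sketch: fold parent-class weights into their deepest surviving subclass,
# then keep only classes whose weight clears the upper inner fence.
import statistics

def aggregate(ddc_weights):
    """ddc_weights: dict such as {'006.312': 10.99, '006.31': 19.61, '006.3': 12.78}."""
    highest = max(ddc_weights.values())
    weights = dict(ddc_weights)
    by_depth = sorted(ddc_weights, key=len, reverse=True)  # deepest first
    for child in by_depth:
        if child not in weights:
            continue
        if weights[child] <= highest / 10:
            del weights[child]  # weak subclass: discard it instead of its parents
            continue
        for parent in by_depth:
            if parent != child and parent in weights and child.startswith(parent):
                weights[child] += weights.pop(parent)  # fold parent into child
    return weights

def upper_outliers(weights):
    """Keep only weights above the upper inner fence Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(list(weights.values()), n=4)
    fence = q3 + 1.5 * (q3 - q1)
    return {d: w for d, w in weights.items() if w > fence}
```

On the slide's example, `aggregate` folds 006.31 and 006.3 into 006.312, giving the combined weight 43.39.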
FASTs Weight Aggregation & Outlier Detection
Unique_FASTs := { x ∈ Unique_FASTs : weight(x) > highest_FAST_weight / 10 }
For each FAST_i ∈ Unique_FASTs do:
  For each FAST_j ∈ Unique_FASTs do:
    IF related(FAST_i, FAST_j) AND WC_SubjectUsage(FAST_i) < WC_SubjectUsage(FAST_j)
    THEN weight(FAST_i) = weight(FAST_i) + weight(FAST_j)
Example:
  Expert systems (Computer science), weight: 4.22, subjectUsage: 14,685
    -> seeAlsoHeading: Artificial intelligence
    -> seeAlsoHeading: Computer systems
    -> seeAlsoHeading: Soft computing
  + Artificial intelligence (subjectUsage: 36,145), weight: 5.21  =  9.44
Final candidates are again the box-plot outlier headings by weight.
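The merge rule above can be sketched as follows (illustrative; `related` and `subject_usage` stand in for the FAST seeAlsoHeading links and WorldCat subject-usage counts):

```python
# Sketch: fold a related (seeAlso) heading's weight into the less-used
# heading, so the more specific of two linked headings accumulates weight.

def merge_related(weights, related, subject_usage):
    for a in list(weights):
        for b in list(weights):
            if b in related.get(a, ()) and subject_usage[a] < subject_usage[b]:
                weights[a] += weights[b]
    return weights
```

On the slide's example, "Expert systems (Computer science)" (usage 14,685) absorbs the weight of its seeAlso heading "Artificial intelligence" (usage 36,145), 4.22 + 5.21 ≈ 9.44.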
DDCs Binary Evaluation
Wiki-20 dataset [Medelyan, Witten '08] containing 20 Computer Science-related papers/articles.

Recall = (number correctly assigned) / (total possible correct) = TP / (TP + FN)
Precision = (number correctly assigned) / (total assigned) = TP / (TP + FP)
F1 = 2 · Pr · Re / (Pr + Re)

Doc IDs: 287, 7183, 7502, 9307, 10894, 12049, 13259, 16393, 18209, 19970, 20287, 23267, 23507, 23596, 25473, 37632, 39172, 39955, 40879, 43032.

Predicted DDCs (by the current method): 519.542 Decision theory; 006.35 Natural language processing; 006.333 Deduction, problem solving, reasoning; 005.131 Symbolic logic; 005.757--0218 Object-oriented databases--Standards; 621.3815--0287 Components and circuits--Testing and measurement; 005.43 Systems programs; 001.6443 (invalid in DDC22 & DDC23); 004.53 Internal storage (Main memory); 005.115 Logic programming; 511.322 Set theory; 005.275 Programming for multiprocessor computers; 004.35 Multiprocessing; 004.33 Real-time processing; 005.117 Object-oriented programming; 495.6--5 Japanese--Grammar; 658.4036--028546 Group decision making--Computer communications; 515.2433 Fourier and harmonic analysis; (one document below threshold); 005.14 Verification, testing, measurement, debugging; 006.4--015116 Computer pattern recognition--Combinatorics; 005.117 Object-oriented programming; 004 Computer science; 005.262 Programming in specific programming languages.

True DDCs include: 006.333 Deduction, problem solving, reasoning; 005.757 Object-oriented databases; 005.14 Verification, testing, measurement, debugging; 005.453 Compilers; 001.4226 Presentation of statistical data; 005.435 Memory management programs; 006.35 Natural language processing; 006.37 Computer vision; 006.31 Machine learning; 005.26 Programming for personal computers.

Overall: TP = 14, FP = 9, FN = 10, Pr = 0.61, Re = 0.58, F1 = 0.60.

For comparison, ACT-DL* predicted mostly top-level classes (004 for 15 of the 20 documents, plus 000, 400, 510, and 150 twice), reflecting its imbalanced training set (004: 78k records; 005: 100; 006: 403). F1 = [0.05, 0.75] across the three setups compared below: ACT-DL (BASE dataset), ACT-DL (Wiki-20 dataset), and the current work (Wiki-20 dataset).

*Automatic Classification Toolbox for Digital Libraries (ACT-DL), by Bielefeld University Library, deployed at the Bielefeld Academic Search Engine (BASE).
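The binary-evaluation arithmetic for the DDC results above, as a quick check:

```python
# Sketch: precision, recall, and F1 from the reported TP/FP/FN counts.

def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(prf(14, 9, 10))  # rounds to (0.61, 0.58, 0.60), matching the slide
```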
DDCs Hierarchical Evaluation
Per-level results down the DDC hierarchy (L1 = top level) and for facets.

Current work (Wiki-20 dataset):
Level   TP   FP   FN   Pr     Re     F1
L1      21   2    3    0.91   0.88   0.89
L2      21   2    3    0.91   0.88   0.89
L3      18   5    6    0.78   0.75   0.77
L4      17   5    7    0.77   0.71   0.74
L5      15   5    8    0.75   0.65   0.70
L6      10   4    4    0.71   0.71   0.71
L7      2    2    1    0.50   0.67   0.57
Facet   2    3    0    0.40   1.00   0.57
Avg.                   0.72   0.78   0.73

ACT-DL (Wiki-20 dataset), levels below L3 not reached:
Level   TP   FP   FN   Pr     Re     F1
L1      16   4    4    0.80   0.80   0.80
L2      16   4    4    0.80   0.80   0.80
L3      1    19   19   0.05   0.05   0.05

ACT-DL (BASE dataset), deeper levels not reported:
Level   Pr     Re     F1
L1      0.90   0.75   0.81
L2      0.78   0.56   0.63
L3      0.77   0.55   0.62
L4      0.55   0.55   0.55
Avg.    0.82   0.62   0.69
FASTs Binary Evaluation
Same 20 Wiki-20 documents and measures as in the DDC evaluation.

Predicted FAST headings (TP + FP = 64 across the 20 documents): Bayesian statistical decision theory; Bayesian statistical decision theory--Industrial applications; Maximum entropy method; Econometric models; Model-based reasoning; Knowledge acquisition (Expert systems); Expert systems (Computer science); Semantics; Case-based reasoning; Object-oriented databases; UML (Computer science); Booch method; Software patterns; Object-oriented methods (Computer science); Object-oriented databases--Standards; Regression analysis; Struts framework; Application software--Testing; Yacc (Computer file); Assembling (Electronic computers); Three-dimensional display systems; Interactive computer systems; Interactive multimedia; Distributed shared memory; Intel i860 (Microprocessor); Cache memory; Virtual storage (Computer science); Predicate (Logic); Modality (Logic); Set theory; Sorting (Electronic computers); Parallel algorithms; Data transmission systems; Virtual computer systems; Parallel computers; Modula-3 (Computer program language); ML (Computer program language); Object-oriented databases; Abstract data types (Computer science); English language--Noun phrase; Grammar, Comparative and general--Noun phrase; Automatic speech recognition; Teams in the workplace--Data processing; Data compression (Telecommunication); Image compression; Signal processing--Mathematics; Wavelets (Mathematics); Video compression; Digital video; Data compression (Computer science); Software visualization; Debugging in computer science; Matching theory; Text processing (Computer science); Graphical user interfaces (Computer systems); Smalltalk (Computer program language); Objective-C (Computer program language); Automatic speech recognition; Speech processing systems; Supervised learning (Machine learning); HP-UX; Hewlett-Packard computers--Programming; HP 9000 (Computer); C (Computer program language).

True FAST headings shown include: Natural language processing (Computer science); Information retrieval; Machine learning; Conceptual structures (Information theory); Computer software--Development; Computer-aided software engineering; Object-oriented programming (Computer science); Computer software--Quality control; Compiling (Electronic computers); Information visualization; Memory management (Computer science); Real-time data processing; Object-oriented methods (Computer science); Computer software--Reusability; Computational linguistics; Combinatorial analysis; Object-oriented programming languages; Classification; Software localization; User interfaces (Computer systems); Computer interfaces.

Overall: TP = 40, FP = 24, FN = 24; Pr = Re = F1 = 0.625.
Semi-Supervised Classification
Examples of ranked candidate lists (candidate, weight) presented for cataloguer review:

Doc 12049: "Occam's Razor: The Cutting Edge for Parser Technology" – ranked DDC candidates:
1. 005.43 (Systems programs) – 449.18
2. 005.453 (Compilers) – 429.04
3. 005.12 – 144.40
4. 510.7808 – 138.02
5. 005.26 – 105.59
6. 415 – 79.72
7. 001.6425 – 39.02
8. 004 – 36.43

Doc 287: "Clustering Full Text Documents" – ranked FAST candidates:
1. Bayesian statistical decision theory – 252.42
2. Bayesian statistical decision theory--Industrial applications – 223.09
3. Maximum entropy method – 223.09
4. Econometric models – 189.48
5. Economics, Mathematical – 188.43
6. Natural language processing (Computer science) – 176.14
7. Econometrics – 156.65
8. Distribution (Probability theory) – 120.64
9. Parsing (Computer grammar) – 102.73
10. Lexicology--Data processing – 101.40
11. Machine translating – 99.39
12. Text processing (Computer science) – 96.66
13. Information retrieval – 79.01
14. Semantic Web – 73.13
15. Probabilities – 71.00
16. Computational linguistics – 65.00
17. Machine learning – 60.14
18. Decision making – 50.30
19. Inference – 49.14
20. Interactive computer systems – 49.05
.
.
.
41. Mathematical physics – 25.26
Future Work
• Detecting Wikipedia topics in documents is computationally expensive.
• Eliminate the need for sending queries to WorldCat and for repeating topic detection on matching MARC records, by performing topic detection on a locally held FRBRized version of the WorldCat DB.
• Complement the topics extracted from the MARC records of a work catalogued in WorldCat with common terms and phrases from its content (as extracted by Google Books).
• Probabilistic mapping of Wikipedia concepts/articles to their corresponding DDCs and FASTs (already initiated by OCLC Research via VIAFbot, which maps Wikipedia biography articles to VIAF.org).
Thank You!
Questions…
For more information, please contact:
[email protected]
[email protected]
This work is supported by:
OCLC/ALISE Library & Information Science Research Grant Program
Irish Research Council 'New Foundations' Scheme