Department of Electronic and Computer Engineering
Automatic Subject Classification of Textual Documents Using Limited or No
Training Data
Arash Joorabchi
Supervised by Dr. Abdulhussain E. Mahdi
Submitted for the degree of Doctor of Philosophy
10/11/2010
Outline
• Introduction to ATC
• Motivation, Aim, and Objectives
• Bootstrapping ML-based ATC systems (Ch.3)
• Bibliography-Based ATC method (BB-ATC) (Ch.4)
• Enhanced BB-ATC for Automatic Classification of
Scientific Literature in Digital Libraries (Ch.5)
• Citation Based Keyphrase Extraction (CKE) (Ch.6)
• Conclusion & Future Work
Introduction
• Automatic Text Classification/Categorization (ATC)
– Automatic assignment of natural language text documents to one or
more predefined classes/categories according to their contents.
• Applications include:
– Spam filtering
– Web information retrieval, e.g., filtering, focused crawling, web
directories, subject browsing
– Organising digital libraries
• Common Methods:
– Rule-based Knowledge Engineering (until late 1980s)
– Machine Learning (since 1990s)
ML Approach to ATC
• Common ML algorithms used for ATC
– Naïve Bayes (based on Bayes' theorem)
– k-Nearest Neighbors (k-NN)
– Support Vector Machines (SVM) [Vapnik, V 1995]
• SVM is reported to yield the best prediction accuracy [Joachims, T 1998].
• However, the accuracy of ML-based ATC systems depends on many
parameters such as:
– Quantity and quality of training documents
– Document representation models, e.g., bag-of-words vs. bag-of-phrases
– Term weighting mechanisms, e.g., binary vs. multinomial (burstiness phenomenon)
– Feature reduction and selection methods, e.g., document frequency vs. information gain.
• Therefore, the choice of the best classification algorithm highly depends on
the characteristics of the ATC task at hand [Hand, D. J. 2006].
Motivation, Aim, and Objectives
• What if there is limited or no training data? (e.g., 100 classes & 200
samples per class)
• Our aim was to alleviate this problem by pursuing two lines of
research:
i. Investigating bootstrapping methods to automate the process of
building labelled corpora for training ML-based ATC systems.
ii. Investigating a new unsupervised ATC algorithm which does not
require any training data.
• In order to realise this aim, we have focused on utilising two sources
of data whose application in ATC had not been fully explored before:
a) Conventional library organisation resources such as library
classification schemes, controlled vocabularies, and catalogues
(OPACs).
b) Linkage among documents in the form of citation networks.
An Overview of the Developed Syllabus Repository System
Development of a National Syllabus Repository for Higher Education in Ireland
• Goal: Collecting unstructured electronic syllabus documents from
participating higher education institutes into a metadata-rich
central repository.
• Extended the ISCED scheme
• 482B - Science, Mathematics and Computing/Computing/Information
Systems/Databases
• Naïve Bayes Classification algorithm [Tom Mitchell 1997]
• A new Web-based bootstrapping method
[Figure: system architecture. Components shown: FTP server, zip packages, Hot-Folder Application, Pre-processing (Open Office, Xpdf, PDFTK), Information Extractor built on GATE (Program Document Segmenter, Module Syllabus Segmenter, Named Entity Extractor, Thesaurus of segment headings and entity names), Classifier (Classification Scheme, Web Search API), Post Processing, Meta-data generator module, Repository Database, Web.]
Web-based Bootstrapping Process
1. A list of subject fields (leaf nodes) in the classification scheme is compiled.
2. For each subject field in the list, a web search query is created from
the caption of the subject field and the keyword “syllabus” and submitted to
the Yahoo search engine using the Yahoo search SDK.
3. The first hundred URLs in the returned results for each query are passed
to the GATE toolkit [Cunningham et al. 2002], which downloads all
corresponding files (in HTML, TXT, PDF, or MS-Word formats) and extracts
and tokenizes their textual contents.
4. The tokenised texts are converted to feature/word vectors, which are then used to
train the classifier for classifying syllabus documents at the subject-field
level.
5. The subject-field word vectors are also used in a bottom-up fashion to
construct word vectors for the fields which belong to the higher levels of
the hierarchy (p.52).
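Steps 2 and 5 of the bootstrapping process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the parent of a code is assumed to be the code minus its last character (one character per level, as in "482B"), the sibling code "482A" is invented for the example, and summing child vectors is one plausible way of building the higher-level vectors rather than the method confirmed by the slides.

```python
from collections import Counter


def build_query(caption: str) -> str:
    """Step 2: combine a subject-field caption with the keyword 'syllabus'."""
    return f"{caption} syllabus"


def word_vector(tokens):
    """Step 4 (simplified): a bag-of-words vector with case-folded tokens."""
    return Counter(token.lower() for token in tokens)


def roll_up(leaf_vectors):
    """Step 5 (assumed aggregation): add each leaf vector to all its ancestors.

    The ancestor of a code is taken to be the code without its last
    character, e.g. 482B -> 482 -> 48 -> 4.
    """
    all_vectors = {code: Counter(vec) for code, vec in leaf_vectors.items()}
    for code, vec in leaf_vectors.items():
        ancestor = code[:-1]
        while ancestor:
            all_vectors.setdefault(ancestor, Counter()).update(vec)
            ancestor = ancestor[:-1]
    return all_vectors


# Hypothetical leaf-level vectors for two subject fields.
leaves = {
    "482B": word_vector(["SQL", "database", "SQL"]),
    "482A": word_vector(["network", "database"]),
}
vectors = roll_up(leaves)
```

Here `vectors["482"]` holds the combined counts of both subject fields, so a document can be scored against any level of the hierarchy with the same classifier.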
Evaluation and Experimental Results
• Test dataset contains 100 undergraduate syllabus documents and 100
postgraduate syllabus documents from 5 participating HE institutes in Ireland
• The micro-average precision achieved by the classifier for undergraduate
syllabi is 0.75, compared to 0.60 for postgraduate syllabi.
• Information extraction performance:
                    Micro-avg.   Micro-avg.   Micro-avg.
                    Precision    Recall       F1
Named Entities      0.94         0.74         0.82
Topical Segments    0.84         0.72         0.77
Results published in:
– The proceedings of the 12th European Conference on Research and
Advanced Technology for Digital Libraries, ECDL 2008; and
– The Electronic Library, 27, 4 (2009).
Overview of the Developed ATC System
Bootstrapping ML-based ATC Systems Utilizing Public Library Resources
• A dynamic ML-based ATC system that can be adopted for a wide range
of ATC tasks with minimum customization required.
• Dewey Decimal Classification (DDC) scheme.
• Small parts of books, such as back covers and editorial reviews, are used for
training.
• Transformed Weight-normalized Complement Naive Bayes
(TWCNB) [Rennie et al., 2003].
• A linear SVM classifier called LIBLINEAR [Lin, C.-J et al., 2008]
[Figure: system architecture. Components shown: LOC OPAC (Z3950 API), Internet (HttpClient API), Classification Scheme, Training Corpus Builder, Training Dataset Builder (NB Corpus Builder, SVM Corpus Builder, general and specific stop words), Classifier (TWCNB, LIBLINEAR), GATE, Unlabeled Texts.]
Bootstrapping module
Data Mining Process
Retrieve a list of books from LOC’s catalogue which are classified into
this category.
Extract a list of ISBNs and use them to retrieve the books’ descriptions
from Amazon:
http://amazon.com/gp/product/ISBN-VALUE
Parsed Book Description Text
• Product Description
– Editorial Reviews:
The Deitels' groundbreaking How to Program series offers unparalleled breadth and
depth of object-oriented programming concepts and intermediate-level topics for further
study. The Seventh Edition has been extensively fine-tuned and is completely up-to-date with Sun Microsystems, Inc.’s latest Java release - Java Standard Edition 6
(“Mustang”) and several Java Enterprise Edition 5 topics. Contains an extensive
OOD/UML 2 case study on developing an automated teller machine. Takes a new
tools-based approach to Web application development that uses Netbeans 5.5 and
Java Studio Creator 2 to create and consume Web Services. Features new AJAX-enabled, Web applications built with JavaServer Faces (JSF), Java Studio Creator 2
and the Java Blueprints AJAX Components. Includes new topics throughout, such as
JDBC 4, SwingWorker for multithreaded GUIs, GroupLayout, Java Desktop Integration
Components (JDIC), and much more. A valuable reference for programmers and
anyone interested in learning the Java programming language.
http://amazon.com/gp/product/0132222205
Evaluation and Experimental Results
• 20-Newsgroup-18828 dataset - a collection of 18,828 newsgroup articles,
partitioned across 20 different newsgroups.
• Eight classes in 20-Newsgroup were mapped to their corresponding classes
in the Dewey Decimal Classification scheme (the remaining classes were inapplicable,
e.g., misc.forsale).

Newsgroup                Dewey Number   Dewey Caption                   No. of training texts collected
sci.space                520            Astronomy and allied sciences   810
rec.sport.baseball       796.357        Baseball                        997
rec.autos                796.7          Driving motor vehicles          587
rec.motorcycles          796.7          Driving motor vehicles          587
soc.religion.christian   230            Christian theology              1043
sci.electronics          537            Electricity and electronics     713
rec.sport.hockey         796.962        Ice hockey                      270
sci.med                  610            Medicine and health             1653
Evaluation and Experimental Results (Cont.)
Newsgroup                Bootstrapped TWCNB   Standard TWCNB
                         Precision %          Precision %
sci.space                69.19                94.94
rec.sport.baseball       96.78                93.96
rec.autos                74.74                91.91
rec.motorcycles          71.02                94.97
soc.religion.christian   89.36                96.0
sci.electronics          69.92                78.17
rec.sport.hockey         75.77                98.5
sci.med                  76.23                96.96
Avg.                     77.87                93.17
• The average precision of the bootstrapped TWCNB is about 15 percentage points lower than that of
the standard TWCNB (77.87% vs. 93.17%).
• The LIBLINEAR classifier, with an average precision of 68%, turned out to
be considerably less accurate than TWCNB in this task.
• Results published in:
– The proceedings of the 19th Irish Conference on Artificial Intelligence and Cognitive
Science (AICS08).
Leveraging the Legacy of Conventional Libraries for
Organizing Digital Libraries
Can we utilize the classification metadata of books referenced in a
syllabus document to classify it?
They are already classified by expert library cataloguers
according to DDC and LCC classification schemes.
Tapping into:
• The intellectual work that has been put into developing and
maintaining library classification systems over the last century.
• The intellectual effort of expert cataloguers who have manually
classified millions of books and other resources in libraries.
Bibliography-based ATC method
BB-ATC is based on automating the following processes:
1. Identifying and extracting references in a given document.
2. Searching catalogues of physical libraries for the extracted
references in order to retrieve their classification metadata.
3. Allocating one or more classes to the document based on the retrieved
classification categories of the references, with the help of a
weighting mechanism.
The method has similarities to the k-Nearest Neighbour (k-NN) algorithm.
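The three BB-ATC steps can be illustrated with a minimal Python sketch. It rests on several assumptions: the catalogue is a hypothetical in-memory dictionary standing in for a real library catalogue search, only 10-character ISBNs are recognised, and the weight of a class is simply its share of the catalogued references (so that 0 < w ≤ 1), which may differ from the weighting mechanism actually used. The function names and the `min_weight` threshold are illustrative.

```python
import re
from collections import Counter

# Ten-character ISBNs (nine digits plus a digit or X check character).
ISBN10 = re.compile(r"\b\d{9}[\dXx]\b")


def extract_isbns(text):
    """Step 1: find ISBN-10 identifiers, ignoring hyphens inside them."""
    found = ISBN10.findall(text.replace("-", ""))
    # Upper-case the check character and drop duplicates, keeping order.
    return list(dict.fromkeys(isbn.upper() for isbn in found))


def classify(text, catalogue, min_weight=0.3):
    """Steps 2 and 3: look up each reference and weight the classes found.

    `catalogue` maps ISBN -> DDC number (a stand-in for a catalogue search).
    Returns (DDC number, weight) pairs whose weight is at least `min_weight`.
    """
    votes = Counter(catalogue[isbn] for isbn in extract_isbns(text) if isbn in catalogue)
    total = sum(votes.values())
    if total == 0:
        return []  # none of the references is catalogued
    return [(ddc, n / total) for ddc, n in votes.most_common() if n / total >= min_weight]


# Hypothetical catalogue entries and reference list.
catalogue = {"0123797721": "006.37", "0123797772": "006.37", "0792378504": "621.367"}
syllabus = "Texts: ISBN 0-12-379772-1; ISBN 0123797772; ISBN 0792378504; ISBN 9999999999"
labels = classify(syllabus, catalogue)
```

As in k-NN, the references act as already-labelled neighbours whose classes are combined by voting; the difference is that the neighbours are given by the document's own bibliography rather than by a similarity search.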
Bibliography-based ATC
Advantages over ML-based ATC systems:
• Library classification schemes are regularly updated and
contain thousands of classes in every field of knowledge.
• No training data is needed.
• Performance is not adversely affected by the large number of
classes in DDC and LCC.
• New books are catalogued every day, and hence there is no concept
drift.
BB-ATC Implementation
• The DDC classification scheme was adopted because of its
worldwide usage and hierarchical structure.
• Multi-label classification by assigning weights (0<w≤1) to
candidate DDC classes and LCSHs.
• The JZkit Java API is used to communicate with the libraries’
OPAC catalogues through the Z39.50 protocol.
• JRegex/JAPE for extracting ISBNs/ISSNs.
[Figure: processing pipeline. Syllabi DB (*.PDF, *.HTM, *.DOC) → Pre-processing (Open Office, Xpdf) → document’s content in plain text → Information Extractor (GATE) → extracted reference identifiers (ISBNs/ISSNs) → Catalogue-Search (LOC Catalogue, BL Catalogue) → references’ DDC class numbers and LCSHs → Classifier → weighted list of chosen DDC class(es) & LCSHs.]
BB-ATC Evaluation
Test dataset: 100 computer science related syllabus documents.
[Figure: first page of the full experimental results document, “Leveraging the Legacy of Conventional Libraries for Organizing Digital Libraries” (Arash Joorabchi, Abdulhussain E. Mahdi, Department of Electronic and Computer Engineering, University of Limerick). The proposed ATC system was used to automatically classify 100 syllabus documents which mainly belong to the field of computer science. The validity and correctness of each assigned DDC class label was examined manually by an expert cataloguer. Class descriptions are taken from the WebDewey website (DDC 22).]
Full results available online at: www.csn.ul.ie/~arash/PDFs/1.pdf
Legend:
TP  True Positive
FP  False Positive
FN  False Negative
NC  Not Catalogued: the referenced item is not catalogued in either the Library of Congress or
    British Library catalogues.
CE  Cataloguer’s Error: the cataloguers in either the Library of Congress or British Library
    have classified the item into the wrong class (manual classification error) or they have
    labelled the item with an invalid class number (data entry error).
Classification results summary (micro-averaged performance measures):
TP    FP    FN    Precision   Recall   F1
210   19    26    0.917       0.889    0.902
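The micro-averaged measures in the summary can be reproduced from the pooled TP, FP and FN counts. The sketch below uses the standard definitions of precision, recall and F1; the function names are illustrative and the per-document counts in the second example are made up.

```python
def micro_prf(tp, fp, fn):
    """Precision, recall and F1 from pooled counts (standard definitions)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_average(per_document_counts):
    """Micro-averaging: sum the (tp, fp, fn) counts over all documents first."""
    tp = sum(counts[0] for counts in per_document_counts)
    fp = sum(counts[1] for counts in per_document_counts)
    fn = sum(counts[2] for counts in per_document_counts)
    return micro_prf(tp, fp, fn)


# Pooled counts reported for the 100 syllabus documents.
precision, recall, f1 = micro_prf(tp=210, fp=19, fn=26)
# Made-up per-document counts, pooled before computing the measures.
small = micro_average([(2, 0, 1), (1, 1, 0)])
```

With the reported counts the computed values agree with the table to within rounding in the third decimal place.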
Author                Method   Data Set                                            Classification Scheme                     F1
Pong et al. (2007)    K-NN     505 training & 254 testing documents (web pages)    67 classes from LCC                       0.80
Pong et al. (2007)    NB       505 training & 254 testing documents (web pages)    67 classes from LCC                       0.54
Chung et al. (2003)   K-NN     1889 training & 623 test documents                  575 subclasses of the DDC main class of   0.92
                               (economics-related web pages)                       economics
BB-ATC                         100 computer science related syllabi                Full DDC scheme                           0.90
Results published in:
• The proceedings of the 13th European Conference on Research and Advanced Technology for
Digital Libraries, (ECDL 2009). (Granted the best student paper award)
Enhanced BB-ATC method for Automatic Classification
of Scientific Literature in Digital Libraries
• 1800 publications a day in biomedical science!
• The CiteSeer digital library is used as the experimental
platform (~1 million records).
• CiteSeer infrastructure is fully open source and supports
OAI-PMH.
• Using the Google Book Search database for mining citation
networks.
• Using OCLC’s WorldCat, a union catalogue of 70,000
libraries around the world.
[Figure: processing pipeline. CiteSeer OAI & BibTex records (eXist-DB) → Pre-processing → chosen document’s metadata records and list of references → Data mining (Google Book Search, WorldCat Catalogue) → pool of DDC numbers potentially related to the document → Inferring → probabilistically chosen DDC number for the document.]
Data mining process
[Figure: data mining process. The document’s metadata (title, authors, abstract) and its references R1 ... Rn (titles) are used to query Google Book Search, which returns a list of publications citing each of R1 ... Rn. Each citing publication P1 ... Pn is identified by its ISBN, which is used to retrieve its DDC number from the WorldCat catalogue.]
Sample Data mining Results (Cont.)
Document’s Title: Statistical Learning, Localization, and Identification of
Objects. (has only one reference)
This work describes a statistical approach to deal with learning and recognition
problems in the field of computer vision
Citing publications:
No.  ISBN        DDC No.     No.  ISBN        DDC No.
1.   0123797721  006.3/7     9.   3540250468  629.8932
2.   0123797772  006.3/7     10.  3540629092  006.4/2
3.   0769501648  Null        11.  3540634606  006.4/2
4.   0780350987  006.3/7     12.  3540639314  621.36/7
5.   0780399781  Null        13.  3540646132  006.3/7
6.   0792378504  621.36/7    14.  3540650806  006.3
7.   0818681845  621.367     15.  389838019X  005.1/18
8.   1558605835  Null

Reference’s Title: Learning Object Recognition Models from Images
Citing publications:
No.  ISBN        DDC No.     No.  ISBN        DDC No.
1.   0120147734  537.5/6     8.   3540433996  629.8/92
2.   0195095227  006.3/7     9.   3540617507  006.3/7
3.   0780399773  Null        10.  3540634606  006.4/2
4.   0818638702  621.39/9    11.  3540636366  006.7
5.   1586032577  006.3       12.  3540667229  006.3/7
6.   1848002785  621.367     13.  389838019X  005.1/18
7.   3540282262  006.3       14.  3540404988  006.3/7

Frequency of DDC numbers (and their prefixes) among the citing publications:
DDC No.   Freq
0         17
6         7
00        17
006       15
005       2
0063      11
0064      3
00637     8
621367    4

Relevant part of the DDC hierarchy:
Level 1: 0 Computer science, information & general works
Level 2: 00 Computer science, knowledge & systems
Level 3: 006 Special computer methods
Level 4: 006.3 Artificial intelligence
         006.4 Computer pattern recognition
Level 5: 006.31 Machine learning
         006.32 Neural nets (Neural networks)
         006.33 Knowledge-based systems
         006.35 Natural language processing (NLP)
         006.37 Computer vision
         006.42 Optical pattern recognition
         006.45 Acoustical pattern recognition
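The frequency table on this slide can be reproduced by counting every prefix of every retrieved DDC number. The sketch below is an illustration rather than the thesis implementation: the function names are hypothetical, `None` stands for the "Null" entries, and the dots and slashes in the catalogue notation are simply dropped before counting.

```python
from collections import Counter


def normalise(ddc):
    """Drop the decimal point and segmentation slash, e.g. 006.3/7 -> 00637."""
    return ddc.replace(".", "").replace("/", "")


def prefix_frequencies(ddc_numbers):
    """Count how many retrieved DDC numbers start with each digit prefix."""
    counts = Counter()
    for ddc in ddc_numbers:
        if ddc is None:  # publication without a DDC number ("Null")
            continue
        digits = normalise(ddc)
        for depth in range(1, len(digits) + 1):
            counts[digits[:depth]] += 1
    return counts


# DDC numbers of the publications citing the document (15 entries).
document_citers = ["006.3/7", "006.3/7", None, "006.3/7", None, "621.36/7",
                   "621.367", None, "629.8932", "006.4/2", "006.4/2",
                   "621.36/7", "006.3/7", "006.3", "005.1/18"]
# DDC numbers of the publications citing its single reference (14 entries).
reference_citers = ["537.5/6", "006.3/7", None, "621.39/9", "006.3", "621.367",
                    "006.3", "629.8/92", "006.3/7", "006.4/2", "006.7",
                    "006.3/7", "005.1/18", "006.3/7"]

freq = prefix_frequencies(document_citers + reference_citers)
```

Each prefix corresponds to one node of the DDC hierarchy, so the counts show how strongly the citing publications support each class at each depth, for example `freq["0063"]` for 006.3 Artificial intelligence and `freq["00637"]` for 006.37 Computer vision.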
Inference & Visualization
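[Equations: weighting of candidate DDC numbers, based on the same concept as TFIDF weighting. Frequency measures over the m lists R_1 ... R_m:]
\[ GF(DDC_i) = \sum_{j=1}^{m} [DDC_i \in R_j] \]
\[ ULF(DDC_i) = \sum_{j=1}^{m} Freq(DDC_i, j) \]
\[ NLF(DDC_i) = \sum_{j=1}^{m} \frac{Freq(DDC_i, j)}{|R_j|} \]
CW(DDC_i) is computed from GF(DDC_i), NLF(DDC_i), ULF(DDC_i), the constant 10, and the depth of DDC_i in the hierarchy.
D_S(cn) relates CW(cn) to CW(pn), with the constant 20.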
Evaluation Results
Test dataset contains 1000 research documents divided into 5 groups
according to their number of references.
All documents:
Micro-avg. Pr   Micro-avg. Re   Micro-avg. F1
0.84            0.78            0.81
By number of references:
No. of references   Micro-avg. Pr   Micro-avg. Re   Micro-avg. F1
0                   0.718           0.523           0.605
4                   0.842           0.820           0.831
8                   0.843           0.829           0.836
16                  0.880           0.860           0.870
32                  0.891           0.880           0.886
[Figure: micro-averaged precision, recall, and F1 plotted against the number of references (0, 4, 8, 16, 32).]
Evaluation Results (cont.)
Number of documents classified in each level of the DDC hierarchy and the corresponding
micro-averaged performance measures.
Level   No. of Docs   % of Docs   Micro-avg. Pr   Micro-avg. Re   Micro-avg. F1
1       1000          100%        0.94            0.89            0.91
2       1000          100%        0.92            0.87            0.89
3       1000          100%        0.84            0.80            0.82
4       1000          100%        0.81            0.77            0.79
5       950           95%         0.75            0.66            0.70
6       394           39.4%       0.68            0.63            0.65
7       50            5%          0.59            0.57            0.58
8       20            2%          0.62            0.55            0.58
9       4             0.4%        0.59            0.83            0.69
[Figure: No. of Docs (%), micro-averaged precision (%), recall (%), and F1 (%) plotted against DDC hierarchy level (1 to 9).]
http://www.skynet.ie/~arash/BB-ATC1/HTML/
Article under review in the journal Information Processing & Management Elsevier
BB-ATC Approach Applied to the Problem of Keyphrase
Extraction from Scientific Literature
• Keyphrases (multi-word units) describe the content of research documents
and they are usually assigned by the authors.
• The task of automatically assigning keyphrases to a document is called
keyphrase indexing.
• Considered a form of ATC (ML-based, multi-label) and approached as such
• Free indexing vs. indexing with a controlled vocabulary (e.g., LCSH, MeSH)
• Extraction indexing vs. assignment indexing
Citation Based Keyphrase Extraction (CKE)
1. Reference extraction using ParsCit [Councill, I. G, et al. 2008] (CRF,F1=0.93)
2. Mining the Google Book Search (GBS) database (>10 million archived items)
for candidate terms (i.e., Google word clouds)
3. Term weighting & selection
[Figure: data mining process. The document’s metadata (title t, authors, abstract) and its references R1 ... Rn (titles) are used to query Google Book Search for the lists of publications citing t and citing each of R1 ... Rn. Each citing publication P1 ... Pn is identified by its ISBN and contributes its key terms.]
Term Weighting and Selection
Google Word Cloud (GWC): Google uses TFIDF plus some heuristic rules to emphasize
proper nouns (names, locations, etc.)
[Figure: GWC for a book titled “Data mining: practical machine learning tools and techniques”]
Normalization includes: stopword removal, punctuation removal,
abbreviation expansion, case-folding, and stemming (Porter2 [Porter 2002])
Keyphraseness score of each candidate term t:
[Equation: K(t), a combination of base-2 logarithmic terms over GF(t), FO(t), LF(t), NW(t), RF(t), NC(t), and ADI(t)]
Evaluation & Experimental Results
wiki-20 Test dataset [Medelyan et al., 2009]
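20 computer science research papers, each manually indexed by 15 different
human teams (teams of 2).
Rolling’s inter-indexer consistency formula is adopted, which is equivalent to the F1
measure:
\[ \text{Inter-indexer consistency} = \frac{2C}{A + B} \]
where A and B are the numbers of terms assigned by the two indexers and C is the number of terms they have in common.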
Evaluation & Experimental Results (cont.)
Performance of the CKE algorithm compared to human indexers and competitive methods.
Type           Method                                          No. of keyphrases assigned   Inter-consistency (%)
                                                               to each document             Min.    Avg.    Max.
Manual         Human indexing (gold standard)                  Varied                       21.4    30.5    37.1
Supervised     KEA (Naïve Bayes)                               Static - 5                   15.5    22.6    27.3
Supervised     Maui (Naïve Bayes & all features)               Static - 5                   22.6    29.1    33.8
Supervised     Maui (Bagged Decision Trees & all features)     Static - 5                   25.4    30.1    38.0
Supervised     Maui (Bagged Decision Trees & best features)    Static - 5                   23.6    31.6    37.9
Unsupervised   Grineva et al.                                  Static - 5                   18.2    27.3    33.0
Unsupervised   CKE (condition A)                               Static - 5                   22.7    30.6    38.3
Unsupervised   CKE (condition B)                               Static - 6                   26.0    31.1    39.3
Unsupervised   CKE (condition C)                               Varied - the same as         22.0    30.5    38.7
                                                               assigned by human indexers
To appear in Journal of Information Science 36, 6 (December 2010).
Published online before print November 5, 2010
Conclusion & Future Work
The main contribution of this work is the design, development, and evaluation
of an alternative approach to ATC by utilizing two new knowledge/data
sources:
i. Conventional library classification schemes.
ii. Citation networks among documents.
The proposed approach addresses two major issues:
a) Lack of a standard and comprehensive classification scheme for ATC
b) Lack of training data
Future work includes:
• BB-ATC: mining the citing documents as well as the cited ones; multi-label
classification.
• CKE: utilizing LCSH and user-assigned keyphrases of cited and citing documents.
• Applying the underlying theory of BB-ATC to the ACM DL and ACM’s Computing Classification
System (ACM CCS).
• BB-ATC & CKE as an automatic metadata generator plug-in for scientific DLs such as
RIAN, Ireland’s National Research Portal, and NDLTD (Networked Digital Library of Theses
and Dissertations)
Development of a National Syllabus Repository for
Higher Education in Ireland
• Goal: Collecting unstructured electronic syllabus documents from
participating higher education institutes into a metadata-rich central
repository.
• Challenges:
– Information Extraction:
• Syllabus documents have arbitrary sizes, formats, and layouts;
• contain multiple module descriptions (e.g., programme documents);
• contain complex layout features (e.g., hidden/nested tables).
– Automatic Classification:
• Lack of a suitable standard education classification scheme for higher
education in Ireland.
• Lack of training data
Classifier
• Classification scheme:
– An enhanced version of the International Standard Classification of Education
(ISCED).
– 3 levels of classification: broad field (9), narrow field (25), and detailed field (80),
each represented by a digit in a hierarchical fashion.
– We have extended this by adding a fourth level of classification, subject field,
represented by a letter in the classification coding system, from the Australian
Standard Classification of Education (ASCED).
– Example: “482B” Science, Mathematics and Computing/Computing/Information
Systems/Databases
• Naïve Bayes Classification algorithm [Tom Mitchell 1997]
• A Web-based bootstrapping method
Programme Document Segmenter (PDS)
[Figure: a Definitive Programme Document (MSc in Business Management: Introduction, Programme Structure, Module 1: BM3222 Leadership Management, ...) is segmented into individual Module Syllabus documents.]
Module Syllabus Segmenter (MSS)
Extracting topical segments of each individual syllabus.
[Figure: a Module Syllabus divided into a Header segment, an Aim & Objectives segment, and a Learning Outcomes segment.]
Named Entity Extractor (NEE)
• Extracts a set of common named entities/attributes such as module
code, module name, module level, number of credits, pre-requisites
and co-requisites from the header segment of syllabi.
Sample header segment:
CODE: CE 4701          TYPE: Core
Module: Computer Software 1
GRADING TYPE: Normal   CREDITS: 3
PRE_REQUISITES: None
AIMS/OBJECTIVES
To familiarise the student with the use of a computer and typical applications
software. To introduce a high-level language, typically Pascal, as a concrete
formalism for the representation of algorithms in a machine-readable form.
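The kind of extraction performed on a header segment can be sketched with plain regular expressions. This is a swapped-in, simplified technique for illustration only (the slides describe an extractor built on GATE), and the field names and patterns below are hypothetical, written to fit the sample header on this slide.

```python
import re

# Hypothetical patterns, one per named entity in the sample header.
HEADER_PATTERNS = {
    "code": re.compile(r"CODE:\s*([A-Z]{2}\s?\d{4})"),
    "name": re.compile(r"Module:\s*(.+)"),
    "credits": re.compile(r"CREDITS:?\s*(\d+)"),
    "prerequisites": re.compile(r"PRE_REQUISITES:\s*(\S.*)"),
}


def extract_entities(header):
    """Return the named entities found in a syllabus header segment."""
    entities = {}
    for field, pattern in HEADER_PATTERNS.items():
        match = pattern.search(header)
        if match:
            entities[field] = match.group(1).strip()
    return entities


sample_header = (
    "CODE: CE 4701\n"
    "Module: Computer Software 1\n"
    "CREDITS: 3\n"
    "PRE_REQUISITES:None\n"
)
entities = extract_entities(sample_header)
```

Fixed patterns like these only work when headers follow one layout; syllabi with arbitrary formats are the reason a more general information extraction pipeline is needed.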
Bootstrapping ML-based ATC Systems Utilizing Public
Library Resources
• Developing a dynamic ML-based ATC system that can be adopted for a wide
range of ATC tasks with minimum effort required from users.
• Users select a set of categories from a comprehensive standard
classification scheme, and a bootstrapping method is used to automatically
build a training dataset accordingly.
• Three main components:
– Universal Classification Scheme
– Training Corpus Builder (bootstrapper)
– ML-based Classification Algorithm
ATC System Components
• Universal Classification Scheme
– Acts as a pool of categories/classes that can be selectively adopted by the users to create their own
classification scheme.
– Dewey Decimal Classification (DDC), with thousands of classes, has been used in conventional libraries
for over a century to categorize library materials.
– DDC is used in about 80% of libraries around the world and has a fully hierarchical structure (vs. LCC)
• Training Corpus Builder
– Textual items classified according to DDC are not available in an electronic format and/or are copyrighted.
– Alternatively, we use small parts of books, such as topics covered, the back cover, and editorial
reviews, publicly available on booksellers’ websites such as Amazon.
– Short text (~500 words) containing semantically-rich terms used to summarize the book.
• Classification algorithms
– We implemented an optimized version of NB called Transformed Weight-normalized Complement Naive
Bayes (TWCNB) [Rennie et al., 2003].
– A linear SVM classifier called LIBLINEAR [Lin, C.-J et al., 2008], which is an optimised implementation of
SVM suitable for large linear classification tasks with thousands of features, such as ATC.
Evaluation & Experimental Results
The number of extracted references per document ranges from 10 to 79,
with an average of 25.9 references per document.
The number of retrieved GWCs per document ranges from 62 to 766,
with an average of 271 GWCs per document.
In total, the data mining unit has retrieved the metadata records of 5,576
publications from GBS, each of which cites either one of the documents in the wiki-20
collection or one of their references, and almost all of these records (5,421,
97.14%) contain a word cloud.
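\[ Pr(c_i) = \frac{\text{Number of correctly assigned class labels}}{\text{Total assigned}} = \frac{TP_i}{TP_i + FP_i} \]
\[ Re(c_i) = \frac{\text{Number of correctly assigned class labels}}{\text{Total possible correct}} = \frac{TP_i}{TP_i + FN_i} \]
[Figure: worked example with assigned class 006.3 and correct class 006.4. The shared digits 0, 0, 6 are marked TP, the assigned digit 3 is marked FP, and the correct digit 4 is marked FN.]
F-score
The weighted harmonic mean of precision and recall, the traditional F-measure
or balanced F-score, is:
\[ F_1(c_i) = \frac{2\,Pr(c_i)\,Re(c_i)}{Pr(c_i) + Re(c_i)} \]
This is also known as the F1 measure, because recall and precision are evenly
weighted. The general formula for non-negative real β is:
\[ F_\beta(c_i) = \frac{(1 + \beta^2)\,Pr(c_i)\,Re(c_i)}{\beta^2\,Pr(c_i) + Re(c_i)} \]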
[Joachims, 1997]
Outline
• Introduction to Automatic Text Classification (ATC)
• Motivation, Aim, and Objectives
• Bootstrapping ML-based ATC systems
• Leveraging Conventional Library resources for Organizing Digital
Libraries (BB-ATC)
• Enhanced BB-ATC for Automatic Classification of Scientific
Literature in Digital Libraries
• BB-ATC Approach Applied to the Problem of Keyphrase Extraction
from Scientific Literature
• Conclusion & Future Work