Computer-aided generation of
multiple-choice tests
Ruslan Mitkov
School of Humanities, Languages and Social Sciences
University of Wolverhampton, WV1 1SB
Email [email protected]
Structure of the presentation
Introduction
Premises
NLP-based methodology for construction of
multiple-choice test items

term extraction, distractor selection, question
generation
In-class experiments
Evaluation

efficiency, item analysis
Discussion
Forthcoming work
Introduction
Multiple-choice test: an effective way to
measure student achievements.
Computer-aided multiple-choice test
generation: an alternative to the labour-intensive manual task.
Novel NLP methodology employing shallow
parser, automatic term extraction, word sense
disambiguation, corpora, and WordNet.
Premises
Questions should focus on key concepts
Distractors should be as semantically close to
the correct answer as possible
Example

'Syntax is the branch of linguistics which studies
the way words are put together into sentences.' =>
Which branch of linguistics studies the way words
are put together into sentences?

o Pragmatics
o Syntax
o Morphology
o Semantics
NLP-based methodology (1)
[Pipeline diagram: narrative texts → term extraction → terms (key
concepts) → question generation (via transformational rules) and
distractor selection (via WordNet) → distractors → test items]
NLP-based methodology (2):
term extraction
Nouns and noun phrases are first identified
using the FDG shallow parser
Nouns with frequency over a threshold are
defined as 'key terms'
NPs featuring key terms as heads and
satisfying specific regular expressions are
considered terms (phrase ≠ adjectival
phrase/verb phrase…)
Terms serve as 'anchors' for generating test
questions
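The term-extraction step described above can be sketched as follows. This is a minimal illustration only: the real system uses the FDG shallow parser, whereas here the input is assumed to be already POS-tagged sentences, and the tag set, threshold, and (ADJ|N)* N pattern are simplifying assumptions.

```python
from collections import Counter

def extract_terms(tagged_sentences, freq_threshold=3):
    """Toy sketch of term extraction: nouns above a frequency threshold
    become 'key terms'; noun-phrase spans matching (ADJ|N)* N whose head
    (last noun) is a key term become terms."""
    noun_counts = Counter(
        w.lower() for sent in tagged_sentences for w, pos in sent if pos == "N"
    )
    key_terms = {w for w, c in noun_counts.items() if c >= freq_threshold}

    terms = set()
    for sent in tagged_sentences:
        i = 0
        while i < len(sent):
            if sent[i][1] in ("ADJ", "N"):
                # extend a maximal run of adjectives/nouns
                j = i
                while j < len(sent) and sent[j][1] in ("ADJ", "N"):
                    j += 1
                head_word, head_pos = sent[j - 1]
                if head_pos == "N" and head_word.lower() in key_terms:
                    terms.add(" ".join(w.lower() for w, _ in sent[i:j]))
                i = j
            else:
                i += 1
    return key_terms, terms
```

With a threshold of 3, a noun occurring three times (e.g. "syntax") is kept as a key term, and any NP it heads (e.g. "generative syntax") is collected as a term.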
NLP-based methodology (3):
selection of distractors
Semantic closeness: WordNet is consulted for
close terms
If too many are returned, those appearing in
the corpus are given preference
Example: the electronic textbook contains the
following noun phrases with modifier as
head: modifier that accompanies a noun,
associated modifier, misplaced modifier
Alternative: corpus search (NPs with the same
head but different modifiers are selected)
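The selection strategy above can be sketched in a few lines. The real system queries WordNet; here a hand-made dictionary of coordinate terms stands in for it, and the candidate lists and corpus are illustrative assumptions.

```python
from collections import Counter

# Toy stand-in for WordNet: coordinate terms grouped under a shared
# hypernym (the real system consults WordNet itself).
COORDINATES = {
    "syntax": ["morphology", "semantics", "pragmatics", "phonology", "phonetics"],
}

def select_distractors(answer, corpus_tokens, n=3):
    """Pick n semantically close distractors for the correct answer; if
    more than n candidates are available, prefer those that occur most
    often in the course corpus."""
    candidates = COORDINATES.get(answer, [])
    if len(candidates) <= n:
        return candidates
    freq = Counter(corpus_tokens)
    # rank candidates by corpus frequency, most frequent first
    return sorted(candidates, key=lambda t: -freq[t])[:n]
```

Given a corpus in which "morphology", "semantics", and "pragmatics" occur more often than "phonology" or "phonetics", those three are returned as distractors for "syntax", matching the example slide.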
NLP-based methodology (4):
generation of test questions
Eligible sentences:
 containing domain-specific terms
 having SVO or SV structure type

Examples of generation rules:
 S(term)VO => "Which H V O?", where H is a
hypernym of the term
 SVO(term) => "What do/does/did S V?"

Agreement rules
Genre-specific heuristics
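The two generation rules above can be sketched as string transformations. This is a rough illustration: the hypernym lookup table is a hypothetical stand-in for WordNet, and passing in a base-form verb lemma stands in for the agreement rules.

```python
# Hypothetical hypernym lookup (the real system consults WordNet).
HYPERNYMS = {"syntax": "branch of linguistics"}

def svo_to_question(subject, verb, obj, term_position, verb_lemma=None):
    """Sketch of the two transformational rules:
    S(term)VO -> 'Which H V O?', H being a hypernym of the subject term;
    SVO(term) -> 'What does S V?' (verb reduced to its base form)."""
    if term_position == "subject":
        h = HYPERNYMS[subject.lower()]
        return f"Which {h} {verb} {obj}?"
    if term_position == "object":
        return f"What does {subject} {verb_lemma or verb}?"
    raise ValueError("term must be in subject or object position")
```

Applied to the earlier example sentence, the S(term)VO rule yields "Which branch of linguistics studies the way words are put together into sentences?".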
In-class experiments
 A controlled set of 36 test items was introduced
(24 generated with the help of the program,
12 manually produced)
 45 undergraduate students took the test
 The system operates via the Questionmark
Perception web-based testing software
Example of a test item generated
(item 29 of 36)
Which kind of pronoun will agree with the
subject in number, person, and gender?
o second person pronoun
o indefinite pronoun
o relative pronoun
o reflexive pronoun
Post-editing
 Automatically generated test items were classed
as "worthy" (57%) or "unworthy" (43%)
 About 9% of the automatically generated
items did not need any revision
 Of the revisions needed: minor (17%),
fair (36%), and major (47%)
Evaluation
Efficiency of the procedure
Quality of the test items
Evaluation (2):
efficiency of the procedure
Efficiency:

                 items produced   time    avg time per item
computer-aided        300         540'         1'48''
manual                 65         450'         6'55''
Evaluation (3):
quality of the test items
Item analysis
 Item Difficulty (= C/T)
 Discriminating Power (= (CU - CL) / (T/2))
 Usefulness of the distractors (comparing no.
of students in upper and lower groups who
selected each incorrect alternative)
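The two item-analysis measures above are simple ratios; a minimal sketch (the function names and sample figures are illustrative, not from the experiment):

```python
def item_difficulty(correct, total):
    """Item Difficulty = C / T: the proportion of the class answering
    the item correctly."""
    return correct / total

def discriminating_power(correct_upper, correct_lower, total):
    """Discriminating Power = (CU - CL) / (T/2), where CU and CL count
    correct answers in the upper- and lower-scoring halves of the class."""
    return (correct_upper - correct_lower) / (total / 2)
```

For example, with a class of 40 in which 30 students answer an item correctly, the difficulty is 0.75; if 18 of the upper half but only 10 of the lower half are correct, the discriminating power is 0.4.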
Evaluation (4)

Item difficulty / item discriminating power:

                 avg item    too    too        avg discrim.   negative
                 difficulty  easy   difficult  power          discrim. power
computer-aided     0.75       3      0           0.4             1
manual             0.59       1      0           0.25            2

Usefulness of distractors:

                 poor   not useful   total   avg difference
computer-aided     6        3          65        1.92
manual            10        2          33        1.18
Discussion
Computer-aided construction of multiple-choice
test items is much more effective
than purely manual construction
The quality of test items produced with the help
of the program is not compromised in exchange
for time and labour savings
Forthcoming work: extensions to
other genres
The current project delivered a prototype in
the area of Linguistics, but the system will
be tuned to cover Chemistry, Biology,
Mathematics and Computer Science.
Forthcoming work:
other types of questions
Questions about properties or information
associated with the term (e.g. colour,
location, time, discoverer/author) will also be
generated.
‘Uranium was discovered in 1798 by Martin
Klaproth.’ =>
‘When was uranium discovered?’ or ‘Who
discovered uranium?’
‘Carbon dioxide is a colourless gas.’ =>
‘What is the colour of the gas carbon
dioxide?’
Forthcoming work:
other suitable types of distractors
• Distractors of the same semantic category
would be features placed close on a specific
property scale (e.g. time, colour) to the
correct answer.
• General (e.g. WordNet, Roget’s Thesaurus)
and/or domain-specific resources can be used
to provide such scales.
• Additional heuristics: preference for selecting
time expressions, colours, human proper
names etc. that also appear in the same
corpus/document.
Forthcoming work: extraction of
domain-specific feature patterns
Extraction of domain-specific patterns of the
form <term, feature1, … featureN>
Example: <chemical element (proper name
element), weight (number), colour (value
from colours set), location/found in (proper
name place), discoverer (proper name
human), time of discovery (temporal
expression)>.
Represented as typed-feature structures;
basis for restricted domain ontologies.
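The <term, feature1, … featureN> pattern for the chemical-element example could be rendered as a typed feature structure along these lines. The class name and Python encoding are assumptions for illustration; the feature inventory follows the slide, and the uranium values come from the earlier example sentence.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChemicalElementFrame:
    """Illustrative typed feature structure for the chemical-element
    pattern; unfilled features default to None."""
    name: str                          # proper name (element)
    weight: Optional[float] = None     # number
    colour: Optional[str] = None       # value from a colours set
    found_in: Optional[str] = None     # location (proper name, place)
    discoverer: Optional[str] = None   # proper name (human)
    discovered: Optional[str] = None   # time of discovery (temporal expr.)

# filled from 'Uranium was discovered in 1798 by Martin Klaproth.'
uranium = ChemicalElementFrame(name="uranium",
                               discoverer="Martin Klaproth",
                               discovered="1798")
```

Collections of such frames could then serve as the basis for the restricted domain ontologies mentioned above.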
Forthcoming work: more
sophisticated term extraction
Based on the statistical and linguistic
properties of the terms
Statistical scores: (relative) frequency, tf.idf,
mutual information
Different types of term variations (Jacquemin
1999, Ha 2003b)
Part-of-speech patterns (Justeson and Katz
1996)
“Knowledge patterns” (Meyer 2001; Ha
2003a)
Machine learning methods will be employed
Forthcoming work: wider coverage
question generation grammar
ML methods will be experimented with
to improve the variety of the
transformational rules
End-of-chapter questions and sentences
containing these answers will be
automatically aligned
(Semi-)automatic alignment at word
and phrase level will also be performed
Forthcoming work: experiments with
other similarity measures
Statistical, corpus-based methods to
mine for close concepts/words (Pekar
2002, 2003)
Recent thesaurus-based similarity
approaches (Budanitsky and Hirst 2001,
Jarmasz and Szpakowicz 2003).
By-products
Bank of test items
Restricted domain-specific ontologies
Other future plans
Offer the option to generate a long list of
distractors with the user choosing among
them
Impact of the program on professional test
developers
Agreement among post-editors
Computer-Aided Generation of
Multiple-Choice Tests
Thank you