Measurement, test, evaluation
• Measurement: process of quantifying the
characteristics of persons according to explicit
procedures and rules
• Quantification: assigning numbers, distinguishable
from qualitative descriptions
• Characteristics: mental attributes such as aptitude,
language, fluency
• Rules and procedures: observation must be
replicable in other contexts and with other individuals
• Test (Carroll): a procedure designed to elicit certain
behavior from which one can make inferences
about certain characteristics of an individual
• Elicitation: to obtain a specific sample of behavior
• Interagency Language Roundtable (ILR) oral
interview: a test of speaking consisting of (1) a set
of elicitation procedures, including a sequence of
activities and sets of question types and topics; (2)
a measurement scale of language proficiency
ranging from a low level of 0 to a high level of 5.
• Years’ informal contact with a child to rate the
child’s oral proficiency: the rater did not follow
the procedures
• A rating based on a collection of personal letters to
indicate an individual’s ability to write effective
argumentative editorials for a news magazine.
• A teacher’s rating based on informal interactive
social language use to indicate the student’s ability
to perform cognitive/academic language functions.
• Definition of evaluation: the systematic gathering of
information for the purpose of making decisions
• Evaluation need not be exclusively
quantitative: verbal descriptions,
performance profiles, letters of reference,
overall impressions
Relation Between Evaluation,
Test, and Measurement
• 1. qualitative descriptions of student performance
for diagnosing learning problems
• 2. teacher’s ranking for assigning grades
• 3. achievement test to determine student progress
• 4. proficiency test as a criterion in second
language acquisition research
• 5. assigning code numbers to subjects in second
language research according to native language
What Is It: Measurement, Test, or
Evaluation?
placement test
classroom quiz
grading of composition
rating of classroom fast reading exercise
rating of dictation
Measurement Qualities
• A test must be reliable and valid.
• Reliability: test scores should be free from errors of measurement.
• If a student does a test twice within a short time,
and if the test is reliable, the results of the 2 tests
should be the same.
• If 2 raters rate the same writing sample, the ratings
should be consistent if they are to be reliable.
• The primary concern in examining reliability is to
identify the different sources of error, then to use
the appropriate empirical procedures for
estimating the effect of these sources of error on
test scores.
• Validity: the extent to which the inferences or
decisions are meaningful, appropriate and useful.
The test should measure the ability and very little else.
• If a test is not reliable, it is not valid.
• Validity is a quality of test interpretation and use.
• The investigation of validity is both a matter of
judgment and of empirical research.
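The test-retest idea above (the same students take the same test twice within a short time) can be illustrated with a small calculation. This is only a sketch: it assumes the Pearson correlation between the two sets of scores as the consistency estimate, which is one common choice and is not named on the slides, and the scores are invented for illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(first, second):
    """Pearson correlation between two paired lists of scores.

    Assumes both lists have some spread; identical scores in a list
    would make the denominator zero.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired lists with at least two scores each")
    m1, m2 = mean(first), mean(second)
    cov = sum((a - m1) * (b - m2) for a, b in zip(first, second))
    ss1 = sum((a - m1) ** 2 for a in first)
    ss2 = sum((b - m2) ** 2 for b in second)
    return cov / sqrt(ss1 * ss2)


# Hypothetical scores for five students on two administrations of one test.
test_1 = [55, 62, 70, 78, 85]
test_2 = [57, 60, 72, 77, 88]
retest_estimate = pearson_r(test_1, test_2)
print(f"test-retest correlation: {retest_estimate:.3f}")
```

A value close to 1 suggests the two administrations rank and space the students in much the same way; the same function could be applied to two raters' scores for the same writing samples.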
Reliability and Validity
• Both are essential to the use of tests.
• Neither is a quality of tests themselves: reliability
is a quality of test scores, while validity is a
quality of interpretations or uses that are made of
test scores.
• Neither is absolute: we can never attain perfectly
error-free measures, and the appropriateness of a particular
use of a test score depends upon many factors outside the test itself
Properties of Measurement
• 4 properties
• distinctiveness: different numbers assigned to
persons with different values
• ordered in magnitude: the larger the number, the
larger the amount of the attribute
• equal intervals: equal differences between numbers
represent equal differences in ability
• absolute zero point: the absence of the attribute
Four Types of Scales
• Nominal: naming classes or categories.
• Ordinal: an order with respect to each other.
• Interval: the distances between the levels are equal.
• Ratio: includes the absolute zero point
Nominal Scale
• Examples: license plate numbers; Social
Security numbers; names of people, places,
objects; numbers used to identify football players
• Limitations: cannot specify quantitative
differences among categories
Ordinal Scale
• Examples: letter grades (ratings from
excellent to failing), military ranks, order of
finishing a test
• Limitations: restricted to specifying
relative differences without regard to
absolute amount of difference
Interval Scale
• Examples: temperature (Celsius and
Fahrenheit), calendar dates
• Limitations: ratios are meaningless; the
zero point is arbitrarily defined
Ratio Scale
• Examples: distance, weight, temperature in
degrees Kelvin, time required to learn a
skill or subject
• Limitations: none except that few
educational variables have ratio characteristics
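The claim that ratios are meaningless on an interval scale can be checked with the temperature examples above. The sketch below uses two arbitrary illustrative readings (20 and 10 degrees Celsius) and the standard conversion formulas; the variable and function names are just for this example.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    return c + 273.15


warm_c, cool_c = 20.0, 10.0  # illustrative readings, not from the slides

# On the two interval scales the "ratio" depends on the arbitrary zero point.
ratio_celsius = warm_c / cool_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)

# Kelvin has an absolute zero, so this ratio is the physically meaningful one.
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(f"Celsius ratio:    {ratio_celsius:.3f}")
print(f"Fahrenheit ratio: {ratio_fahrenheit:.3f}")
print(f"Kelvin ratio:     {ratio_kelvin:.3f}")
```

The same pair of temperatures gives three different ratios, so "twice as hot" is not a valid reading of 20 versus 10 degrees Celsius. Differences, by contrast, stay comparable within each interval scale.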
Nominal, Ordinal, Interval or Ratio?
5 in IELTS
550 in TOEFL
C in BEC
8 in CET-4 writing
58 in the final evaluation of a student
Property and Type of Scale
Property              Nominal  Ordinal  Interval  Ratio
Distinctiveness          +        +        +        +
Ordering                 -        +        +        +
Equal intervals          -        -        +        +
Absolute zero point      -        -        -        +
Limitations in Measurement
• It is essential and important for us to
understand the characteristics of measures
of mental abilities and the limitations these
characteristics place on our interpretation of
test scores.
• These limitations are of two kinds:
limitations in specification and limitations
in observation and quantification.
Limitation in Specification
• Two levels of the specification of language ability
• Theoretical level
• Task: we need to specify the ability in relation to,
or in contrast to, other language abilities and other
factors that may affect test performance.
• Reality: large number of different individual
characteristics—cognitive, affective, physical—
that could potentially affect test performance make
the task nearly impossible.
Limitation in Specification
• Operational level
• Task: we need to specify the instances of language
performance as indicators of the ability we wish to measure.
• Reality: the complexity and the interrelationships
among the factors that affect performance on
language tests force us to make simplifying assumptions in
designing language tests and interpreting test scores.
• Our interpretations and uses of test scores
will be of limited validity.
• Any theory of language test performance
we develop is likely to be underspecified
and we have to rely on measurement theory
to deal with the problem of underspecification.
Limitations in Observation and Quantification
• All measures of mental ability are indirect,
incomplete, imprecise, subjective and relative.
• The relationship between test scores and the
abilities we want to measure is indirect. Language
tests are indirect indicators of the underlying traits
in which we are interested. Because scores from
language tests are indirect indicators of ability, the
valid interpretation and use of such scores depends
crucially on the adequacy of the way we have
specified the relationship between the test score
and the ability we believe it indicates. To the
extent that this relationship is not adequately
specified, the interpretations and uses made of the
test score may be invalid.
• The performance we observe and measure
in a language test is a sample of an
individual's total performance in that language.
• Since we cannot observe an individual's
total language use, one of our main
concerns in language testing is assuring that
the sample we do observe is representative
of that total use - a potentially infinite set of
utterances, whether written or spoken.
• It is vitally important that we incorporate
into our measurement design principles or
criteria that will guide us in determining
what kinds of performance will be most
relevant to and representative of the abilities
we want to measure, for example, real life
language use.
• Because of the nature of language, it is virtually
impossible (and probably not desirable) to write
tests with 'pure' items that test a single construct or
to be sure that all items are equally representative
of a given ability. Likewise, it is extremely
difficult to develop tests in which all the tasks or
items are at the exact level of difficulty
appropriate for the individuals being tested.
• As Pilliner (1968) noted, language tests are
subjective in nearly all aspects.
• Test developers
• Test writers
• Test takers
• Test scorers
• The presence or absence of language abilities is
impossible to define in an absolute sense.
• The concept of 'zero' language ability is a complex one.
• The individual with absolutely complete language
ability does not exist.
• All measures of language ability based on domain
specifications of actual language performance
must be interpreted as relative to some 'norm' of performance.
Steps in Measurement
• Three steps
• 1. identify and define the construct
• 2. define the construct operationally
• 3. establish procedures for quantifying observations
Defining Constructs
• Historically, there were two distinct
approaches to defining language proficiency.
• Real-life approach: language proficiency
itself is not defined, but a domain of actual
language use is identified.
• The approach assumes that if we measure
features present in language use, we
measure the language proficiency.
Real Life Approach: Example
• American Council on the Teaching of Foreign
Languages (ACTFL): definition of advanced level
• Able to satisfy the requirements of everyday
situations and routine school and work
requirements. Can handle with confidence but not
with facility complicated tasks and social
situations, such as elaborating, complaining, and
apologizing. Can narrate and describe with some
details, linking sentences together smoothly. Can
communicate facts and talk casually about topics
of current public and personal interest, using
general vocabulary.
Interactional/ability Approach
• Language proficiency is defined in terms of
its component abilities. These components
can be the skills of reading, writing, listening, and
speaking (Lado), a functional framework (Halliday),
or communicative frameworks (Munby)
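Example of Pragmatic Competence
• The knowledge necessary, in addition to
organizational competence, for
appropriately producing or comprehending
discourse. It includes
illocutionary competence, or the knowledge
of how to perform speech acts, and
sociolinguistic competence, or the knowledge of the
conventions which govern language use.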
Defining Constructs Operationally
• This step involves determining how to isolate the
construct and make it observable.
• We must decide what specific procedures we will
follow to elicit the kind of performance that will
indicate the degree to which the given construct is
present in the individual.
• The context in which the language testing takes
place influences the operations we would follow.
• The test must elicit language performance in a
standard way, under uniform conditions.
Quantifying Observations
• The units of measurement of language tests are
typically defined in two ways.
• 1. points or levels of language performance.
• From zero to five in oral interview
• Different levels in mechanics, grammar,
organization, content in writing
• Mostly an ordinal scale, therefore needing
appropriate statistics for ordinal scales.
• 2. the number of tasks successfully completed
Quantifying Observations
• 2. the number of tasks successfully completed
• We generally treat such a score as one with
an interval scale.
• Conditions for an interval scale
Quantifying Observations
• The performance tasks must be defined and selected in a
way that enables us to determine their relative
difficulty and the extent to which they represent
the construct being tested.
• Relative difficulty: determined from the
statistical analysis of responses to individual test items.
• How much they represent the construct: depends on
the adequacy of the theoretical definition of the construct.
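As an illustration of determining relative difficulty from responses to individual items, the sketch below computes the proportion of test takers who answer each item correctly (often called item facility). The slides do not name a specific index, so this choice is an assumption, and the response matrix is hypothetical.

```python
def item_facility(responses):
    """Proportion of correct answers per item.

    responses: one list per test taker, with 1 (correct) or 0 (incorrect)
    for each item, all lists of the same length.
    """
    n_takers = len(responses)
    n_items = len(responses[0])
    return [
        sum(taker[i] for taker in responses) / n_takers
        for i in range(n_items)
    ]


# Hypothetical data: four test takers, three items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]
facility = item_facility(responses)
print(facility)  # higher proportion = easier item
```

Items with a higher proportion correct are relatively easier for this group. The index depends on the sample of test takers, so it describes relative rather than absolute difficulty.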
Score Sorting
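• Raw score
• Grouping raw scores into score classes:
1. Range: $R$ = highest score - lowest score
2. Number of groups: $K = 1.87(N-1)^{2/5}$, where $N$ is the number of scores
3. Interval: $I = R/K$
4. Highest and lowest limits of each group
5. Arrange the data into groups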
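The five grouping steps above can be sketched in code. Several details are assumptions made for this sketch and are not given on the slide: scores are whole numbers, K is rounded to the nearest integer, and the interval width is taken as the integer part of R/K plus one so that the highest score always falls inside the last group. The function name `group_scores` is hypothetical.

```python
def group_scores(scores):
    """Return a list of (lower limit, upper limit, count) for each group."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    low, high = min(scores), max(scores)
    score_range = high - low                    # step 1: range R
    k = max(1, round(1.87 * (n - 1) ** 0.4))    # step 2: K = 1.87(N-1)^(2/5), rounded
    width = score_range // k + 1                # step 3: interval, widened to cover R
    counts = [0] * k
    for s in scores:                            # step 5: tally each score into its group
        counts[(s - low) // width] += 1
    # step 4: lower and upper limits of each group
    return [
        (low + i * width, low + (i + 1) * width - 1, counts[i])
        for i in range(k)
    ]


# Hypothetical data: 50 whole-number scores from 40 to 89.
table = group_scores(list(range(40, 90)))
for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```

Other rounding conventions for K and the interval are possible and would shift the group limits slightly.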
Central Tendency & Dispersion
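• Mean: $\bar{x} = \sum x / N$
• Median: the middle score when the scores are ranked in order
• Mode: the most frequent score, around which the bulk of
the data congregate
• Variance: $V = \sum (x - \bar{x})^2 / (N - 1)$
• Standard deviation:
$S = \sqrt{\sum (x - \bar{x})^2 / (N - 1)}$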
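The measures above can be computed directly from their formulas. The following is a minimal sketch with invented scores: the mean, variance, and standard deviation follow the formulas as written (dividing by the number of scores minus one), while the median and mode use Python's standard `statistics` module. The function name `describe` is hypothetical.

```python
import statistics
from math import sqrt


def describe(scores):
    """Central tendency and dispersion for a list of scores."""
    n = len(scores)
    mean_x = sum(scores) / n
    # Sum of squared deviations from the mean, divided by n - 1.
    variance = sum((x - mean_x) ** 2 for x in scores) / (n - 1)
    return {
        "mean": mean_x,
        "median": statistics.median(scores),  # middle of the ranked scores
        "mode": statistics.mode(scores),      # most frequent score
        "variance": variance,
        "sd": sqrt(variance),
    }


# Hypothetical scores for eight students.
summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With an even number of scores, `statistics.median` returns the average of the two middle scores.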

Measurement - Xiamen University