Changes in Test Scores with Multiple Sittings of CanTEST
Philip Nagy
Rationale
Research Questions
• Do test scores change on repeating the test?
• Is change related to length of time between sittings?
Test Development Questions
• Can data from repeaters be used in test calibration for
new form development?
Context: Receptive Skills
Official Languages and Bilingualism Institute
The Data
Listening Tests: Six forms, each with 15 short-passage items and 25 long-passage items
Reading Tests: Seven forms, each with 15 skim-and-scan, 20 reading passage, and 25 cloze items
The Sample: Mean first score of 3.6, compared to 4.3 for those who write only once
Assumptions
• Difficulty of forms is balanced across sittings (true)
• Samples writing each form are equivalent (untested)
Listening Results: Sitting 2 minus Sitting 1 (N=179)
Change in Raw Score    Total Test (40)    Short Passages (15)    Long Passages (25)
Down 11 or more        3                  -                      1
Down 6 to 10           18                 2                      11
Down 3 to 5            18                 24                     22
Same ± 2               43                 91                     72
Up 3 to 5              42                 42                     46
Up 6 to 10             36                 20                     24
Up 11 or more          19                 -                      3
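The seven change bands can be collapsed into three broader groups: down by 3 or more, within ±2, and up by 3 or more. The following is a minimal Python sketch of that tally using the listening counts. The names `listening_counts` and `collapse` are illustrative only, and empty table cells are assumed to mean zero test takers.

```python
# Counts of test takers per change band, ordered from largest drop to largest gain:
# down 11+, down 6-10, down 3-5, same +/-2, up 3-5, up 6-10, up 11+
# Assumption: blank cells in the slide table are treated as 0.
listening_counts = {
    "Total Test": [3, 18, 18, 43, 42, 36, 19],
    "Short Passages": [0, 2, 24, 91, 42, 20, 0],
    "Long Passages": [1, 11, 22, 72, 46, 24, 3],
}


def collapse(counts):
    """Collapse seven change bands into (down, same, up) percentages."""
    n = sum(counts)
    down, same, up = sum(counts[:3]), counts[3], sum(counts[4:])
    return tuple(round(100 * x / n, 1) for x in (down, same, up))


for name, counts in listening_counts.items():
    # Each column should account for all 179 repeaters.
    print(name, sum(counts), collapse(counts))
```

Under these assumptions every column sums to 179, and the collapsed percentages agree with a three-category summary to within about one percentage point.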
Listening Results, another look
Change in Raw Score    Total Test (40)     Short Passages (15)    Long Passages (25)
Down some              22%                 15%                    19%
About the same         24%                 51%                    40%
Up some                54%                 34%                    41%
Mean raw gain          2.6                 1.3                    1.3
Mean % gain            6.5% of 40 items    8.8% of 15 items       5.2% of 25 items
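The mean % gain is the mean raw gain divided by the number of items, which is the same as the change in the average proportion correct per item. A small Python sketch of that arithmetic follows. The names are illustrative, and the inputs are the rounded means reported on the slide, so the short-passage result comes out slightly below the reported 8.8%, presumably because the slide value was computed from an unrounded mean.

```python
# Mean raw gains and item counts as reported for the listening test.
mean_raw_gain = {"Total Test": 2.6, "Short Passages": 1.3, "Long Passages": 1.3}
n_items = {"Total Test": 40, "Short Passages": 15, "Long Passages": 25}


def percent_gain(raw_gain, items):
    """Mean raw gain expressed as a percentage of the number of items."""
    return 100 * raw_gain / items


for name in mean_raw_gain:
    print(name, round(percent_gain(mean_raw_gain[name], n_items[name]), 1))
```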
Listening Results Interpretation
How important is the improvement?
• On average, 3.6 points (out of 40) are needed to improve one band
• So, a gain of 2.6 points is about 75% of a one-band improvement
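The band comparison is a simple ratio of the mean gain to the points needed per band. As a rough check in Python (the function name is hypothetical), the ratio comes to roughly 0.72, broadly in line with the "about 75%" figure:

```python
def fraction_of_band(mean_gain, points_per_band):
    """Express a mean raw-score gain as a fraction of one band step."""
    return mean_gain / points_per_band


listening_fraction = fraction_of_band(2.6, 3.6)
print(f"{listening_fraction:.2f}")
```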
Listening Results Interpretation
Can the data be used for test calibration?
• The changes in average item difficulty are different for the subtests
  • .088 for short passages
  • .052 for long passages
• The difference of .036 (.088 - .052) is about the same as the standard error of the difficulty indices
• Listening data from repeaters should not be
used for item calibration
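The slides do not say how the standard error of a difficulty index was obtained. One common approximation, assumed here purely for illustration, treats an item's difficulty as a binomial proportion p with standard error sqrt(p(1 - p)/N). With N = 179 and a mid-range p = 0.5 (both the formula and the value of p are assumptions, not figures from the study), the result is close to the .036 difference:

```python
import math


def se_proportion(p, n):
    """Binomial standard error of a proportion p estimated from n test takers."""
    return math.sqrt(p * (1 - p) / n)


short_change, long_change = 0.088, 0.052
difference = short_change - long_change   # difference between subtest changes
se = se_proportion(0.5, 179)              # assumed p = 0.5, N = 179 repeaters

print(round(difference, 3), round(se, 3), round(difference / se, 2))
```

Under these assumptions the difference is a little under one standard error, which is consistent with the statement that the two are about the same size.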
Changes in Listening by Length of Time between Sittings
Time Between Tests    Total Test    Short Passages    Long Passages
> 6 months (N=63)     +2.13         +0.63¹            +1.49
< 6 months (N=116)    +2.87         +1.69¹            +1.18
¹ Difference significant, p=0.05
Those who repeat sooner do better than those who repeat later
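As a consistency check, the two group means can be combined into an overall mean gain by weighting each by its group size. The sketch below uses illustrative names, and the weighted result of about 2.6 matches the overall mean raw gain reported for the total listening test.

```python
# Total-test gains by time between sittings, with group sizes.
groups = [
    {"label": "> 6 months", "n": 63, "total_gain": 2.13},
    {"label": "< 6 months", "n": 116, "total_gain": 2.87},
]


def weighted_mean(groups, key):
    """Mean of `key` across groups, weighted by each group's size n."""
    total_n = sum(g["n"] for g in groups)
    return sum(g["n"] * g[key] for g in groups) / total_n


overall_gain = weighted_mean(groups, "total_gain")
print(round(overall_gain, 2))
```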
Reading Results: Sitting 2 minus Sitting 1 (N=284)
Change in Raw Score    Total (80)    Skim-&-Scan (15)    Reading Passage (20)    Cloze (25)
Down 21 or more        17            -                   -                       -
Down 11 to 20          19            2                   -                       12
Down 6 to 10           21            12                  18                      32
Down 3 to 5            28            32                  30                      34
Same score ± 2         46            139                 142                     106
Up 3 to 5              33            65                  63                      52
Up 6 to 10             47            31                  23                      36
Up 11 to 20            48            3                   8                       12
Up 21 or more          25            -                   -                       -
Note: Reading Score is doubled to give a total out of 80 rather than 60.
Reading Results, another look
Change in Raw Score    Total (80)    Skim-&-Scan (15)    Reading Passage (20)    Cloze Passage (25)
Down some              30%           16%                 17%                     27%
About the same         16%           49%                 50%                     37%
Up some                54%           35%                 33%                     35%
Reading Results Interpretation
How important is the improvement?
• On average, 6.5 points (out of 80) are needed to improve one band
• So, a gain of 3.45 points is about 55% of a one-band improvement
Reading Results Interpretation
Can the data be used for test calibration?
• The changes in average item difficulty are different for the subtests
  • +0.072 for skim-and-scan
  • +0.050 for reading passages
  • +0.002 for cloze
• The largest difference of .070 (.072 - .002) is two to three times larger than the standard error of the difficulty indices
• Reading data from repeaters should not be used
for item calibration
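The largest gap can be found by comparing every pair of subtests. The sketch below does that and then relates the gap to an approximate standard error. The binomial formula sqrt(p(1 - p)/N) with p = 0.5 and N = 284 is an assumption made for illustration, since the slides do not state how the standard error was computed, and the names are hypothetical.

```python
import math
from itertools import combinations

# Change in average item difficulty for each reading subtest.
changes = {"skim-and-scan": 0.072, "reading passages": 0.050, "cloze": 0.002}


def largest_gap(changes):
    """Return the pair of subtests whose changes differ the most."""
    return max(
        combinations(changes, 2),
        key=lambda pair: abs(changes[pair[0]] - changes[pair[1]]),
    )


pair = largest_gap(changes)
gap = abs(changes[pair[0]] - changes[pair[1]])
se = math.sqrt(0.5 * 0.5 / 284)  # assumed binomial SE at p = 0.5, N = 284

print(pair, round(gap, 3), round(gap / se, 2))
```

Under these assumptions the gap between skim-and-scan and cloze is a little over twice the standard error, which falls within the "two to three times" range.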
Changes in Reading by Length of Time between Sittings
Time Between Tests     Total (80)    Skim-&-Scan    Reading Passage    Cloze Passage
> 6 months (N=105)     -0.119        -0.292¹        -0.017             -0.079
< 6 months (N=179)     +0.070        +0.171¹        +0.010             +0.046
¹ Difference significant, p=0.05
Those who repeat later actually do worse than those who repeat sooner
Conclusion
• Listening:
  • 30% of sample do more poorly on 2nd sitting
  • Average gain is 75% of a band score
  • Gains differ across item types by about one item standard error
• Reading:
  • 40% of sample do more poorly on 2nd sitting
  • Average gain is 55% of a band score
  • Gains differ across item types by 2-3 times an item standard error
• Both:
  • Those who rewrite within six months do better
  • Data from repeaters should not be used for item calibration