Principles in language testing

What is a good test? What is the purpose of testing?
• The purpose of testing is to obtain information on the language skills of the learners.
• Information is very costly. The more specific it is, the more cost it involves.
– Is language testing targeting specific information?
– Costs here involve human and material resources and TIME.
– Once an institution/teacher has decided that the information is needed, it/he should be ready to meet the costs.

Types of tests
• Achievement tests (final or progress)
• Proficiency tests
• Pro-achievement tests
• Diagnostic tests
• Placement tests

Test marking
• Assessment scale (also: rating scale)
– criteria by which performances at a given level will be recognized
– levels of performance:
• 10 (excellent), 9 (very good), 8 (good)
• bands 0–9 in IELTS
• 1–100 pts in the national English examination
– level descriptors – verbal descriptions of performances that illustrate each level of competence on the scale

Communicative language competences
• Linguistic competences
– lexical, grammatical, semantic, phonological, orthographic, orthoepic
• Sociolinguistic competences
– markers of social relations, politeness conventions, expressions of folk wisdom, register differences, dialect and accent
• Pragmatic competences
– discourse competence (the ability to arrange sentences in proper sequence), functional competence (requests, invitations, etc.)
(adapted from CEFR 2001)

Competences vs. skills
• Competences are tested through skills.
• The four major skills are subdivided into minor subskills:
– reading comprehension:
• reading for general orientation
• reading for information
• reading for main ideas
• reading for specific information
• reading for implications, etc.
(CEFR 2001)

What is good testing?
• It is valid.
• It is reliable.
• It is practical.
• It has a positive impact on the teaching process.

• VALIDITY
• RELIABILITY
• PRACTICALITY
• WASHBACK EFFECT

Test validity
• It shows the appropriateness of the test; OR
• It shows that a test tests what it is supposed to test; OR
• A test is valid if it measures accurately what it is intended to measure.
• To establish that a test is valid, empirical evidence is needed. The evidence comes from different sources…

Types of validity
• Construct validity:
– the extent to which a test measures the underlying psychological construct (“ability, capacity”)
– the extent to which a test reflects the essential aspects of the theory on which that test is based
– an overarching notion of validity reflected in many subordinate forms of validity

In a more complicated way…
• If a test does not have construct validity, test scores will show CONSTRUCT-IRRELEVANT VARIANCE.
– E.g. in an advanced speaking test, candidates may be asked to speak on an abstract topic. Personal engagement in the topic, however, may weaken or improve the performance. BUT: previous knowledge of the abstract topic should not be assessed.

Types of validity
• Content validity:
– the extent to which a test adequately and sufficiently measures the particular skills it sets out to measure (cf. test specifications)
• Response validity:
– … test takers respond in the way expected by the test developers
• Predictive validity:
– … a test accurately predicts future performance
• Concurrent validity:
– … scores on one test relate to scores on another, external measure
• Face validity:
– … a test appears to measure whatever it claims to measure
(Hughes 2003: 26–35)

Types of validity
• Nearly 40 different types have been collected on a language testers’ forum…
• The more different types of validity are established in a test, the more valid that test is considered to be.
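Concurrent validity is typically quantified by correlating scores on the test with scores on the external measure. A minimal sketch of that calculation, using invented scores for five candidates (the numbers are illustrative, not real exam data):

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two lists of scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented scores: the same five candidates on the new test
# and on an established external measure.
new_test = [55, 68, 72, 80, 91]
external = [58, 65, 75, 78, 93]
print(round(pearson_r(new_test, external), 3))
```

A coefficient close to 1 would support a claim of concurrent validity; a low coefficient would suggest the two instruments measure different things.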
Test reliability
• Quality of test scores resulting from test administration:
– accuracy of marking and fairness of scores
– consistency of marking:
• similar scores on different days
• similar scores from different markers
– inter-rater reliability
– intra-rater reliability

Factors influencing reliability
1. The performance of test takers
   1. a sufficient number of items
   2. restricted freedom of test behaviour
   3. unambiguous items, clear instructions and rubrics
   4. layout, good copies, familiar format
   5. proper administration
2. The reliability of scorers
   1. objective scoring vs. subjective scoring
   2. restricting freedom of response
   3. a detailed scoring/marking key

Test feasibility/practicality
• It is the ease with which the items/tasks can be replicated in terms of the resources needed, e.g. time, materials, people.

Washback effect (sometimes ‘backwash’)
• It is a type of impact of examinations/tests on the classroom situation.
• Washback may be positive or negative.

How to achieve positive washback?
1. Test the abilities/skills whose development you want to encourage.
2. Sample widely and unpredictably.
3. Use direct testing.
4. Make testing criterion-referenced.
5. Base achievement tests on objectives.
6. Make sure that the test is known and understood by students and other teachers.

References and additional reading
1. Alderson, J. C., C. Clapham and D. Wall. 1995. Language Test Construction and Evaluation. Cambridge: CUP.
2. Hughes, A. 2003. Testing for Language Teachers. 2nd ed. Cambridge: CUP.
3. Council of Europe. 2001. Common European Framework of Reference for Languages. Cambridge: CUP.
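The idea of inter-rater reliability above can be sketched as a simple agreement check between two markers. This is a minimal, illustrative calculation with invented band scores (operational exams use stronger statistics, such as Cohen’s kappa or correlation coefficients):

```python
def exact_agreement(rater_a, rater_b):
    """Proportion of scripts given the same band by both raters —
    a crude indicator of inter-rater reliability."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Invented bands awarded by two raters to the same six scripts.
rater_a = [7, 6, 8, 5, 9, 6]
rater_b = [7, 6, 7, 5, 9, 7]
print(exact_agreement(rater_a, rater_b))  # 4 of 6 scripts receive the same band
```

Running the same check on one rater’s marks from two different occasions would give an analogous rough indicator of intra-rater reliability.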