CHARACTERISING YOUR LANGUAGE TEST
LANGUAGE TESTING - UNIT 1
Sue Wharton


Introduction: How to use this module
Sometimes people use the term "language test" to refer just to the question paper. But of course a test is much more than this. In this module we will think of a test as a testing cycle: from specifying the nature and/or the components of the test, through preparing a question paper, candidates sitting the paper, marking, reporting results and analysing results, to feeding this analysis back into the test description.
As far as possible the units of the module follow the sequence of a testing cycle. In the first unit you will be asked to choose a test to study; this should be a test in which you are personally involved, as one of the writing/marking team, for which you prepare students and for which you have access to data. This could be an institutional test, or a "mock" test for a public examination. In subsequent units stages of the testing cycle will be discussed, and you will be asked to carry out tasks using data from your own test. So you may find yourself working through this module intermittently, over whatever period your own testing cycle takes.

How to use the module
Each of the units contains tasks in three categories: Reflection tasks, Analysis tasks, and Research tasks. Reflection tasks are to be completed before you read each unit. Analysis and research tasks are for after you have read the unit, together with any extra reading you feel you need.
For units 1 and 6, you should complete all the tasks. For units 2-5 inclusive, you should begin each unit with the reflection tasks but you are not expected to complete all the analysis and research tasks. Instead, you will need to plan a personal itinerary through the module, according to the way your chosen test works and the aspects of it that you want to concentrate on.

How to plan your itinerary
Please start by reading through all the units from beginning to end. Complete the reflection tasks but don't worry at this stage about analysis and research tasks or extra reading.
Then go through each unit again, and decide which aspects of the area covered you are going to concentrate on, and which analysis and research tasks you are going to do in units 2-5. Think about when and how you can obtain the necessary data for these tasks. Decide what reading you want to do. You should also start to form a plan for your assignment.
On the basis of the reading and thinking referred to above, write a plan of work - including, if you can, a draft proposal for an assignment. This is your itinerary through the module. If you would like to send your plan of work to me for feedback, you are welcome to do so.

A note on readings
In addition to the list of references for each unit, most units contain an annotated list of additional readings. You will find these especially useful when you have chosen a theme to focus on in your assignment.
A small set of books is referred to several times throughout the module. The books are:

Alderson, J.C., Clapham, C. & Wall, D. 1995: Language Test Construction and Evaluation. Cambridge: CUP.
Alderson, J.C. & North, B. (eds) 1991: Language Testing in the 1990s. Modern English Publications in association with the British Council.
Bachman, L. 1990: Fundamental Considerations in Language Testing. Oxford: OUP.
Brown, J.D. 1996: Testing in Language Programs. New Jersey: Prentice Hall.
Hughes, A. 1989: Testing for Language Teachers. Cambridge: CUP.

Access to any of these will be particularly useful. If you were thinking of buying just one, I would recommend either Alderson, Clapham & Wall or Brown.
I hope you enjoy this module. For me, it represents a new approach to teaching language testing - and so I would particularly like your feedback on it. Please do send me your comments.

Goals
In this unit we will look at some of the dimensions on which language tests may be characterised and at the nature and purpose of test specifications.
By the end of this unit you should be able to:
o Understand the relevance of various descriptive dimensions to your own testing context;
o Characterise your chosen test in detail;
o Assess the relationship between test specifications and test practice in your context;
o Draft a plan of work for the rest of the module.

Introduction
Before we can analyse a test in detail, and attempt to improve it, we need to build up a picture of its context, its intentions, and the mechanisms through which it attempts to realise those intentions. First we will discuss some of the dimensions on which language tests are usually characterised. This is a first step towards the characterisation of your own test. If you gain a deeper understanding of how your test is currently working, you will be better able to see, as you go through this module, where change might be beneficial.

Reflection task 1
Think about tests currently in use in your institution.
For each test ask yourself:
- what is its alleged function?
- how well is it currently performing that function?
List aspects of each test that are satisfactory and less satisfactory.

Reflection task 2
Which test do you think you would like to work on during this module?
What do you hope to achieve regarding it?

Descriptive dimensions
Test purpose

Perhaps the first thing to consider about your test is its purpose. Most books on language testing distinguish four main categories of test: Placement, Diagnostic, Achievement and Proficiency. (See for example Alderson, Clapham & Wall 1995 ch 1; Brown 1996 ch 1; Hughes 1989 ch 3.)
Placement tests, as their name suggests, are designed to place students in an appropriate course or class for their language level. They are frequently used when students enter a given institution from outside: the students may have previous qualifications in English, but the institution wishes to make its own assessment of their ability vis-à-vis the levels/classes offered in the institution. Placement tests may be used in settings such as universities to determine whether students may proceed straight to their content area classes, or will be required to take an EAP course first.
Diagnostic tests are usually used within a class to identify areas where a student is having particular difficulty. These tests are most useful at a general level, to provide information about, e.g., a student's relative ability in the four macro-skills, or about their ability to produce language in different registers. More detailed diagnostic tests (e.g. to test mastery of a particular grammar point) tend only to be constructed if difficulty with that area is already anticipated. A detailed diagnostic test covering a wide range of elements of the language system would of course be far too long!
Achievement tests are usually related to a syllabus or set of objectives. Their purpose is to establish the level of student attainment vis-à-vis that syllabus or set of objectives. Achievement tests may be administered at the end of a course of study, or part-way through; in the latter case they of course have a formative function as well as an assessment function.
Proficiency tests on the other hand are not related to a particular syllabus. They are intended to assess language ability independently of the courses of study that individual students may have followed. Proficiency tests may be general, or they may have a "specific purposes" orientation. Their content is likely to be based on a model of the nature of language ability, or possibly on a "needs analysis" of the language required in a particular setting.
These four categories can undoubtedly be useful in characterising a given test, as long as we remember that the categories are not as distinct in the world as they may appear to be on paper. (For example, Cambridge FCE and Proficiency are both world-wide exams intended to test general English ability - in this sense they are proficiency tests. But they have spawned a large industry of textbooks and preparation courses; for students who have followed such a course, they are also tests of achievement.) So it would certainly be a mistake to assume that different purposes lead to radically different types of question papers, or that it is possible to assess the purpose of a test simply by looking at an example of a question paper. In many educational and social contexts a given test may be fulfilling a number of purposes.

The candidates
A test is not just a question paper or a set of specifications: far more importantly, it is what happens when real candidates interact with these. So an important aspect of the description of a test is the description of its intended candidature. Any and all details are relevant: age, gender, linguistic/cultural/educational background, interests, previous experience of language tests, and what candidates want to do with the language in future. For tests within an institution, we may believe that all of these variables are known. And yet there is still an important case, in my opinion, for making them explicit in the description of a test and in any specifications: if they are written down, they can become an anchor for the test development process.

Reference
A third important dimension by which tests can be described is the distinction between norm-referenced (NR) and criterion-referenced (CR) tests. In a norm-referenced test, the score of an individual is reported, and should be interpreted, with reference to the scores of other test-takers. Such tests provide information about the relative ability of candidates, and tests which have this purpose are often constructed so as to produce as wide a spread of marks as possible.
Scores on criterion-referenced tests should be interpreted with reference to a specific content domain or set of objectives. They provide information about the ability of each candidate to meet pre-specified conditions. Such tests are not particularly expected to produce a wide spread of marks.
The decision about which approach to score referencing is most appropriate will depend, of course, on the purpose of the test. Common sense would indicate that in most cases a criterion-referenced approach is most suitable for achievement and diagnostic tests, and a norm-referenced approach for placement tests. Proficiency tests, too, are often norm-referenced, and this relates to a historical assumption: that language proficiency is a psychological trait which, like other psychological traits, is normally distributed in the population. (For an introductory discussion of the concept of the "normal distribution", see Butler 1985 ch 4.)
This assumption about traits has been significant in educational measurement, and it means that the norm-referenced approach to testing is the longer established of the two. Many statistical techniques associated with the analysis of test results were developed for norm-referenced tests, and are not always suitable for criterion-referenced tests. We will discuss this issue further later in the module.
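
To make the norm-referenced/criterion-referenced contrast concrete, here is a minimal sketch (in Python) of the two ways of interpreting the same raw scores. The score data and the cut-off of 30 are invented for the illustration; nothing here is drawn from any real test.

    # A minimal sketch contrasting norm-referenced (NR) and
    # criterion-referenced (CR) interpretation of the same raw scores.
    # All data and the cut-off value are invented for illustration.
    from statistics import mean, stdev

    scores = [24, 31, 27, 35, 29, 22, 33, 28, 30, 26]  # hypothetical marks out of 40

    def percentile_rank(score, all_scores):
        """NR view: where does this score stand relative to the group?"""
        below = sum(1 for s in all_scores if s < score)
        return 100 * below / len(all_scores)

    def meets_criterion(score, cut_off=30):
        """CR view: does this score meet a pre-specified standard,
        regardless of how the other candidates performed?"""
        return score >= cut_off

    for s in sorted(set(scores)):
        print(f"score {s}: percentile {percentile_rank(s, scores):3.0f}, "
              f"criterion met: {meets_criterion(s)}")

    # The NR interpretation depends on the group (its mean and spread);
    # the CR interpretation would be unchanged if every candidate scored higher.
    print(f"group mean {mean(scores):.1f}, sd {stdev(scores):.1f}")

Note that the percentile figures would shift if a stronger cohort sat the same paper, while the criterion decisions would not: this is exactly the difference between the two approaches to score referencing.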
For an interesting commentary on the importance of the norm-referencing/criterion-referencing distinction, see Brown 1996 ch 1 (provided with this unit). Brown uses the distinction as one of the organising principles of his book.

Backwash
Backwash is usually described as the effect that testing practices have on teaching practices. It is of course impossible to quantify or directly describe such an effect (for an interesting discussion see Alderson & Wall 1993), but most people involved in testing and teaching would agree that interrelationships do exist. An important part of the description of a test, then, is the description of any backwash effect that it hopes to have or is believed to have.
That is not to say that backwash is entirely determined by test writers: it also depends on the ways in which teachers and learners interpret test requirements, and work with them in the classroom (Prodromou 1995, Wharton 1996). In an institutional setting, where the teachers and the test writers are one and the same, there is an excellent opportunity to work for congruence between the aims and the methods of teaching and testing. Hughes (1989) devotes a chapter (ch 6) to this topic, entitled "Achieving beneficial backwash".

The language of the test
This dimension includes a description both of the language candidates will encounter on the test and of the language they will be expected to produce. We can ask ourselves what text types are included in the test: what are their sources, their topics, and their supposed addressees? What, in fact, are the criteria on which texts are chosen?
We may also ask ourselves what language skills candidates are expected to display and what elements of the language system are targeted. For a highly structured test, it will be easy to answer these questions. For a more open-ended test, the language that candidates are expected to produce will be less circumscribed. An analysis of tasks will give an idea of the sort of language candidates are likely to produce, but there will certainly be an element of unpredictability.

Question or task types
These may be characterised from a variety of perspectives. They may be more open or more closed, thus requiring subjective or objective marking respectively. They may aim to test knowledge of a discrete element of the language system (e.g. a sentence transformation test, designed to test knowledge of a certain structure) or they may draw on a wider base of knowledge and skills (e.g. a listening comprehension test). Perhaps they seek to emulate some part of the candidates' presumed target language situation.
Different aspects of language may well be tested by different methods. We should consider whether the method of testing is intended to be independent from the ability tested (as in for example a multiple-choice test of grammar) or whether the method is seen as part of the ability the question seeks to measure (as in for example a letter-writing task).

Criteria for assessment and pass marks
One cannot write about everything at once, and possibly for that reason many texts on language testing - including this module - deal with the design of questions and the assessment of responses in different sections. But of course there is a danger in this, since an essential part of the question design process is consideration of the criteria for assessment of responses. In characterising a test we should look at questions and assessment criteria as a whole.
Related to the assessment criteria is the question of how "pass" marks are determined. Is it necessary for candidates to pass all sections of the test or can high marks on one section compensate for low marks on another? Are pass marks set out in advance, or decided for each administration of the test? In the latter case, how is the decision reached? All of these issues are intimately related to the purpose of the test, to its domain of reference, and to its backwash effect.
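
The contrast between compensatory marking, where high marks on one section can offset low marks on another, and a conjunctive rule, where every section must be passed, can be expressed as two simple decision rules. The sketch below is purely illustrative: the section names, pass marks and scores are all invented.

    # Two hypothetical pass-mark rules for a sectioned test. The section
    # names, cut-offs and scores are invented for illustration.

    section_scores = {"listening": 62, "reading": 48, "writing": 71}  # percentages
    section_pass_mark = 50
    overall_pass_mark = 55

    def conjunctive_pass(scores):
        """Candidates must reach the pass mark on every section."""
        return all(s >= section_pass_mark for s in scores.values())

    def compensatory_pass(scores):
        """A strong section can offset a weak one: only the (equally
        weighted) average must reach the overall pass mark."""
        return sum(scores.values()) / len(scores) >= overall_pass_mark

    print("conjunctive:", conjunctive_pass(section_scores))    # False: reading is 48
    print("compensatory:", compensatory_pass(section_scores))  # True: mean is about 60

A real scheme might of course weight sections differently, or combine the two rules (an overall pass mark plus a minimum score per section); the point is that the rule chosen should follow from the purpose of the test.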
In this section, then, we have discussed a number of descriptive dimensions on which language tests may be characterised. These dimensions provide a framework for the analysis of a test and they also provide a basis for test specifications. It is to these that we turn next.

Test specifications
Test specifications treat "the test" as an abstraction above any particular question paper or test administration. They are a detailed statement of policy for this "abstract" test from which any real version of the test should be derived. The descriptive dimensions which we saw above can form a basis for the writing of test specifications, which will have several roles: to guide test construction, to guide decisions about assessment mechanisms, and also to provide information to any "outside user" who wishes to understand what a "pass" on this particular test actually means.
In a large-scale project to develop a new test, it would be usual to work out the test specifications first and to proceed from there to operational decisions. In small-scale situations, test specifications may never be written down at all, with test writers relying on past question papers, and on their shared knowledge of what the test is supposed to be about, to guide them.
It is difficult to draw up test specifications in a vacuum. (Or perhaps I should say it is all too easy, and the difficulty comes when one tries to actually work from such specifications!) And it can be dangerous to work with no specifications at all. In many contexts perhaps, test developers find benefit in "to-ing and fro-ing" from actual test questions/ mark schemes etc, to abstract specifications. By a combination of description and speculation, a policy statement for a test may be achieved.
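
One low-tech way of supporting this to-ing and fro-ing is to hold the specifications as a structured record whose fields can be checked against each draft question paper. The sketch below shows one possible shape for such a record, using the descriptive dimensions of this unit as field names; the field set and all the example values are invented, not a standard format.

    # One possible structure for test specifications, with fields drawn
    # from the descriptive dimensions discussed in this unit. The field
    # set and the example values are invented for illustration.
    from dataclasses import dataclass

    @dataclass
    class TestSpecification:
        purpose: str              # placement / diagnostic / achievement / proficiency
        candidature: str          # who the intended candidates are
        referencing: str          # "norm-referenced" or "criterion-referenced"
        intended_backwash: str    # the effect on teaching the test hopes to have
        test_language: list[str]  # text types and targeted skills/elements
        task_types: list[str]
        assessment_criteria: str
        pass_mark_policy: str

    example_spec = TestSpecification(
        purpose="achievement, end of intermediate course",
        candidature="adult learners, mixed L1s, preparing for EAP study",
        referencing="criterion-referenced",
        intended_backwash="encourage extensive reading and process writing",
        test_language=["short authentic articles", "informal letters"],
        task_types=["gapped summary", "guided letter-writing task"],
        assessment_criteria="analytic scales: task fulfilment, organisation, accuracy",
        pass_mark_policy="fixed in advance; all sections must be passed",
    )

Whatever form it takes, the value of writing the specification down is that each draft paper can then be compared against it field by field.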
An example of a short set of specifications for a fictional test is given in Alderson, Clapham & Wall 1995 pp 14-16, printed here as appendix 1. Other examples are available in the form of specifications for public examinations, obtainable from the relevant examining body.

Conclusion
The aim of this unit is to help you to choose, and then to characterise, the test that you are going to work on during this module. I hope that by the time you complete the analysis tasks you will have not only a description of the test, but also a sharpened sense of how well it is working for you. As you move through the current cycle of your test, the units of this module will help you to evaluate it critically in a great deal more detail, and to improve it where appropriate.

Analysis task 1
Using the descriptive dimensions discussed in this unit, characterise your test in as much detail as possible.

Analysis task 2
Is there a written specification for your test?
If yes - compare this document with the results of your analysis above.
How close is the test, currently, to its specifications?
If there is a mismatch, what do you think needs to change?
If no - try to draft some specifications. There is a model for test specifications in Alderson, Clapham & Wall 1995 pp 14-16 which could help you.
Then show your draft specifications to a colleague also involved with this test, and get their opinion on the adequacy of your description.

Research task
Plan your work for the rest of the module. For details on how to do this, please see the introduction at the beginning of this module.

References
Alderson, J.C. & Wall, D. 1993: "Does washback exist?" Applied Linguistics 14/2: 115-129.
Alderson, J.C., Clapham, C. & Wall, D. 1995: Language Test Construction and Evaluation. Cambridge: CUP.
Brown, J.D. 1996: Testing in Language Programs. New Jersey: Prentice Hall.
Butler, C. 1985: Statistics in Linguistics. Oxford: Basil Blackwell.
Hughes, A. 1989: Testing for Language Teachers. Cambridge: CUP.
Prodromou, L. 1995: "The backwash effect: from testing to teaching." ELTJ 49/1: 13-25.
Wharton, S. 1996: "Testing innovations" in Willis, J. & Willis, D. (eds) Challenge and Change in Language Teaching. London: Heinemann.

Supplied reading
Brown, J.D. 1996: Testing in Language Programs. New Jersey: Prentice Hall. Chapter 1.

Additional reading
Skehan, P. 1988: "State of the art article: Language testing (part 1)." Language Teaching 21/4: 211-221.
Skehan, P. 1989: "State of the art article: Language testing (part 2)." Language Teaching 22/1: 1-13.

Although now ten years old, these articles give a succinct yet probing overview of issues in language testing. They include plenty of references to actual tests, which illustrate well the concepts discussed. You should note, though, that many of the public tests referred to have since changed.

Wood, R. 1991: Assessment and Testing: A Survey of Research. Cambridge: CUP. Chapter 19: Language Testing.

Wood's book is an overview of research into assessment and testing in all subjects, from the perspectives of most interest to a UK exam board, i.e. concentrating mainly on large-scale assessment of those in public education.
The chapter on language testing assesses some of the main trends in the field from the perspective of their practical usefulness.

Coleman, H. 1991: "The testing of appropriate behaviour in an academic context" in Adams, P., Heaton, B. & Howarth, P. (eds) 1991: Socio-Cultural Issues in English for Academic Purposes. Modern English Publications.

This paper looks at the needs-analysis approach to test specification. Focusing on the 1987 IELTS revision project, it looks at the complexities of specifying target situation behaviour and using such an analysis as the basis for test questions.