Valid and fair assessments

Large-scale standardised tests are supposed to be permeating our education system at a fast pace. Punjab, through the Punjab Examination Commission (PEC), has been administering examinations to all children attending public schools at the end of 5th and 8th class. Sindh has done the same by outsourcing the development and administration of examinations at the same grade levels to a private firm.

In addition to these large-scale tests, the provinces have also been experimenting with small-scale sample-based educational assessments. Most of these tests are supposed to be standardised. But what is standardisation all about? We must be able to understand the nature of standardised tests, see how they are different, and see why we don’t have them yet. Sometimes, large-scale tests are erroneously referred to as standardised even when they have not been put through the technical process of standardisation.

In order to extend this discussion, let us first take a look at our current perspectives on testing. Our understanding of examinations and assessments is tainted by our personal encounters with the traditional system of examinations. As a student, the only examinations I, and most other children, cared about were the dreaded, as well as much anticipated, annual examinations. These examinations carried very high stakes for us as they do for our children. Passing them meant promotion to a new grade and, more importantly, a bag full of new books, new stationery, and several other goodies. Failing them meant being left behind in unacceptable ways.

Also, all I cared about was the raw score, read marks, which I obtained on a test. The marks were always out of a total of 100, except for some subjects that had a little less or more weightage in the scheme of studies. Marks were the only thing that distinguished between the good students and the not-so-good students.

Yet, as we know now, the raw score has never been a good way of discriminating between students taking different tests at the same level. For example, consider candidates A and B, who appear in the secondary examinations organised by two different boards of secondary and intermediate education in Punjab. A gets 80 out of 100 in a certain subject and B gets 70 out of 100. Can we infer that A was more able than B? Traditionally, this would be our conclusion. But this conclusion is wrong since it does not consider the possible variation in the degree of difficulty of the two papers set by two different paper setters in two different places. The raw score was a crude criterion for comparison. It continues to be so for most students, parents, and teachers.

As a schoolteacher, I also let the expectations about terminal examinations determine the content of what I taught my students. The topics were divided into ‘very important’, ‘important’, and ‘not important’. My teaching was seldom designed to cover the curriculum, but to ensure that my students obtained mastery in all ‘very important’ and ‘important’ topics. My purpose as a teacher was to do everything I could to increase their chances of getting more and more marks in terminal exams. I was not alone in teaching to the test. Nearly every schoolteacher I knew then had similar concerns and objectives. In my early years as an education researcher, I learnt the phrase WYTIWYG (What You Test is What You Get). The nature of tests and their stakes drive the teaching and learning in the classrooms.

The onset of standardisation of tests, or standardised tests, does not change the basic truth about teaching and learning embodied in WYTIWYG. However, it does change the way the tests are constructed, administered, and interpreted. Raw scores lose much of their traditional value. Absolute marks obtained by a student do not solely determine his or her location relative to other students who may be taking a slightly more difficult or slightly easier test at the same level. Different versions of the same test may vary in the degree of difficulty of the items they contain.

Relying on raw scores is extremely unfair, because a student with lower ability taking an easier version of the test may obtain higher scores than a student with higher ability taking a more difficult version of the same test. Scaled scores resolve this problem through test equating, i.e., by adjusting for relative difficulty levels in different editions of the same test.
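To make the idea of equating concrete, here is a minimal sketch of one simple technique, linear (mean-sigma) equating. It assumes the two versions of the test were taken by groups of comparable ability; the function name and the scores are hypothetical, and real testing programmes typically use more elaborate designs.

```python
from statistics import mean, pstdev

def linear_equate(score, form_scores, reference_scores):
    """Place a raw score from one test form onto the scale of a reference form.

    Assumption (not from the article): both forms were taken by groups of
    comparable ability, so differences in mean and spread reflect the forms.
    """
    slope = pstdev(reference_scores) / pstdev(form_scores)
    return mean(reference_scores) + slope * (score - mean(form_scores))

# Hypothetical raw scores on an easier and a harder version of the same test.
easy_form = [60, 70, 80, 90]
hard_form = [50, 60, 70, 80]

# A raw score of 70 on the harder form, expressed on the easier form's scale.
equated_b = linear_equate(70, hard_form, easy_form)
print(equated_b)
```

In this made-up data the harder form runs ten marks lower on average with the same spread, so 70 on the harder form is treated as equivalent to 80 on the easier one.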

But how do we ever find out if a particular test question (test questions are also called test items in technical parlance) is more or less difficult? No, we cannot make that decision through our own subjective judgment. We do not know about the relative difficulty of particular items unless we put them through the empirical test of giving them to a sample of students and then analysing the results. If most students respond correctly to an item, it may be concluded that it has a lower difficulty level than another item that elicits fewer correct responses.
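The empirical check described above can be sketched in a few lines. This is an illustration only: the pilot data are invented, and the function name is hypothetical. It computes, for each item, the proportion of students who answered correctly; a higher proportion indicates an easier item.

```python
def item_difficulty(responses):
    """Return the proportion of correct answers for each item.

    responses: one row per student, each row a list of 0/1 item scores.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_students
            for i in range(n_items)]

# Hypothetical pilot data: 5 students, 3 items (1 = correct, 0 = incorrect).
pilot = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
]

p_values = item_difficulty(pilot)
print(p_values)
```

Here the first item is the easiest (four of five students answered it correctly) and the third is the hardest (one of five).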

But evaluating the difficulty levels of the test items is only part of what is involved in constructing good tests. In order to make the examinations and assessments truly fair to the students, and useful for the education system at large, the test development process must ensure development and administration of valid, reliable, and hence comparable tests. This requires assuring quality at each step in the testing cycle, including test development, administration, scoring, analysis of data, development of results, and their dissemination.

Therefore, it is imperative that the testing instruments [at all levels] undergo technical analyses for robustness, which include establishing their validity and reliability, as well as their ability to discriminate between students at various ability levels. Reporting raw scores, as is the current practice, does not allow fair comparisons between students across test administrations, i.e., across years. Hence scaled scores need to be computed so that scores on different editions of the same test may be compared.
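As one example of such a technical analysis, reliability is often summarised with Cronbach's alpha, an index of internal consistency. The sketch below is illustrative only: the data are invented, the function name is hypothetical, and alpha is just one of several reliability measures an examining body might use.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for a students-by-items matrix of item scores."""
    k = len(responses[0])  # number of items
    # Variance of each item across students.
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    # Variance of the students' total scores.
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 4 students, 3 items (1 = correct, 0 = incorrect).
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

alpha = cronbach_alpha(scores)
print(alpha)
```

Values closer to 1 suggest that the items measure the same underlying ability consistently; what counts as acceptable depends on the stakes attached to the test.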

The field of test item development, i.e., psychometrics, has developed and refined several techniques to ensure that tests meet the above-mentioned criteria. The technology required to use the new techniques of test development is relatively easy to acquire. All it requires is for the responsible agencies, such as the boards of intermediate and secondary education and other examining bodies, to ensure that competent and qualified people perform test development, administration, and analysis. And this is where we come across the real crunch.

We do not have enough people qualified in the technical skills needed for the development of standardised tests. Our higher education institutions do not offer relevant courses in quantitative methods in assessment. As a result, we are perennially dependent on the unsustainable practice of relying on foreign consultants to help meet the demand for test development at the primary level. As for assessment at the secondary level, no board in Pakistan, except perhaps one in the private sector, has even expressed an ambition to reform its testing practices.

The idea of standardised testing entered Pakistan’s education discourse in the late 1990s with the establishment of the National Education Assessment System (NEAS). NEAS and its provincial counterparts did train some individuals in the use of new techniques, especially the use of Item Response Theory (IRT), to implement their assessment programmes. But NEAS was essentially meant to conduct sample-based, small-scale, and low-stakes assessments. Because of the small scale of the enterprise, the human resources developed on the job through NEAS and provincial assessments are few and far between.

More importantly, we must look around and recognise that nations that have made progress in education could not have made it without well-developed assessment and examination systems. Examinations regulate and control what goes on in the classrooms by way of teaching and learning. Everything else being equal, good examinations can have a positive impact on classroom practices.

But we will not make progress in this regard without meeting, at a minimum, two conditions. First, the boards of intermediate and secondary education must commit to improving their current practices and aligning them with the best practices made possible by the availability of technology and our advanced knowledge of testing techniques.

Second, our higher education institutions must create degree-level courses in psychometrics, staff them with good faculty, and do their job in meeting the demand for human resources in the field of testing.

Students appearing in intermediate, secondary, middle, and primary examinations must get a fair deal. They deserve to be judged and certified fairly. The advantages of a good examination system are not confined to students alone. Teacher educators, policymakers, and education researchers can also learn from it and do their bit to improve it further.