
Double Marking versus Monitoring Examinations

Author: Roger White


Journal Title: PRS-LTSN Journal


Volume: 1

Number: 1

Start page: 52

End page: 60




In some subjects there is either, as in law, an identifiable body of facts which students need to master to obtain a degree, or, as in mathematics, it is relatively beyond dispute whether or not a student has succeeded in proving what is required to be proved. In such subjects it ought to be relatively straightforward to devise an objective and fair method of assessment for deciding that a student has attained the standard required for a degree of any given class. In other subjects, however, of which philosophy is perhaps the most obvious example, what constitutes excellence in the subject is far more a matter of judgment, and even of controversy. Hence there is a great need to scrutinise the methods whereby we seek to ensure objectivity in examinations.

Traditionally the preferred method of assessment was ‘double marking’: every script is marked by two internal examiners, who then meet to discuss the marks they have independently arrived at and to agree a mark, which is then submitted to the scrutiny of an external examiner. Latterly, a different method of assessment has grown up, ‘monitoring’, in which a first examiner submits a set of marks to a second examiner, who samples a sufficient number of the first examiner’s marked scripts to form a judgment of how far the two examiners agree. The monitor does not attempt to agree marks with the first examiner but writes a brief report on the first examiner’s marking. This report is then discussed and, if necessary, the first examiner’s marks are systematically adjusted. For example, if the monitor forms the opinion that the examiner has been too harsh, and succeeds in persuading the examiner that this is so, then the mark of every student may be raised somewhat. The scripts are then submitted, together with the monitor’s report, to the scrutiny of the external examiner.

There is no doubt that the system of monitoring has grown up in large part under the pressure of the increased workload created by such factors as worsening staff/student ratios and the increased number of heads under which students are assessed following modularisation. Since it has frequently been adopted for reasons of expediency, there is a widespread feeling that this system is inferior to double marking and has only been adopted out of necessity.

However, I believe that the widespread opinion that double marking is the superior way of examining is an irrational prejudice, based simply on the vague idea that two heads must be better than one, and that in most respects monitoring, properly done, is in fact more likely to yield an objectively just result. It is therefore worthwhile spelling out the reasons why this is so. I advocated that we switch away from double marking long before the pressure of work led our own department to do so, as a result of studies of what actually occurs when people double mark, conducted some years ago by my colleague Timothy Potts and myself. Although those studies were conducted a long time ago, what we discovered then is still relevant to the current situation.

I initially became worried about the objectivity and rationality of our examination procedure shortly after I came to Leeds, as a result of a few cases where what had happened seemed difficult to reconcile with the idea that justice was done to the groups of students taking those particular courses. (I will not identify the examiners involved: all are now retired.) At that time double marking was of course a sacred cow, and the department was small enough to cope easily with the workload involved. (The externals also read every script, something which is now completely impractical; but that was the only feature of the system that could protect the examination from becoming a farce in the case that I shall mention. That safeguard has long since vanished.) The case that worried me most represents, in an extreme form, a situation that, even if only for a minority of batches of scripts, recurs with sufficient frequency to be a problem for any system of examination. Here the two examiners had produced marks that bore no discernible relation to one another at all: one examiner would give a 1st to a script that the other examiner saw as a low 2/2 (or even, in one case, a 3rd), and vice versa. As a result I did an informal study of the examinations for all the courses for that year. The results were sufficiently disturbing for me to raise the issue of the objectivity of our examination procedure. This was followed up by Timothy Potts who, following my lead, did a complete statistical breakdown of the marks assigned in examinations for the previous three years, comparing such things as the arithmetical mean mark, the standard deviation and the rank ordering produced by each pair of examiners for each course. What follows are some of the conclusions I arrived at as a result of the studies we undertook between us.
(I am of course relying here on memory from a long time ago—there may be the odd mistake in what I say, but I am confident that for the most part my memory is accurate.)
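The kind of breakdown described above—comparing each pair of examiners' mean marks, standard deviations and rank orderings for a batch of scripts—is easy to sketch. The following is a minimal illustration in Python; the marks are invented for the purpose, not the actual Leeds data:

```python
from statistics import mean, stdev

# Hypothetical marks from two examiners on the same batch of eight
# scripts (invented for illustration only).
examiner_a = [72, 68, 65, 62, 58, 55, 48, 40]   # widely spread
examiner_b = [60, 62, 58, 61, 57, 59, 55, 52]   # bunched up

def rank_order(marks):
    """Rank each script by its mark, highest first (1 = best)."""
    order = sorted(range(len(marks)), key=lambda i: -marks[i])
    ranks = [0] * len(marks)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

for name, marks in (("A", examiner_a), ("B", examiner_b)):
    print(f"Examiner {name}: mean={mean(marks):.1f}, "
          f"sd={stdev(marks):.1f}, ranks={rank_order(marks)}")
```

Here the two examiners' average marks are almost identical, yet examiner A spreads the batch across nearly two classes while examiner B keeps every script within a few marks of 58—precisely the sort of systematic difference that comparing means alone would never reveal.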

  1. There was far more variation in the distributions of marks produced by different examiners and on different batches of scripts than I would have anticipated. Some examiners produced marks that fell in normal distribution curves, but there were also a large number of ‘trimodal’ distributions—every script is seen as good, bad or indifferent—and quite a few bimodal distributions. The average marks assigned by the severest examiner and by the most generous examiner were almost a class apart. And, finally, some examiners had their marks widely spread out, with a large standard deviation, while others had their marks bunched up—in some cases with the highest and lowest marks both within the same class. These differences already put great strain on the idea of the two examiners of a particular script meeting to discuss and arrive at an agreed mark: if, as we shall see, the result of such discussions is usually to average the two original marks, it is hard to interpret the significance of averaging the marks of a harsh, bimodal examiner with a large standard deviation with those of a generous marker whose marks fall in a narrow normal distribution.
  2. The sets of marks with very few exceptions fell into one of three types. There were those cases where the two examiners were producing virtually identical marks throughout a batch of scripts, only very occasionally disagreeing significantly. There were those cases where the two examiners were producing different marks, but in a systematic way—the most common case being where one examiner was simply more generous than the other. The oddest case here was one where the two examiners had virtually identical rank orderings, but one had assigned marks in a normal distribution curve, whereas the other had produced a bimodal distribution. In all these cases, it is reasonable to suppose that the two examiners’ opinions of the scripts were on the whole the same, but differed in the way that they thought that opinion should best be translated into a mark. The third type was one where it was impossible to make much sense of the comparison between the examiners’ marks, leading to the conclusion that the two examiners were not seeing eye to eye at all—were, e.g., looking for quite different qualities in a good script. (In the case that originally prompted my attention, Timothy Potts discovered a statistically significant negative correlation between the marks awarded by the two examiners.) In a way it seems obvious that these should be the three types; what is not so obvious is the extent to which every set of marks fell recognisably into one of these three types, with few intermediate cases, and also that in each case the pattern (or non-pattern) would almost invariably be preserved throughout a batch of scripts. Of these three types, the second, where examiners diverged, but diverged systematically, was the most common, followed by the first. The third was less frequent, but there were sufficient cases to indicate that there was a problem worth thinking about.
  3. Double marking consists in two examiners both marking a set of scripts, and then meeting so as to arrive at an agreed mark wherever there is a disagreement. In theory this mark is not arrived at by simple averaging but by a discussion that finally resolves the disagreement. In practice, looking at the marks awarded by individual examiners suggests that something very different occurs (even if the examiners think the theory describes what they are doing). What we find in the great majority of cases is that the examiners have not simply averaged: they have had a discussion, and then they have simply averaged. That is, in far and away the majority of cases the agreed mark is the average of the two originals. There is another pattern which sometimes occurs—it is impossible to tell what lies behind it in each case: one examiner will systematically defer to the other, so that the agreed marks are virtually identical to one of the two examiners’ original sets of marks. The cases where the agreed marks for particular scripts diverge from both of these patterns amount to a small handful.

Against this background, two questions arise: ‘How well does double marking do as a method for arriving at a just mark on scripts?’ and ‘Is there reason to suppose that monitoring fares better?’ I take monitoring to be the practice we have adopted at Leeds, where one examiner marks an entire batch of scripts, and a second marker then marks a significant sample, large enough to judge how well the first examiner has done their job. Departmental policy says that 10% of scripts, plus 1sts and fails, should be looked at; I have always interpreted this as a minimum, and where one is monitoring a small batch of scripts (e.g. a module with 20 or fewer scripts) it would be clearly inadequate just to look at two—what is required is to look at enough scripts to get a proper picture of what the first examiner has done. The two examiners then meet to discuss how, if at all, it would be appropriate to modify the first examiner’s marks. Departmental policy is that monitoring is a monitoring of the whole examination, and not the provision of second marks for individual scripts. That is, the result of monitoring should not be the adjustment of individual marks, but a suggested systematic modification of all the first examiner’s marks. The only individual marks that are adjusted are perhaps those at the very top or the very bottom, where it is a question of how a very good or very bad script is to be marked. Otherwise, adjusting individual marks is unfair either on those students whose scripts happen to have been selected for monitoring, or on those whose scripts have not. (The only exception I would, perhaps somewhat inconsistently, make to this rule is where the divergence between the examiner and the monitor is explained not by a difference in judgment between the two, but by a definite, indisputable oversight on the part of the examiner: for instance where the examiner overlooks a gross error of fact on the part of the candidate.)
So, how do monitoring and double marking fare for each of the three types of sets of mark I identified in 2. above?

  1. The first type of batch of scripts—where the two examiners turn out to be in substantial agreement throughout—is the most straightforward, and the one where double marking and monitoring both work equally well. The result of the process is simply that the second examiner/monitor endorses what the first examiner has done.[1] The only difference between the two is that monitoring arrives at this result more quickly.
  2. It is with the second type of sets of marks that the advantages of monitoring begin to emerge. The point is that double marking is ill equipped to detect systematic differences in marking practice. The two examiners concentrate on the scripts one by one, and so systematic differences running through the whole batch will not be readily apparent—this is particularly true of the first few scripts that they discuss: it will only be later on in their discussions that patterns in their disagreement become apparent, if at all. The effect is that the two examiners will only too frequently fail to appreciate where the real differences in judgment between them lie. Disagreements can even be masked completely: for instance, if examiner A is a generous examiner but examiner B somewhat mean, A and B might both award a mark of 62, but for A this signifies the opinion that this is a somewhat average script, while for B it signifies that it is one of the best two or three of this particular batch. Here the examiners are almost bound to look at the fact that they have both given 62 and think that they can pass on without further discussion, even if this script represents their biggest single disagreement. On the other hand, they may well spend a long time discussing a script which A has marked at 50 but B at 40, asking which mark is appropriate for this particular script, whereas what is at stake is not a disagreement as to the quality of this particular script but a general difference of opinion as to how to mark a weak script. By contrast, the primary task of a monitor is not to mull over particular disagreements, but to look for a pattern in the disagreements that occur, and then to discuss with the first examiner whether it is appropriate to adjust the whole batch of marks originally given. That is to say, the discussions between examiner and monitor are focussed precisely where they should be.
  3. It is the third type of set of marks that creates the greatest difficulty for any system of marking. Where there is no meeting of minds between two examiners it is frequently difficult to know how to proceed. Double marking provides no clear-cut, rational decision procedure for such a case, and, in practice, looking at what was actually done suggests that examiners simply ‘split the difference’. But here the significance of a mark that is the average of two marks arrived at in very different ways is difficult to understand. The effect of such averaging for this type is a massive regression to the mean—double marking will always produce some regression to the mean, but here it becomes completely pernicious: in the case I originally looked at, virtually all the scripts ended up bunched around the 2/1–2/2 borderline. If we assume that one of the two examiners was actually thinking along the right lines, this inevitably means that the students he had rightly seen as 1st class were deprived of their 1st on that script, while the ones he had rightly seen as weak were allowed to get away with murder. Unless one can give a good reason to suppose that, when two examiners diverge wildly and unpredictably, the average of the two marks they award is likely to be the right one, double marking copes with this case very badly. In fact double marking gives no rationally defensible decision procedure for this case: if we assume that both examiners arrived at their original marks conscientiously, then they are marking in very different ways, or looking for very different qualities in a script. A brief discussion will at most reveal that fact, but will not indicate a way to resolve the dispute, leaving little alternative but to average. At first sight it looks as though monitoring is in the same awkward position.
The major advantage is, however, that the two examiners are not required to agree marks on individual scripts, and so not compelled artificially to concoct an ‘agreed’ mark where there has been no real meeting of minds. The function of the monitor is simply to produce a report on the first examiner’s work: and in this case the report could even take the form ‘I could not make head or tail of the marks examiner A was giving’. This at least flushes the situation out into the open. It does not remedy the situation, but at least alerts everyone to the need for a remedy. This will usually take the form of an appeal to a third party: at its simplest, a request to the external examiner to pay particular attention to this particular batch of scripts. In two cases a few years ago, where there was gross disagreement between the two examiners, a third examiner was in effect appointed: in one case, the external marked every script and his mark was taken as a final adjudication, in the other I was asked to come in and my marks were the ones sent to the external as the internal examiners’ marks. Even if we only resort to such measures occasionally, they demonstrate the kind of remedies available under the monitoring system of examination.
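The ‘masked disagreement’ described under the second type above—where a generous and a mean examiner both award 62, yet mean quite different things by it—can be made vivid by standardising each examiner's marks against their own distribution. A minimal sketch in Python, again with invented marks:

```python
from statistics import mean, stdev

# Invented marks for a generous examiner A and a mean examiner B.
# Both award script 0 a raw mark of 62.
a_marks = [62, 66, 64, 68, 63, 65]   # for A, 62 is the *lowest* mark
b_marks = [62, 50, 54, 48, 52, 45]   # for B, 62 is the *highest* mark

def z_scores(marks):
    """Express each mark as standard deviations from that examiner's mean."""
    m, s = mean(marks), stdev(marks)
    return [(x - m) / s for x in marks]

# The identical raw mark of 62 conceals the biggest single disagreement:
print(f"Script 0: A's z = {z_scores(a_marks)[0]:+.2f}, "
      f"B's z = {z_scores(b_marks)[0]:+.2f}")
```

Relative to his own practice, A thinks script 0 the weakest of the batch and B thinks it the strongest—yet a script-by-script comparison of raw marks would pass over it without discussion.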

Some Conclusions

The main conclusion I draw from the preceding is that the system of double marking, despite its reputation, is a deeply flawed system. The idea that it is the best system of examination is a myth, which is only sustained because it is not subjected to scrutiny—including the kind of empirical scrutiny which Timothy Potts and I subjected it to. The following defects emerge from the earlier discussion:

Surveying the ‘agreed’ marks actually given by two examiners suggests that, whatever we think we are doing, most of the time the upshot of the discussion between the two examiners is to produce a mark which is the average of their two original marks. If the examiners do in fact disagree, either in the qualities that they are looking for in a good script or in the way that they translate their opinion of scripts into numbers, it is hard to believe that such averaged marks have much real meaning. (The most that can be said is that, if either of the two original marks was right, the average ‘won’t be too far out’—I suspect it is that thought which makes averaging attractive. That thought, however, may well be depriving a student of a 1st class mark if one of the two examiners has seriously underestimated the script.)

The effect of such averaging is a large-scale regression to the mean. This is perhaps both the most obvious defect and the most vicious aspect of the system of double marking. When, as now, we are assessing students under a large number of heads, and then arriving at a class by averaging, the threat of regression to the mean is already real enough—even now we have a system where it is remarkably easy to get a low 2/1, but difficult to get a 1st or a 3rd. If we were to engage in double marking with our present numbers of students and under a modular system, we would have a system of examining which made it impossible to differentiate students, apart from the very few who swam against the stream by being exceptionally good or bad in everyone’s opinion.
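The regression to the mean that averaging produces can be shown directly: when two examiners' marks on the same scripts are essentially uncorrelated (the third type of case), the averaged marks have a spread roughly 1/√2 of either examiner's. A minimal simulation in Python, with purely illustrative numbers:

```python
import random
from statistics import stdev

random.seed(0)

# Simulate the third type of case: two examiners whose marks on the
# same 200 scripts are uncorrelated, each marking around a mean of 60
# with a standard deviation of 8 (illustrative values only).
n = 200
a = [random.gauss(60, 8) for _ in range(n)]
b = [random.gauss(60, 8) for _ in range(n)]

# The 'agreed' mark is simply the average of the two originals.
agreed = [(x + y) / 2 for x, y in zip(a, b)]

print(f"sd of examiner A's marks: {stdev(a):.1f}")
print(f"sd of agreed averages:    {stdev(agreed):.1f}")  # noticeably smaller
```

The averaged marks bunch up around the overall mean—the simulated analogue of a whole batch of scripts ending up crowded at the 2/1–2/2 borderline.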

The system of double marking is not designed readily to detect when differences between the marks awarded by two examiners for a particular script are the effect of systematic differences of marking practice between the two examiners rather than disagreements about that particular script. Such systematic differences should be dealt with systematically, and not somewhat erratically on a script-by-script basis. Systematic differences between the marking practices of two examiners, which will affect a whole batch of scripts and can have large effects on individual marks, are probably far more significant than particular disagreements in judgment, and yet are completely neglected by double marking.

The system of double marking does not have built into it a rational decision procedure for what should happen when there is no real meeting of minds between the two examiners. Looking at the results produced by Timothy’s studies suggested that examiners were typically prone in such cases to produce an average mark as the agreed mark, even though in these cases such average marks are almost completely meaningless.

The defects noted above were, at the time that Timothy Potts and I made our studies, to some extent compensated for by the role of the external examiner. At that time we were a much smaller department marking a much smaller number of courses, and the external examiners did read and mark every script, so that the vagaries of the internal examiners could be, and frequently were, overridden. However, the time when that was possible is long past, and the pressure of the examination load has since increased in ways that would exacerbate the problems we detected. (There is now, for example, much less time for a full discussion between examiners, increasing the temptation simply to average marks.)

The system of monitoring is designed in such a way that it avoids all of the defects I have specified: examiners do not agree marks on each individual script and hence do not average marks; as a consequence the system has no tendency whatever to produce a regression to the mean; the task of the monitor is precisely to detect systematic differences of opinion, which then allow one to adjust a whole set of marks systematically; and, finally, the fact that a monitor’s primary task is simply to make a report on the first examiner’s work means that a radical difference between the two can be brought into the open and then dealt with.

The only indisputable advantage of double marking is that there can occur cases where the first examiner makes an error of judgment on a particular script which is then picked up by the second, and the first examiner is persuaded of the error. However, looking at the extent to which practice is dominated by simply averaging marks suggests that this situation may occur less frequently than we think, and given that no examination system is ever going to be perfect, I believe there is an overwhelming case for saying that monitoring is on balance the vastly superior system, quite disregarding questions of the workload imposed on examiners by the two systems.

Endnotes

  1. In Leeds, we have also adopted the practice that the first examiner should supply the monitor with a statement of the criteria they have employed in marking, which facilitates the interpretation of the set of marks. If, as is usually the case, the first examiner is responsible for teaching the course, these criteria will also be known in advance by the students, giving them due warning of what is expected of them.




This page was originally on the website of The Subject Centre for Philosophical and Religious Studies. It was transferred here following the closure of the Subject Centre at the end of 2011.

 
