On the distribution of midterm marks

I prepare a midterm, the students write it, and we mark it. The midterm average is 75%, with a standard deviation of 15. A histogram shows a nice bell curve. So can we declare victory and go home now? Everyone's happy, right?

Not so fast.

When we make up the marking scheme, there's no mention of a target distribution, no examination of the effect of a possible marking scheme on that distribution. We don't predict what an average student might get on each question. We say nothing about what we considered an acceptable or a good level of performance. There is typically no attempt to tie performance on the exam to the goals of the course.

I'm not sure we could have done any of that if we'd tried. It's true that circumstances often make it less necessary to do any of it. In CS 251, the tenor of the current version of the course was set with our midterm in the maiden offering, in fall 2000, and it's clear that other instructors followed our model. So historical precedent happens to coincide with the way we are doing things in that course. But that avoids the question of whether we got it right in the first place, and how we could tell. It also avoids the question of whether, if we had set the bar higher in that first offering, we would have been justified in doing so, or what we should have done if other 251 instructors had given considerably easier midterms -- that is, the extent to which we are constrained by precedent.

We have tighter curricular control at UW than many institutions do, but it is still not very tight. The main description of most courses is the single-page handbook description, now available on the School of Computer Science Web page. These descriptions are written solely in terms of topics. They do not discuss skills, or depth of coverage apart from a listing of the hours devoted to each set of topics. Unless a course has a completely developed set of lectures, notes, and assignments, as those responsible for CS 133 maintain, this is less a formal definition of curriculum than a vague guideline. And the extent to which it is taken seriously depends on the instructor, who may well teach based on their own experience as an undergraduate, or on a set of lecture notes obtained from a colleague (which may be flawed), or on their own interpretation of the skeletal handbook description.

But there is this assumption that the average of midterms, finals, and final grades will fall in the range 65-75, with not too many people failing. How does this hold up in the face of these variations? If the numbers fall above this range, students are not likely to complain, though too high a midterm mark makes them fear that the final will be a "killer" designed to bring the marks back down to the acceptable range. If the numbers fall below this range, students will claim the midterm was too hard and the marks should be adjusted. On what basis?

It wouldn't have taken a large change to lower the marks from the hypothetical distribution I mentioned in the first paragraph. We typically make up more questions than we can use, and select a final subset. Many of these questions describe a situation, and we can do so in more or less detail to leave students with less or more to do (respectively), to adjust their difficulty. A few words here, a few words there, just one more small question added at the end, that part d) is so clever we just have to ask it -- and suddenly the average is fifteen marks lower.

And then what?

When I taught 341 (Algorithms) in fall 2002, I gave a midterm without many "turn-the-crank" questions on it. They weren't particularly creative questions; many of them were fairly standard, and could be turned up in Web searches of midterms in similar courses at other institutions. Because I hadn't set any "run this algorithm on this example" questions, which are used to pull averages up, I expected a low average, something in the low 60s. This, I hoped, would be a spur to working harder, and we would announce that a good final performance would erase a poor midterm performance. This is more or less the way it had worked the previous time I had taught the course; no one failed that offering.

But that fall, the average was 48%, and at least a third of the class made each of two particularly egregious errors: one was of the form (2^a)(2^b) = 2^(ab) [instead of 2^(a+b)], and one was of the form "Given any two vertices v and w in a tree, one is an ancestor of the other" [think of two siblings]. I didn't think third-year CS students, given the math courses they had to take and pass, should make mistakes like that. The usual practice with that low an average is to add fifteen percent to everyone's grade, or to apply some other crude adjustment. I later spoke to the person who taught this set of students CS 240 (one of our more popular instructors); the same thing had happened to him, and he added the fifteen percent.
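For the record, both errors are easy to check mechanically; here is a throwaway sketch (entirely my own illustration, not anything from the course materials):

```python
# The correct identity is 2^a * 2^b = 2^(a+b), not 2^(ab).
for a in range(1, 8):
    for b in range(1, 8):
        assert 2**a * 2**b == 2**(a + b)

# The erroneous version coincides with the truth only when a + b == a * b
# (e.g. a == b == 2), which is perhaps why it survives a casual spot check.
assert 2**2 * 2**2 == 2**(2 * 2)
assert 2**3 * 2**4 != 2**(3 * 4)

# The tree claim fails for siblings: with root r and children v and w,
# neither v nor w is an ancestor of the other.
parent = {"v": "r", "w": "r"}  # a three-node tree, as a child-to-parent map

def is_ancestor(a, b):
    """True if a is a proper ancestor of b in the tree above."""
    while b in parent:
        b = parent[b]
        if a == b:
            return True
    return False

assert not is_ancestor("v", "w") and not is_ancestor("w", "v")
```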

But I didn't. (Rather, we didn't; my co-instructor agreed.) We felt that putting the marks up, in the face of such basic errors, would be like scolding a child for not finishing their dinner, but then giving them dessert anyway. That's no way to get them to finish their dinner in future. Why would they bother working towards the final exam if they knew we were just going to bump them up in the event they made similar elementary mistakes? So I told them that we would adjust the marks only after the final exam.

Perhaps it was this sort of comparison that led an angry student, on the course evaluations, to complain that I treated them like children. (That was one of the milder charges in a seriously vicious set of comments. If you want to know what people are capable of under the cloak of anonymity, come to my office and look at them. I've always considered my course evaluations to be public property and open to all.) Twenty-odd people took late withdrawal, and those who stayed then started arguing that the final average should be even higher, because obviously all the low-end students had left. (Even if that were true, no one ever says that the first-year regular math sections should be adjusted downward because of the absence of those taking advanced sections.)

In the end, I did adjust the marks after the final exam, allowing a good exam performance to override the midterm -- and the exam marks were within the usual parameters. Mind you, it was too late for the evaluations.

So maybe that midterm exam was too hard. What's my point? This is my point. What if the average on that exam had been 73? Would it still have been too hard? What if the average on a typical 251 exam comes out to 48? Would it have been too hard? I mean with the same exams. Why should the marks be different if the exams are similar? Maybe some sets of students are weaker than others. Maybe high-school labour relations were bad in their graduating year, maybe they had particularly lousy or particularly good instructors in first year. Classes are not a statistically pure sampling of the population; different programs yield different schedules, and some programs attract or admit better students.

If we require that mark distributions have a certain nice form, then the only way to ensure that, given these possible variations, is to require all marks to be adjusted to that form. But then we cannot guarantee that the marks mean anything about what the student has learned or can do. They only specify a position within a class, and when you average them among different classes with different populations, the result loses meaning. But if marks are tied to some absolute standard of knowledge and performance, then we'd have to ensure consistency in both instruction and test-setting, and no one is prepared to go to the lengths that the creators of standardized tests do to try to ensure that (especially given their dubious record at it).
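The "adjust everything to that form" option is purely mechanical, which is part of what makes it unsatisfying. A minimal sketch of a linear rescaling to a target mean and standard deviation (the target values and the sample marks are made up for illustration, not any actual policy):

```python
from statistics import mean, pstdev

def rescale(marks, target_mean=70.0, target_sd=12.0):
    """Linearly map marks so the class mean and SD match the targets."""
    m, s = mean(marks), pstdev(marks)
    if s == 0:  # everyone identical; just shift the class to the target mean
        return [target_mean for _ in marks]
    return [target_mean + (x - m) * target_sd / s for x in marks]

raw = [35, 42, 48, 50, 55, 61, 74]  # hypothetical midterm marks
adjusted = rescale(raw)
```

The transform preserves every student's rank and relative position, which is exactly the objection: the adjusted mark locates a student within this class and says nothing absolute about what they know or can do.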

I remember seeing a film called Lola, by the German director Rainer Werner Fassbinder, with a friend. In it, a zealous, idealistic civic reformer takes on a city establishment run by a mob. At the end, the reformer has sold out, marrying the mistress of the mob leader, who is still in control (and she is still his mistress). I asked my friend what he thought of the movie. "I wanted a happy ending," he complained. "Everyone in the film was happy at the end," I pointed out. "But *I'm* not happy!" he insisted.

I am in my friend's shoes now. If the average is high or medium, everyone will be happy. If the average is low, I will announce an adjustment, because it is clear that I can present no argument against this that will be accepted, and once I make the adjustment, everyone will be happy.

But I won't be happy. --PR

(Adapted from a blog posting made October 26, 2003.)