Eliminate Final Exams
Parrhesia makes the argument that a single final exam should be worth 100% of a student’s grade. Dynomight responds to that argument in an essay entitled “Teaching is a slow process of becoming everything you hate”.
Parrhesia is wrong. Dynomight’s meandering response raises some reasonable concerns, but also whiffs on the original essay’s most problematic arguments. And its apologetic tone makes it sound like Parrhesia should be right—but then reality intervened.
So here’s a more full-throated rebuttal to the idea that it is accurate, ethical, conducive to learning, and less stressful to have a student’s entire grade based on a single repeatable assessment.
Let’s consider accurate, ethical, and less stressful first. Then we’ll address learning.
It’s questionable to begin with whether a single assessment can ever accurately measure student understanding. Given only a few hours, determining whether someone has developed a comprehensive understanding of weeks of material is difficult to impossible.
To be effective, most summary assessments rely on the element of surprise. Because students aren’t sure precisely what will appear on the exam, they must engage in a broader review of the course material. Unfortunately, when you make an exam repeatable, students will engage in targeted preparation based on their knowledge of the exam contents, further reducing the effectiveness of the exam as a comprehensive assessment.
Perhaps we now expect instructors to generate an unlimited stream of fresh assessments to allow multiple retakes without allowing overfitting. First, this is fairly unrealistic—good exams are hard to write! But now you’re further sacrificing accuracy, since different exams will naturally end up testing different subsets of the material.
Now think about how a student prepares for that comprehensive exam. Confronted with a single summary assessment, the logical approach is a study strategy colloquially known as cramming: loading as much information into short-term memory as possible immediately before the exam, maximizing performance without the harder and more continuous work required to internalize the content deeply. Because remembering things for a long time takes more work than remembering them for a short time, cramming is the optimal study strategy in this situation. It’s also what students tend to do, as evidenced by centuries of observation.
Making the exam repeatable makes the situation worse. As pointed out above, repeatable exams are less accurate and prone to overfitting. But they also encourage students to engage in what we might call bar-clearing behavior. Study a bit, then take the exam. Didn’t do well enough? Study a bit more, focus on what you missed last time, and then repeat the exam. Continue until you achieve the desired result. At the end of this process, do we really know how much the student has learned? And, of course, the entire process is even worse if the exam contents don’t change, since iterating allows students to focus their cramming on questions that they know appear on the assessment, or on things that they did poorly on last time. From a learning perspective, targeted cramming is even more useless than regular cramming.
There are high-stakes exams that are repeatable: for example, the Bar Exam taken by prospective lawyers. However, these exams impose significant penalties on test takers who must repeat them multiple times, including delays in entering the workforce, reputational damage, and monetary fees, and those penalties serve to deter bar-clearing strategies. None of this is really replicable in a university setting.
Single exams are inherently inaccurate and unrepresentative single data points. Making them repeatable makes the situation worse. Cramming for exams is stressful. Incentivizing cramming is unethical, particularly given the lengths students have been known to go to pack their brains before the test: caffeine at best, less legal drugs at worst. Exams that reward cramming incentivize an ineffective study strategy while also rewarding those students who are willing and able to engage in unhealthy behavior, such as marathon all-night pre-exam study sessions.
And all this so that students can temporarily demonstrate knowledge that they will immediately forget.
Oddly enough, Parrhesia’s essay seems to acknowledge this, concluding:
What matters is the knowledge the student has at the end of the course, rather than the knowledge they had at one point but forgot.
But there’s nothing magical about the end of the semester: It’s just another “one point” in time. What actually matters is the knowledge that students retain, not just what they happen to know during one three-hour interval at the end of the course. And a single 100% exam, repeated or not, is very likely to encourage students to behave in ways that allow them to feign knowledge at one point in time—the end of the course—but then immediately forget what they temporarily learned. The original essay rebuts itself:
If you’ve forgotten the content ~~before you have even finished~~ immediately after you finish the course, what good was the course?
If we want students to actually learn the material, we need to support continuous engagement with course content and concepts. That means multiple assessments(1).
One way to frame this goal is to ask: How would I want my students to study? I don’t think that educators who give a 100% final exam intend to encourage cramming or bar-clearing behavior, even if that’s what will inevitably happen. They’d probably say that students should study a bit each day, or at least every few days, over the course of the semester, and regularly check their learning and understanding through self-assessment. Parrhesia’s essay includes nods to this type of continuous learning support. But of course, this isn’t what happens when you give a single 100% assessment. Cramming happens.
At this point many educators will resort to the argument that, well, students just need to learn how to self-regulate! It’s this kind of exasperation that seems to underlie the tone of responses like Dynomight’s. A single 100% final exam would be great. But these lousy students can’t self-regulate, and so we can’t have nice things.
Amusingly, the same educators making this argument are frequently the same ones you find scrambling at the last minute to meet paper and grant deadlines, complete required university trainings, submit promotional materials, prepare for meetings, and so on. Procrastination is our human reality, and pretty much every functional workplace finds structural approaches to help people work steadily, incrementally, and in healthy ways toward long-term goals—weekly check-ins, daily stand-up meetings, milestones and sprints, and all kinds of other workspace-specific variants. No sane organization would give a junior employee a big project and say: “Good luck, see you in four months!” So why do we expect this from students? Effectively supporting student learning requires a lot more structure than what’s provided by a single final exam.
It’s also worth noting that university instructors are doubly unrepresentative here, in that they (1) usually succeeded in college at least in part due to unusual abilities to self-regulate, and (2) are employed in the knowledge occupation with an unusually small amount of workplace structure. There’s a branching point into another essay here, but for now let’s just say that these factors may cause faculty to have unrealistic expectations about the degree to which their students can learn in unstructured environments.
If we want to help students truly learn something, we need to support repeated engagement with the material over a long period of time. In practice, this means that effective courses use multiple assessments rather than a single one.
Potentially a lot more. Two assessments—the traditional midterm and final—are better than one, but still not enough to discourage cramming or support a more representative sampling of student understanding. What’s better than two assessments? Three. Four is better than three, and five better still. You see the pattern.
Obviously there is a limit here, but it’s probably a large enough number that most courses don’t come close to reaching it. In my CS1 course we give 15 weekly 1-hour quizzes together worth 40% of a student’s grade.
This has enormous benefits. Rather than 3 hours to assess all the course content, we have 15 hours, allowing us to test more concepts and test central ones repeatedly in a form of spaced repetition. Because each assessment is worth a much smaller percentage of a student’s grade, they are less stressful, and because we have more time, each individual assessment can be more relaxed. Rather than earning 100% of their grade in 3 hours—33% per hour!—they earn 40% over 15 hours, or ~3% per hour. A temporary brain freeze or bit of forgetfulness at the wrong moment is much less likely to ruin their entire grade. And students are less likely to walk away from a single bad assessment with the misimpression that they don’t understand anything. To make the sampling of their knowledge more fair, we can and do drop low scores.
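The per-hour arithmetic and the drop-the-lowest policy above are easy to sketch in a few lines of code. The 40%/15-quiz numbers come from this essay; the number of dropped scores and the helper name are illustrative assumptions, not the actual course policy:

```python
# Stakes per assessed hour: one 3-hour final worth 100% of the grade
# versus fifteen 1-hour quizzes together worth 40% of the grade.
final_stakes_per_hour = 100 / 3   # ~33.3 grade points per hour
quiz_stakes_per_hour = 40 / 15    # ~2.7 grade points per hour

def quiz_portion(scores, weight=40.0, drops=2):
    """Average quiz scores (each 0.0-1.0) after dropping the `drops`
    lowest, scaled to the quiz portion of the grade. Dropping exactly
    two scores is an assumption for illustration."""
    kept = sorted(scores)[drops:]
    return weight * sum(kept) / len(kept)

# One bad day (the 0.2) disappears entirely once low scores are dropped.
scores = [0.9] * 13 + [0.2, 0.85]
print(round(quiz_portion(scores), 1))  # 36.0 of a possible 40
```

The point of the sketch is the ratio: a momentary failure costs a few grade points at most, rather than a third of the course grade per hour.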
A much larger set of assessments also turns the quizzes into something that helps guide each student’s learning process. I tell students—if you do poorly on a quiz, that’s a sign that you need to adjust your study habits and overall approach to the material. Given that material in my course is quite cumulative, we now also offer a catch-up grading policy, allowing students to earn some points back on a previous quiz if they can display that they understand the material on a later assessment.
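A catch-up policy like the one just described can also be sketched simply. Here is one hedged interpretation, in which a stronger score on a later assessment of the same material earns back part of the gap on the earlier quiz; the 50% recovery rate and the function name are my assumptions, not the actual course parameters:

```python
def catch_up(earlier_score, later_score, recovery=0.5):
    """If a later quiz covering the same material shows improvement,
    return an adjusted earlier score that earns back a fraction of
    the points originally missed. Scores are on a 0.0-1.0 scale."""
    if later_score <= earlier_score:
        return earlier_score  # no penalty for doing worse later
    return earlier_score + recovery * (later_score - earlier_score)

# A student who struggled on quiz 3 (0.4) but showed mastery of the
# related material on quiz 5 (0.9) recovers half the gap on quiz 3.
print(catch_up(0.4, 0.9))
```

A design note: capping recovery below 100% keeps some incentive to prepare for each quiz the first time, while still rewarding students who close their gaps.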
To me there’s also something unethical about not giving students a chance to learn from their mistakes and improve, which is unlikely to happen when too few assessments are used. And yes, it’s true that you could provide students with the opportunity to turn in ungraded work during the semester to receive feedback, as Parrhesia suggests. But now we’re back to making wildly unrealistic claims about students’ abilities to self-regulate.
One way to think about all this is that frequent low-stakes assessment simply generates a lot more information about how well each student understands the course material: more data points, collected at multiple points in time, while giving students the chance to improve their study strategies on their journey to comprehension. Given an abundance of data, it’s a lot easier to design accurate, low-stress, and ethical policies mapping assessment results to a student’s grade. Fewer data points mean more student stress and less assessment accuracy, and as a result a less ethical course.
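The data-point argument can be made concrete with a small simulation. Suppose a student’s true mastery is fixed and each assessment is a noisy measurement of it; both numbers below are invented purely for illustration. Averaging fifteen measurements lands far closer to the truth than trusting one:

```python
import random

random.seed(0)
TRUE_SKILL = 0.8   # a hypothetical student's actual mastery
NOISE = 0.15       # assumed per-assessment measurement noise

def observe():
    """One noisy assessment of the student, clamped to [0, 1]."""
    return min(1.0, max(0.0, random.gauss(TRUE_SKILL, NOISE)))

def avg_error(n_assessments, trials=2000):
    """Average gap between the measured grade and true mastery when
    the grade is the mean of n_assessments noisy samples."""
    total = 0.0
    for _ in range(trials):
        grade = sum(observe() for _ in range(n_assessments)) / n_assessments
        total += abs(grade - TRUE_SKILL)
    return total / trials

print(avg_error(1), avg_error(15))
```

The measurement error shrinks roughly with the square root of the number of assessments, which is the statistical version of “more data points make grading more accurate.”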
Compared to a single assessment, multiple assessments have a lot of advantages. This is particularly true once the number of assessments gets fairly large. Frequent small assessments produce a more accurate picture of what a student actually knows, not just what they crammed. And frequent assessments are inherently lower-stakes and less stressful, while also enabling other ethical assessment policies such as dropping low scores.
But multiple assessments also have one big disadvantage—they’re more work to grade. The specter of more grading hangs uneasily over these conversations about ethical and equitable grading. Should we be surprised when faculty who don’t like to grade—read, all faculty—seem quite eager to agree that a single 100% final exam is the most ethical, most accurate, least stressful, and overall best approach to grading? The same approach that also just happens to be the least work for the instructor? Let’s at least be suspicious.
I’ll admit that I am very fortunate here: All of the assessments for my course are autograded. So I don’t want to be overly dismissive of grading burden. Allocating staff time is a zero-sum game. As the number of assessments goes up, the amount of staff time needed to grade also increases, reducing the amount of staff time available for other staff activities including direct student support. That said, I suspect that staff for a course with a single 100% final exam won’t have much direct student support to do during the semester anyway, since most students won’t do much of anything until a few days before the single assessment.
So there is a tradeoff here, particularly with human-graded assessments. But it’s hard to believe that a single 100% exam is ever the right point in this design space. And, given properly designed assessments, grading can provide students with useful and actionable feedback about how to improve their understanding of the material and their study and learning processes. Grading may be unpleasant, but feedback should be valuable.
A subject that probably deserves a lot more discussion is how to create assessments that are both easy to grade and still valuable for students. A great deal of the learning that is supported by frequent small assessments occurs due to student preparation—the act of studying—not due to the assessment itself. I tell my students: If I could figure out a way to get you to study without the quiz, we’d skip the quiz. I just haven’t figured out how to do that yet. Students still do expect the quiz itself to be fair and accurately marked. But there’s probably a sweet spot here—which definitely varies based on the material—allowing students to receive timely feedback and the benefits of regular preparation and engagement, while minimizing the amount of grading and leaving the course staff with as much time for direct student support as possible.
Frequent small assessments are more effective in supporting student learning than one large assessment. However, they also raise a variety of questions related to scheduling, flexibility, pace, makeups, and so on—none of which I want to get into now.
But before concluding, I do want to speak to the title of this essay. First, obviously I don’t mean it literally—any series of exams will result in a “final exam”, meaning the only way to eliminate them entirely would be to have no assessments. That’s probably not wise.
When I say final exam, I’m referring to any high-value summary assessment conducted at the end of the semester. We give a 15th and final quiz, but it’s only worth ~3% of the student’s grade like every other quiz, and so doesn’t qualify as a final exam. The point cutoff is debatable, but I’d usually consider any late-semester assessment worth 20% or more of the student’s grade as qualifying as a final exam.
Combined with frequent small assessments that structure learning during the semester, I’m less concerned with a 20% final exam(2). However, as the percentage of the grade allocated to the final grows, I become increasingly convinced that there must be some better way to reallocate those points to smaller assessments that support student learning and reduce the stress and incentives to cram created by a high-stakes assessment. To paraphrase one of my favorite films, I’d be excited to see you get that high-stakes final exam off of your gradebook. I think it would open up all kinds of interesting possibilities(3).
I’ve heard a few reasonable arguments for a higher-point summary assessment. One is that it can allow students to integrate everything they’ve learned during the semester. This is probably best when the course content is somewhat fragmented. When you teach material that builds on itself, assessments later in the semester will tend to naturally reinforce earlier concepts and skills.
Another argument is that frequent small assessments allow students to check out once they’ve earned a grade that they are happy with, and potentially miss out on important material covered later in the semester. This is a valid concern. But I’ve been on the lookout for evidence of this effect in the data we collect from my class, and so far haven’t seen enough of it to justify bringing back a true final exam. This may be one of the few places where grade inflation is working in our favor, since at least at Illinois, the number of students who seem satisfied with passing my course with a C is very small. However, I’d love to move to Pass/Fail grading at some point, and if we did I think there’d be a stronger argument for a summary assessment to help keep students engaged with the material all the way until the end of the semester.
Overall, while I strongly disagree with Parrhesia’s position, I’m happy to see more consideration of this facet of instructional design. Typical course assessment focuses way too much on the learning theater that happens in the classroom, and far too little on structural components of course design that support student engagement and knowledge retention.
The educators I know are very much in agreement about our core goals—assessment that is accurate, ethical, not unduly stressful, and overall support of student learning. I just don’t see how even the traditional midterm and final exam model effectively supports these objectives, much less a single 100% final exam.
Autograding and computerized testing have also made it possible for computer science educators to more easily explore new points in the assessment design space. So far my results support the position that more assessments are better.