I was thinking a little bit about what a test score confirms regarding a student’s knowledge. If I get a 90% on an exam, does that mean I know 90% of the material (1)? The answer seems obvious – NO – but it doesn’t really help us, either, if our main interest is to understand the student’s knowledge.
Let’s have a closer… examination… of this question by considering a really basic knowledge test – US state capitals. There are fifty capitals for fifty states so this test should have fifty questions to comprehensively cover the material. Now, if I get a 90% on the test, I could walk around boasting that I know nine-tenths of the nation’s capitals. But this doesn’t seem quite right, at least based on my experience taking tests. I’ve scored 90% when I’ve known far less than 90% (2).
What gives? As it does in so much of life, I think design matters. Design counts for optimizing test performance just as it does for your lawnmower, smartphone, or teakettle. The most common form of test design is the multiple-choice exam. A 90% on this format is very different from a 90% on a sheet of paper with a blank line alongside each of the fifty states (or even better, just a sheet of paper with fifty blank lines). If I write out 90% of the state capitals correctly, I probably know more than someone who circled the correct letter 90% of the time on the multiple-choice version (3).
This raises the question of how much the design boosts the grade. The way I thought about this was to reverse the problem – a 90% means I don’t know 10% of the answers. But is this true? Let’s suppose we have the simplest form of multiple-choice exam possible – the true or false test. In this format, the state is listed alongside a city and all you do is choose true or false. On the fifty-question test, a 90% score means five incorrect answers, and I’ll assume each of those was a guess. However, I argue that the student must have also guessed correctly about five additional times. This is based on an application of basic probability theory – given that one of two choices is correct, a random guess is right half the time. So, given five incorrect guesses, the theory suggests, insists even, that on average there were also five correct guesses. That makes ten guesses in total and forty answers the student actually knew. A 90% on a true or false exam is really an 80% knowledge score plus 10% design-driven grade inflation (4).
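For anyone who wants to check the arithmetic, here is a rough sketch in Python. The function name split_score is my own invention for illustration, and it leans on two assumptions: every answer the student didn't know was a blind guess among equally likely options, and those guesses landed exactly at the long-run average.

```python
from fractions import Fraction


def split_score(score_percent, choices):
    """Split an observed score into (known, lucky) fractions of the test.

    Assumption: every unknown answer was a blind guess among `choices`
    equally likely options, and the guesses came out exactly at chance.
    """
    if choices < 2:
        raise ValueError("need at least two choices to guess between")
    score = Fraction(score_percent, 100)
    wrong = 1 - score
    # Wrong answers are the guesses that missed: (choices - 1) out of
    # every `choices` guesses, so scale back up to all guesses.
    guessed = wrong * choices / (choices - 1)
    # One guess in `choices` lands on the right answer by luck.
    lucky = guessed / choices
    known = score - lucky
    return known, lucky


known, lucky = split_score(90, 2)
print(known, lucky)  # 4/5 1/10
```

On the true or false version, a 90% splits into 4/5 known and 1/10 lucky, which is the 80% plus 10% from above. A score below the blind-guess baseline would produce a negative "known" value, which is a sign that the assumptions don't hold rather than a result to take seriously.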
Another way to think about this is to consider what the minimum grade is for a student with no knowledge. Let’s say you take the state capital test in a language you’ve never used before in your life. If the exam format were true or false, you would get a 50% (on average) just by guessing. If the test had five choices per question, you would get a 20%. And if the test were a blank sheet of paper… well, let's hope 'Albany' in Arabic doesn't mean something offensive.
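The "on average" part is easy to see with a quick simulation. This is only a sketch: the function name blind_guess_average, the trial count, and the fixed seed are all my own choices, and the test-taker here knows nothing and picks at random every time.

```python
import random


def blind_guess_average(choices, questions=50, trials=2000, seed=0):
    """Average score of a test-taker who guesses every question at random."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    total_correct = 0
    for _ in range(trials):
        # Call option 0 the right answer; a guess is correct when it lands there.
        total_correct += sum(
            rng.randrange(choices) == 0 for _ in range(questions)
        )
    return total_correct / (trials * questions)


print(blind_guess_average(2))  # close to 0.5
print(blind_guess_average(5))  # close to 0.2
```

Any single fifty-question attempt can land well above or below these figures, which is why the baseline is only an average.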
This doesn’t necessarily mean that all tests should opt for the open-ended format. I would certainly wonder whether it was worth the effort of redesigning exams in an educational setting for the sake of stripping away a small percentage point bias in the final score. But I think the broader lesson of thinking seriously about design is an important consideration for someone who is using a multiple-choice format in other contexts. A poorly designed study, for example, could trace its flaws back to the way subjects were asked certain questions. Those results have real-world consequences in a way that the day you forgot about Carson City does not.
One example I recently read about was in the endnotes of Malcolm Gladwell’s Talking To Strangers. He recounted how using a multiple-choice format led researchers to incorrectly conclude that people’s facial expressions show emotions in the same way regardless of culture. It’s one thing to ask ‘does this face look happy?’ - the answer is yes or no. It turns out that asking ‘what emotion does this face show?’ brings about an entirely different set of potential responses, stripping away the fake patterns and coincidences that sometimes emerge in guesswork.
Perhaps the only real lesson in all of this is one I’ve written about countless times on TOA – don’t make assumptions. A test format with options is essentially a subtle form of assumption. It might be OK to use this approach when the questions assess someone’s knowledge of cut-and-dried facts. Applying the same approach to inquiry or discovery, however, presents some serious problems, because the options imply the answer is known before we can actually confirm our knowledge.
Footnotes / endnotes
0. Is this a loophole in the new TOA setup?
I count the words in footnotes for the word count, don't you worry, reader.
1. But what if the student’s grandmother just died?
There are some obvious reasons why this would not be the case. In the interest of simplifying the question, let’s just say that (a) the test covers all the material, (b) the material is presented without bias, (c) the test is administered under perfect conditions, and (d) the student is in perfect condition to take the test.
Of course, some people will still insist on pointing out a host of other reasons why a test result never actually tells you how much the test-taker knows. I’m not sure why some people insist on shooting things down rather than trying to build on ideas. Fire at will, I say, just don't come complaining to me when people stop sharing their ideas with you.
2. But what if you knew more?
This isn’t really true since I usually knew at least 90% of the material. This isn’t a humble brag, it’s just a brag – I was wicked smaht, and sometimes still am.
3. This is really about footnote #1.
Like I noted above, this is the point where someone will cut me off and start pointing out inane hypothetical scenarios. What if someone who would have correctly guessed on a write-in test gets fooled because the four choices tempt a switch to a different city? The thing I never understand is why someone would apply a standard of perfection to reject changes regarding something no one ever insisted was perfect.
4. It’s about the formula, not the numbers.
The formula stays the same if you add more choices, but the grade inflation number comes down. If a student selected a state capital from six choices per question, the 90% is really an 88% with a 2% design boost (because each guess has a one in six chance of being correct, five wrong answers out of fifty imply about six guesses, one of them lucky).
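Written out, the formula looks something like this. The symbols are my own shorthand (s for the observed score as a fraction, k for the number of choices per question, g for the share of questions guessed), and the whole thing assumes that every unknown answer was a blind guess that came out at the long-run average.

```latex
% s = observed score, k = choices per question, g = fraction guessed
\begin{align*}
1 - s &= g \cdot \frac{k - 1}{k}
  && \text{(wrong answers are the guesses that missed)} \\
g &= \frac{k\,(1 - s)}{k - 1} \\
\text{boost} &= \frac{g}{k} = \frac{1 - s}{k - 1} \\
\text{knowledge} &= s - \frac{1 - s}{k - 1} = \frac{k s - 1}{k - 1}
\end{align*}
% k = 2, s = 0.9: boost = 0.10, knowledge = 0.80
% k = 6, s = 0.9: boost = 0.02, knowledge = 0.88
```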