Interpreting Diagnostic Tests
Most people—including most physicians—don’t understand how to correctly interpret diagnostic medical tests.
Here’s a concrete example: the BinaxNOW home COVID test has 68% sensitivity (chance of giving a positive result if you have COVID) and 99% specificity (chance of giving a negative result if you don’t have COVID). So does that mean positive test results are 68% accurate and negative results are 99% accurate? Unfortunately, it isn’t that simple.
My COVID risk level means that for me, a positive test only has a 6% chance of being accurate but a negative test has a 99.97% chance of being accurate. Your odds might be completely different, depending on your risk level.
In this post, I’ll explain how it all works and show you how to understand test results. (Spoiler: aside from some medical terminology, this is just an application of Bayes’ Theorem).
Update: mayleaf has an excellent piece on interpreting COVID test results using Bayes factors, which I highly recommend.
Sensitivity and specificity
Let’s start with the easy part. Sensitivity and specificity measure the intrinsic accuracy of a test regardless of who’s taking it.
Sensitivity is how often a test gives a positive result when testing someone who is actually positive. BinaxNOW has a sensitivity of 68%, so if 100 people with COVID take a BinaxNOW test, 68 of them will test positive.
Specificity is how often a test gives a negative result when testing someone who is actually negative. BinaxNOW has a specificity of 99%, so if 100 people who do not have COVID take a BinaxNOW test, 99 of them will test negative.
Why does your risk level matter?
Let’s do a thought experiment:
If 100 people who have COVID take BinaxNOW tests, they will get 68 positive results and 32 negative results. All the positives are correct (0% false positive rate) and all the negatives are incorrect (100% false negative rate).
If 100 people who don’t have COVID take BinaxNOW tests, they will get 1 incorrect positive result (100% false positive rate) and 99 correct negative results (0% false negative rate).
The same test has completely different false positive and false negative rates, depending on how likely it is that the person taking it has COVID. So how do I calculate the test’s accuracy based on my risk level?
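Before answering that, here’s the thought experiment above written out as a few lines of Python (my own sketch, not part of any official documentation), in case it helps to see the two extremes side by side:

```python
sensitivity = 0.68   # chance of a positive result if you have COVID
specificity = 0.99   # chance of a negative result if you don't have COVID

def expected_results(n_infected, n_uninfected):
    """Expected test results for a group with the given mix of infected and uninfected people."""
    true_positives = n_infected * sensitivity
    false_negatives = n_infected * (1 - sensitivity)
    true_negatives = n_uninfected * specificity
    false_positives = n_uninfected * (1 - specificity)
    return true_positives, false_positives, true_negatives, false_positives and false_negatives or false_negatives

# 100 people who all have COVID: every negative result is wrong.
print(expected_results(100, 0))   # roughly (68, 0, 0, 32)

# 100 people who all don't have COVID: every positive result is wrong.
print(expected_results(0, 100))   # roughly (0, 1, 99, 0)
```

Same test, same numbers; only the mix of people taking it changed.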
First, some terminology
The numbers I want are:
Positive predictive value (PPV): how accurate is a positive test result? If I get a positive test result, the PPV is the chance that I truly have COVID.
Negative predictive value (NPV): how accurate is a negative test result? If I get a negative test result, the NPV is the chance that I truly don’t have COVID.
In order to calculate those, I need to know:
Prior probability (sometimes called pre-test probability or prevalence): what is the probability that I have COVID based on my risk level, symptoms, known exposure, etc.?
I’m on a 500 microCOVID per week risk budget, which means my chance of having COVID at any given time is approximately 0.1%. So my prior probability is 0.1%. (Assuming I don’t have any symptoms: if I suddenly get a fever and lose my sense of smell, my prior probability might be close to 100%).
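In case the jump from 500 microCOVIDs per week to 0.1% isn’t obvious, here’s the rough arithmetic. The two-week window during which an infection would still show up on a test is my own assumption, not an official microCOVID figure:

```python
# A microCOVID is a one-in-a-million chance of catching COVID.
weekly_risk = 500 / 1_000_000    # 500 microCOVIDs per week = 0.05% chance of catching COVID each week
weeks_detectable = 2             # assumed: an infection stays detectable for roughly two weeks
prior_probability = weekly_risk * weeks_detectable
print(prior_probability)         # 0.001, i.e. roughly 0.1%
```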
Give me the answer already
This calculator by Epitools lets you calculate PPV and NPV. In my case, I put 0.68 in the Sensitivity box, 0.99 in the Specificity box, 0.001 in the Prior probability box, and press Submit. After waiting a surprisingly long time, the calculator tells me my PPV is 6.4% and my NPV is 99.97%.
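If you’d rather skip the web calculator, the same two numbers fall out of a few lines of Python. This is just my own sketch of the underlying Bayes’ Theorem arithmetic, not anything from Epitools:

```python
def predictive_values(sensitivity, specificity, prior):
    """Return (PPV, NPV) for a test, given its accuracy and your prior probability."""
    true_positive = prior * sensitivity                # have COVID, test positive
    false_positive = (1 - prior) * (1 - specificity)   # don't have COVID, test positive
    true_negative = (1 - prior) * specificity          # don't have COVID, test negative
    false_negative = prior * (1 - sensitivity)         # have COVID, test negative
    ppv = true_positive / (true_positive + false_positive)
    npv = true_negative / (true_negative + false_negative)
    return ppv, npv

ppv, npv = predictive_values(0.68, 0.99, 0.001)
print(f"PPV = {ppv:.1%}, NPV = {npv:.2%}")   # PPV = 6.4%, NPV = 99.97%
```

Try swapping in different priors to see how dramatically the PPV moves with your risk level.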
A graphical explanation
The BMJ has an excellent Covid-19 test calculator that does a nice job of representing all of this graphically. I recommend taking two minutes to play with it if you want to develop an intuitive understanding of how this works.
Unfortunately, the BMJ calculator doesn’t have the precision to calculate PPV and NPV for someone with a very low prior probability.
A numerical explanation
If you’re familiar with Bayes’ Theorem, you already know how to do the math. If not, here’s a quick summary of how to calculate PPV and NPV yourself.
Probabilities can be hard to think about: for most people, it’s easiest to imagine a large number of people taking the same test. So let’s imagine 1,000,000 of me taking the test. (Is the world ready for a million Tornuses? I say yes!)
Because my prior probability is 0.1%, 1,000 of the hypothetical Tornuses have COVID. The sensitivity of the test is 68%, so 680 of them get correct positive results and 320 get incorrect negative results.
What about the remaining 999,000 Tornuses who don’t have COVID? The specificity is 99%, so 989,010 get correct negative results and 9,990 get incorrect positive results.
So how do we calculate PPV? There are 10,670 positive tests (680 + 9,990), of which 680 are accurate. So the odds of a positive test being accurate are 6.4% (680 / 10,670).
If I get a positive test result, it has a 6.4% chance of being accurate.
How about the NPV? There are 989,330 negative tests, of which 989,010 are accurate. NPV = 989,010 / 989,330 = 99.97%.
If I get a negative test result, it has a 99.97% chance of being accurate.
Step by step
1a. Imagine 1,000,000 people taking the test
2a. Truly positive people = 1,000,000 x (prior probability)
2b. Correct positives = (truly positive people) x (sensitivity)
2c. Incorrect negatives = (truly positive people) x (1 - sensitivity)
3a. Truly negative people = 1,000,000 x (1 - prior probability)
3b. Correct negatives = (truly negative people) x (specificity)
3c. Incorrect positives = (truly negative people) x (1 - specificity)
4a. PPV = (correct positives) / (correct positives + incorrect positives)
4b. NPV = (correct negatives) / (correct negatives + incorrect negatives)
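And here’s the same recipe written out step by step as a short Python sketch (the variable names are mine), using the BinaxNOW numbers and my 0.1% prior so you can check it against the results above:

```python
sensitivity = 0.68
specificity = 0.99
prior = 0.001

population = 1_000_000                                    # 1a. imagine 1,000,000 people
truly_positive = population * prior                       # 2a. -> 1,000
correct_positives = truly_positive * sensitivity          # 2b. -> 680
incorrect_negatives = truly_positive * (1 - sensitivity)  # 2c. -> 320
truly_negative = population * (1 - prior)                 # 3a. -> 999,000
correct_negatives = truly_negative * specificity          # 3b. -> 989,010
incorrect_positives = truly_negative * (1 - specificity)  # 3c. -> 9,990
ppv = correct_positives / (correct_positives + incorrect_positives)  # 4a.
npv = correct_negatives / (correct_negatives + incorrect_negatives)  # 4b.

print(f"PPV = {ppv:.1%}, NPV = {npv:.2%}")   # PPV = 6.4%, NPV = 99.97%
```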
In closing
Don’t feel too bad if this doesn’t make intuitive sense to you. If you understand the question and know where to go to calculate the answer, you’re ahead of most physicians.
A comment from a reader
Unfortunately, it’s not even that simple. This is really only the tip of a rather gnarly iceberg. The sorts of calculations in the post are a start toward decent interpretation, but only a start.
Treating repeated tests on the same person as independent draws with these error rates is almost certainly wrong, because the test results aren’t independent. The tests very likely do reliably measure some underlying factor that is correlated with having COVID, such as the presence of a particular class of proteins at some concentration in the sample. If you took many tests yourself, you would likely see a much larger or much smaller fraction of positive results than the advertised rates, since the underlying concentration in multiple samples taken from you is likely to be much more consistent than the concentrations in samples taken from a collection of random people infected with COVID. So you can’t even “average it out” by doing lots of tests.
What’s worse, some underlying factors are likely to vary between tested populations, between instances of sample collection, and so on. So even if you know the proportion of false positives and negatives they (claim to!) got in their own test population, that quite likely won’t match your probability of getting a false positive or negative, and you should allow for even wider variance than the advertised figures because you usually don’t know how closely you match their validation testing profile.
Even worse still, you can’t even multiply the chances of false positives or negatives with prior probabilities from other evidence, because factors related to the other evidence might also co-vary with the probabilities of false positives or negatives. For example, suppose you reduce your evaluated chance of having COVID (moderately) by the fact that you’re not displaying symptoms. Then you lower it more by having a negative test. Oops! Many of these sorts of tests are far more likely to give false negatives in people who are not showing symptoms, so you’ve double-counted some of the same evidence!
These are just a few of the extra pitfalls in interpreting tests, and indeed when interpreting statistical evidence of all types.
My reply
Yes, those are all excellent points.
I wrote this as a side reference for a deep dive on the BinaxNOW that’s coming shortly, and it’ll dig into the numerous, complex, and important issues affecting BinaxNOW accuracy. Short version: the accuracy varies substantially, largely based on viral load. And you’re correct that repeated tests on the same individual will be strongly correlated.
And you’ve convinced me to change the example you cite: I’d gone with the first person for narrative consistency, but I’m shifting it to prioritize technical accuracy.