Why is Covid Surveillance so Hard?
Following up on a recent thread, I want to discuss how Covid case numbers are generated. I think this is interesting not only because Covid is interesting, but because it’s a good example of non-obvious real-world complexity.
A reasonable person might have a mental model of surveillance that looks something like this:
A person takes a Covid test
The lab enters the result in a state-wide database
An automatic process runs a database query once a day
Yesterday’s cases show up on a pretty chart
That model is completely wrong.
This is a brief overview of some of the challenges involved in Covid surveillance: the situation is far more complicated than I’m presenting here.
Organizational challenges
Public health in the United States has been severely underfunded for four decades. Everything is running on a shoestring and every team is understaffed.
Everyone is trying to hire epidemiologists and public health informaticians right now and there simply aren’t enough to go around. No matter how much money you throw at the problem, you can’t build your team to the size you need.
Government salaries for tech staff are shockingly low compared to the private sector. Good tech people are essential, but hiring and keeping a good person on a government salary takes a minor miracle.
Many of the core information systems were built by epidemiologists. They’re smart people who are good at what they do, but they aren’t professional coders or data scientists. Most of them know just enough code to be dangerous.
Until Covid, reportable condition pipelines were optimized for ease of analysis, not for throughput. All of a sudden, throughput is critically important: many systems are pushing 100 times as much data as they were two years ago. I’m aware of a critical daily data pipeline that takes 8 hours to run, assuming nothing goes wrong.
Every part of the pipeline has enormous amounts of tech debt.
Legal challenges
Testing data is covered by HIPAA as well as numerous other laws and regulations that control how it must be stored and who it can be shared with.
Multiple arms of the federal government are involved. CDC and HHS make large grants to state health departments that come with substantial technical requirements.
In many states, every county is a semi-autonomous Local Health Jurisdiction (LHJ). Every LHJ has its own staff, processes, systems, and priorities.
There are 574 federally recognized tribes in the United States and every one is a sovereign nation. There are complicated and sensitive issues around data sharing between the federal government and tribes.
Pipeline challenges
Before Covid, it was pretty much unheard of for health departments to collect negative test results or test results from point-of-care tests like BinaxNOW. These results now make up the large majority of reports sent to state health departments and are shoehorned into systems that were never designed for them.
Testing data is sent to the state health department by:
Hospitals, clinics, and doctors’ offices
Federal, state, tribal, local, and private testing sites
Clinical labs
Private employers, including fishing vessels and other ships at sea
Schools
Nursing homes
Home tests with an electronic back end
Random Covid startups that may or may not know anything about medicine
Many other sources
Data enters the main pipeline via:
Various electronic portals and APIs
Faxes and eFaxes that are transcribed by hand (in at least one case, by a room full of National Guard members)
Spreadsheets in unknown formats that are sent in by email
Many of those sources are in different formats and many are incomplete, erroneous, or malformed. Many of the points of origin have never before conducted any kind of medical testing and are completely new to reporting lab results.
There are numerous third-party systems dumping data into the main pipeline, and every one of them has the ability to break the entire pipeline by introducing malformed data.
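One common defense against that failure mode is to validate each incoming record on its own and set the bad ones aside instead of letting them halt the whole run. The sketch below is only an illustration of the idea, not a description of any real health department's system. The field names (`patient_name`, `collected`, `result`), the accepted result values, and the assumption that dates arrive in ISO format are all made up for the example.

```python
from datetime import date

# Hypothetical schema: these field names and values are assumptions for illustration.
REQUIRED_FIELDS = ("patient_name", "collected", "result")
VALID_RESULTS = {"positive", "negative", "inconclusive"}


def validate(record):
    """Return a list of problems found in one raw record (empty if it looks usable)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field) or "").strip():
            problems.append(f"missing {field}")
    result = str(record.get("result") or "").strip().lower()
    if result and result not in VALID_RESULTS:
        problems.append(f"unrecognized result {result!r}")
    collected = str(record.get("collected") or "").strip()
    if collected:
        try:
            date.fromisoformat(collected)  # assumes ISO dates such as 2021-11-03
        except ValueError:
            problems.append(f"unparseable date {collected!r}")
    return problems


def split_batch(records):
    """Separate usable records from quarantined ones so one bad row cannot stop the batch."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate(record)
        if problems:
            quarantined.append((record, problems))
        else:
            accepted.append(record)
    return accepted, quarantined


batch = [
    {"patient_name": "Ana Diaz", "collected": "2021-11-03", "result": "Negative"},
    {"patient_name": "", "collected": "2021-11-03", "result": "positive"},
    {"patient_name": "Sam Lee", "collected": "11/3/21", "result": "POS"},
]
accepted, quarantined = split_batch(batch)
print(len(accepted), "accepted;", len(quarantined), "quarantined")
for record, problems in quarantined:
    print(record["patient_name"] or "(no name)", "->", problems)
```

Quarantining keeps the pipeline running, but it doesn't make the problem go away: somebody still has to look at the quarantined rows, which is part of why the human workload is so high.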
Data challenges
Test results are often incomplete or contain errors. Many test results enter the system multiple times from multiple sources, sometimes with misspelled names or different names for the same person. There is no universal identifier like a social security number associated with reports. Data deduplication is computationally expensive and requires significant human intervention.
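To give a feel for why deduplication is expensive, here is a minimal sketch of fuzzy matching on names using Python's standard `difflib` module. It is a toy, not how any real system works: the records, the field names, and the 0.9 similarity threshold are assumptions chosen for illustration, and production record linkage uses far more sophisticated methods.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative records; the field names and the 0.9 threshold are assumptions.
reports = [
    {"id": 1, "name": "Katherine Johnson", "dob": "1984-02-11", "collected": "2021-11-03"},
    {"id": 2, "name": "Katherine Jonson", "dob": "1984-02-11", "collected": "2021-11-03"},
    {"id": 3, "name": "Kate Johnson", "dob": "1984-02-11", "collected": "2021-11-03"},
    {"id": 4, "name": "Carl Johnson", "dob": "1990-07-30", "collected": "2021-11-03"},
]


def normalize(name):
    """Lowercase a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def name_similarity(a, b):
    """Similarity ratio between 0 and 1 for two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_duplicates(records, threshold=0.9):
    """Compare every pair of records: n records means n*(n-1)/2 comparisons."""
    auto, review = [], []
    for a, b in combinations(records, 2):
        if a["dob"] != b["dob"] or a["collected"] != b["collected"]:
            continue  # different birth date or collection date: treat as different tests
        if name_similarity(a["name"], b["name"]) >= threshold:
            auto.append((a["id"], b["id"]))    # near-identical names: likely the same report
        else:
            review.append((a["id"], b["id"]))  # same dates, different names: ask a human
    return auto, review


auto, review = candidate_duplicates(reports)
print("likely duplicates:", auto)
print("needs human review:", review)
```

The misspelling ("Jonson") is caught automatically, but the nickname ("Kate") lands in the review pile, and the all-pairs comparison grows quadratically with the number of reports. Both effects get much worse at the scale of a state.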
A typical sequence of reporting goes something like this:
A person gets infected
A couple of days later they develop symptoms
The next day, they get tested
A few days later, the test is complete
The result is sent to the state health department
The result goes through a complex pipeline where it is validated, de-duplicated, and combined with related entries
The result might get sent to one or more separate systems for case investigation and contact tracing
Finally, the result ends up going through a complex pipeline for the reporting and dashboard system
What date do you associate with a given test: the day the sample was collected, or the day the result was reported to the health department? CDC uses the report date, which simplifies reporting but gives less insight into the situation on the ground. Many states use the date the sample was collected, which is more useful but involves longer reporting delays. Regardless of which convention you use, there are substantial delays: in a state with an efficient surveillance system, it might take 7 days for 90% of all test results to pass through the system.
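Here is a small sketch of how the two date conventions behave. The data is entirely made up: ten positive tests collected on the same day, with reporting delays I picked so that 90% arrive within 7 days. It is meant only to show the mechanics, not to represent any real state's numbers.

```python
from collections import Counter
from datetime import date, timedelta
import math

# Made-up data: ten positive tests, all collected on the same day, with
# reporting delays (in days) chosen only to illustrate the effect.
collected_on = date(2021, 11, 1)
delays_in_days = [1, 1, 2, 2, 2, 3, 3, 4, 7, 12]
results = [(collected_on, collected_on + timedelta(days=d)) for d in delays_in_days]

# Convention 1: count each result on the day the health department received it.
# Each day's count is final as soon as the day ends.
by_report_date = Counter(reported for _, reported in results)


# Convention 2: count each result on the day the sample was collected.
# The count for a given day keeps growing as late reports trickle in.
def known_cases(collection_day, as_of):
    """Cases collected on collection_day that had been reported by as_of."""
    return sum(
        1
        for collected, reported in results
        if collected == collection_day and reported <= as_of
    )


for offset in (1, 3, 7, 12):
    as_of = collected_on + timedelta(days=offset)
    print(f"as of {as_of}: {known_cases(collected_on, as_of)} cases collected on {collected_on}")

# Nearest-rank 90th percentile: the smallest delay that covers at least 90% of results.
delays = sorted((reported - collected).days for collected, reported in results)
p90_delay = delays[math.ceil(len(delays) * 9 / 10) - 1]
print("90% of results arrived within", p90_delay, "days")
```

Under the collection-date convention, the most recent days on any chart are always undercounted, which is why the tail of a dashboard so often appears to show cases dropping.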
What about other data? It’s useful to know how many kids in a given school are testing positive so you can detect and respond to outbreaks. That’s great if you’re testing at school, but what if someone goes to a drive-thru site? What if they get tested by their family doctor? Most of those sources don’t report what school the patient attends. When they do, there is no standard format for reporting schools, so the raw data contains a mess of misspellings and mistakes in the school names, which must be tediously corrected to be useful.
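As a rough illustration of the school-name cleanup, the sketch below matches free-text names against a canonical list using `difflib.get_close_matches` from the Python standard library. The school names, the normalization rules, and the 0.8 cutoff are all hypothetical; a real list would come from the school districts, and a real cleanup process would be considerably messier.

```python
from difflib import get_close_matches

# Hypothetical canonical list; real names would come from the school districts.
CANONICAL_SCHOOLS = [
    "Lincoln Elementary School",
    "Roosevelt High School",
    "Washington Middle School",
]


def normalize(text):
    """Lowercase, drop periods, and collapse whitespace."""
    return " ".join(text.lower().replace(".", "").split())


# Map each normalized canonical name back to its official spelling.
LOOKUP = {normalize(name): name for name in CANONICAL_SCHOOLS}


def match_school(raw, cutoff=0.8):
    """Return the canonical name closest to raw, or None if nothing is close enough."""
    hits = get_close_matches(normalize(raw), list(LOOKUP), n=1, cutoff=cutoff)
    return LOOKUP[hits[0]] if hits else None


for raw in ["Lincon Elementry School", "roosevelt high school", "Lincoln Elem.", "homeschool"]:
    print(repr(raw), "->", match_school(raw))
```

Simple misspellings and capitalization differences resolve automatically, but an abbreviation like "Lincoln Elem." falls below the cutoff and comes back as `None`. Lowering the cutoff would catch it at the cost of more false matches, so in practice a person ends up reviewing the leftovers.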
Testing data is used for many kinds of analysis, some of which require complex interactions between multiple test records, medical records, and other data sources. For example, determining whether a given case is a reinfection requires figuring out whether the patient has ever had a previous positive test (keeping in mind that they may have moved, changed their name, etc. in the interim). Calculating breakthrough cases requires matching test results to vaccination records. None of those operations are easy, and none of them are computationally cheap.
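To sketch what a reinfection check involves once the hard part is done, the example below assumes every positive test has already been linked to a stable `person_id` (which, as described above, is exactly what the raw data lacks). The 90-day window is an assumption I chose for illustration, as are the function and field names; actual case definitions vary and are more detailed.

```python
from datetime import date, timedelta

# Assumption for illustration: a positive test counts as a possible reinfection
# only if it comes at least 90 days after the start of the previous episode.
REINFECTION_WINDOW = timedelta(days=90)


def flag_reinfections(positives):
    """positives: iterable of (person_id, collection_date) for positive tests.

    Assumes person_id is already a reliable identifier for each person.
    """
    episode_start = {}  # person_id -> date the current episode began
    flagged = []
    for person_id, collected in sorted(positives):
        start = episode_start.get(person_id)
        if start is None:
            episode_start[person_id] = collected  # first known infection
        elif collected - start >= REINFECTION_WINDOW:
            flagged.append((person_id, collected))
            episode_start[person_id] = collected  # a new episode begins
        # Otherwise this is a repeat positive within the same episode.
    return flagged


positives = [
    ("p1", date(2021, 1, 10)),
    ("p1", date(2021, 1, 20)),  # ten days later: same episode
    ("p1", date(2021, 8, 2)),   # months later: possible reinfection
    ("p2", date(2021, 7, 15)),
]
reinfections = flag_reinfections(positives)
print(reinfections)
```

The logic itself is short. The expense is in everything it takes for granted: a complete history of each person's tests and an identifier that survives moves and name changes.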
Closing thoughts
There is, I think, an interesting object lesson here. Like container logistics, Covid surveillance is much harder than it looks to a casual observer. From the outside, it’s hard to understand why Covid surveillance is so slow, incomplete, and erratic. If you’re in tech, it’s easy to imagine how you could build a modern data pipeline that would solve all of those problems, and it’s hard to understand why someone hasn’t already done that.
That pipeline is theoretically possible (and exists in some other countries), but in the US there are very substantial structural obstacles to creating and deploying it. Any attempt to improve Covid surveillance must account for those obstacles, or it will fail.
More generally, changing the real world requires understanding the real world. Victory comes from solving the tedious mundane problems as well as the fun technical problems.
Important disclaimers
I speak for myself only and not for any other person or government agency.
I have no professional credentials in this field, but I have consulted closely with an expert in Covid surveillance while writing this.