Sammy Martin comments on Open & Welcome Thread—August 2020

Sammy Martin 15 Aug 2020 16:38 UTC
11 points
Covid19Projections has been one of the most successful coronavirus models in large part because it is as ‘model-free’ and simple as possible, using ML to backtrack parameters for a simple SEIR model from death data only. This has proved useful because case numbers are skewed by varying numbers of tests, so deaths are more consistently reliable as a metric. You can see the code here.
However, in countries doing a lot of testing, with a reasonable number of cases but with very few deaths, like most of Europe, the model is not that informative, and essentially predicts near 0 deaths out to the limit of its measure. This is expected—the model is optimised for the US.
Estimating SEIR parameters based on deaths works well when you have a lot of deaths to count, if you don’t then you need another method. Estimating purely based on cases has its own pitfalls—see this from epidemic forecasting, which mistook an increase in testing in the UK mid-july for a sharp jump in cases and wrongly inferred brief jump in R_t. As far as I understand their paper, the estimate of R_t from case data adjusts for delays in infection to onset and for other things, but not for the positivity rate or how good overall testing is.
This isn’t surprising—there is no simple model that combines test positivity rate and the number of cases and estimates the actual current number of infections. But perhaps you could use a Covid19pro like method to learn such a mapping.
Very oversimplified, Covid19pro works like this:
Our COVID-19 prediction model adds the power of artificial intelligence on top of a classic infectious disease model. We developed a simulator based on the SEIR model (Wikipedia) to simulate the COVID-19 epidemic in each region. The parameters/inputs of this simulator are then learned using machine learning techniques that attempts to minimize the error between the projected outputs and the actual results. We utilize daily deaths data reported by each region to forecast future reported deaths. After some additional validation techniques (to minimize a phenomenon called overfitting), we use the learned parameters to simulate the future and make projections.
$f (d e a t h s : t_{0}) = (S, E, I, R) t_{0}$
$g ((S, E, I, R) t_{0}) = d e a t h s : t_{1}$
And the functions f and g, estimate the SEIR (susceptible, exposed, infectious, recovered) parameters from current deaths up to some time t_0, and the future deaths based on those parameters respectively. These functions are then both optimised to minimise error when the actual number of deaths at t_1 is fed into the model.
This oversimplification is deliberate:
Deaths data only: Our model only uses daily deaths data as reported by Johns Hopkins University. Unlike other models, we do not use additional data sources such as cases, testing, mobility, temperature, age distribution, air traffic, etc. While supplementary data sources may be helpful, they can also introduce additional noise and complexity which can notably skew results.
What I suggest is a slight increase in complexity, where we use a similar model except we feed it paired test positivity rate and case data instead of death data. The positivity rate /tests per case serves as a ‘quality estimate’ which serves to tell you how good the test data is. That’s how tests per case is treated by our world in data. We all know intuitively that if positivity rate is going down but cases are going up, the increase might not be real, but if positivity rate is going up and cases are going up the increase definitely is real.
What I’m suggesting is that we combine do something like this:
$f (c a s e s, t e s t s / c a s e : t_{0}) = (S, E, I, R) t_{0}$
$g ((S, E, I, R) t_{0}) = d e a t h s : t_{1}$
Now, you need to have reliable data on the number of people tested each week, but most of Europe has that. If you can learn a model that gives you a more accurate estimate of the SEIR parameters from combined cases and tests/case data, then it should be better at predicting future infections. It won’t necessarily predict future cases, since the number of future cases is also going to depend on the number of tests conducted, which is subject to all sorts of random fluctuations that we don’t care about when modelling disease transmission, so instead you could use the same loss function as the original covid19pro—minimizing the difference between projected and actual deaths.
Hopefully the intuition that you can learn more from the pair (tests/case, number of cases) than number of cases or number of deaths alone should be borne out, and a c19pro-like model could be trained to make high quality predictions in places with few deaths using such paired data. You would still need some deaths for the loss function and fitting the model.