Computational Approaches to Pathogen Detection

While this post is my perspective and not an official post of my employer’s, it also draws on a lot of collaborative work with others at the Nucleic Acid Observatory (NAO).
One of the future scenarios I’m most worried about is someone creating
a “stealth”
pandemic. Imagine a future HIV that first infects a large number of
people with minimal side effects and only shows its nasty side after
it has spread very widely. This is not something we’re prepared
for today: current detection approaches (symptom reporting, a
doctor noticing an unusual pattern) require visible effects.
Over the last year, with my colleagues at the NAO, I’ve been exploring one
promising method of identifying this sort of pandemic. The overall
idea is:
Collect some sort of biological material from a lot of people
on an ongoing basis, for example by sampling sewage.
Use metagenomic sequencing to learn what nucleic acids are in
these samples.
Run novel-pathogen detection algorithms on the sequencing data.
When you find something sufficiently concerning, follow up
with tests for the specific thing you’ve found.
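To make the shape of the system concrete, here’s a minimal sketch of how these four stages might chain together. Every name below is a hypothetical placeholder for illustration, not actual NAO tooling:

```python
# Hypothetical skeleton of the four-stage monitoring loop above.
# All names are illustrative placeholders, not real NAO tooling.

def collect_samples() -> list[bytes]:
    """Stage 1: ongoing collection, e.g. municipal sewage sampling."""
    ...

def sequence(samples: list[bytes]) -> list[str]:
    """Stage 2: metagenomic sequencing; yields reads as base strings."""
    ...

def detect(reads: list[str]) -> list[str]:
    """Stage 3: novel-pathogen detection (the focus of this post)."""
    ...

def follow_up(candidates: list[str]) -> None:
    """Stage 4: targeted tests for each specific thing flagged."""
    ...

def monitoring_cycle() -> None:
    follow_up(detect(sequence(collect_samples())))
```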
While there are important open questions in all four of these, I’ve
been most focused on the third: once you have metagenomic sequencing
data, what do you do?
I see four main approaches. You can look for sequences that are:
Dangerous:
There are some genetic sequences that code for dangerous things
that we should not normally see in our samples. If you see a series
of base pairs that are unique to smallpox, that’s very concerning! The
main downside of this approach, if you want to extend it beyond
smallpox and other well-known threats, is that you need to make a list
of non-obvious dangerous things, which is in itself a dangerous thing
to do: what if your list is stolen and it points people to sequences
they wouldn’t have thought to try using?
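Mechanically, this kind of screen can be very simple. Here’s a toy Python sketch of matching reads against k-mers from a set of flagged sequences; the parameters are arbitrary, and a real system would also handle reverse complements and sequencing errors:

```python
# Toy screen: flag any read sharing a k-mer with a flagged sequence.
# K=31 is a common choice: long enough that chance matches are rare.
K = 31

def kmers(seq: str, k: int = K) -> set[str]:
    """All length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_blocklist(flagged_sequences: list[str]) -> set[str]:
    """Precompute every k-mer appearing in any flagged sequence."""
    blocklist: set[str] = set()
    for seq in flagged_sequences:
        blocklist |= kmers(seq)
    return blocklist

def screen_read(read: str, blocklist: set[str]) -> bool:
    """True if the read contains any flagged k-mer."""
    return not kmers(read).isdisjoint(blocklist)
```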
The list-leakage concern is similar to another problem: how do you check if people are
synthesizing dangerous sequences without risking a list of all the
things that shouldn’t be synthesized? SecureDNA has been working on
this problem, using an encrypted database with a distributed-key
system that allows flagging sequences without making it practical to
extract a list of all flagged sequences (paper).
There are some blockers to using SecureDNA for this today, since it
was designed for slightly different constraints, but I think they are
all surmountable and I’m hoping to implement a SecureDNA-based
metagenomic sequencing screening system at some point in the next
year.
An alternative and somewhat longer-term approach here would be to use
tools that are able to estimate the function of novel sequences to
extend this to sequences that aren’t closely derived from existing
ones. I’m less enthusiastic about this: not only could this work end
up increasing risk by improving humanity’s ability to judge how
dangerous a novel sequence is, but it’s also not clear to me that this
approach is likely to catch things the other methods wouldn’t.
Modified:
Someone engineering a new virus for a stealth pandemic would likely
find it easiest to begin with an existing virus. If we see a
sequencing read where part
matches a known viral genome and part does not (a “chimera”), one
potential explanation is that the read comes from a genetically
engineered virus.
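As a sketch of what flagging chimeras could look like computationally: classify each position of a read by whether its k-mer matches any known genome, and flag reads that contain both a long matching stretch and a long non-matching one. A real pipeline would use a proper aligner (e.g. Bowtie2 or BLAST) rather than this exact-match toy, and the thresholds here are arbitrary:

```python
# Toy chimera detector: flag reads where one long stretch matches
# known genomes and another long stretch doesn't.
K = 31

def kmer_hits(read: str, known_kmers: set[str], k: int = K) -> list[bool]:
    """For each position, does the k-mer starting there match?"""
    return [read[i:i + k] in known_kmers
            for i in range(len(read) - k + 1)]

def longest_run(hits: list[bool], value: bool) -> int:
    """Length of the longest consecutive run of `value` in hits."""
    best = cur = 0
    for h in hits:
        cur = cur + 1 if h == value else 0
        best = max(best, cur)
    return best

def looks_chimeric(read: str, known_kmers: set[str],
                   min_run: int = 20) -> bool:
    """Part known, part novel: the signature of a possible chimera."""
    hits = kmer_hits(read, known_kmers)
    return (longest_run(hits, True) >= min_run
            and longest_run(hits, False) >= min_run)
```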
But this is not the only reason this approach could flag a read. For
example, it could come from:
Lack of knowledge. Perhaps a virus has a lot of variation, much
more than is reflected in the databases you are using to define
“normal”. It will look like you have found a novel virus when it’s
just an incomplete database. And, of course, the database will always
be incomplete: viruses are always evolving. Still, this seems
tractable: handling these initial false positives requires expanding
our knowledge of the variety of existing viruses, but that is
something many virologists are deeply interested in.
Sequencing: perhaps some of the biological processing you do
prior to (or during) sequencing can attach unrelated fragments to each
other. When you see a chimera, how do you know whether it existed in
the sample you originally collected or was created accidentally in the
lab? On the other hand, you can (a) compare the fraction of chimeras
across different sequencing approaches and pick ones where this is
rare, and (b) pay more attention to cases where you’ve seen the same
chimera multiple times.
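For (b), the bookkeeping can be as simple as counting how many times each junction recurs across reads. In this sketch, `junction_key` is a hypothetical function that maps every read spanning the same breakpoint to the same key (say, the bases immediately flanking it):

```python
from collections import Counter

def junction_key(read: str) -> str:
    """Hypothetical: canonical representation of a read's chimeric
    junction, e.g. the k-mers flanking the known/unknown breakpoint."""
    ...

def recurrent_chimeras(chimeric_reads: list[str],
                       min_count: int = 3) -> dict[str, int]:
    """Junctions seen several times are much less likely to be one-off
    lab artifacts than junctions seen only once."""
    counts = Counter(junction_key(r) for r in chimeric_reads)
    return {j: n for j, n in counts.items() if n >= min_count}
```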
Biological chimerism: bacteria will occasionally incorporate
viral sequences. This method would flag such reads as genetic
engineering even though this incorporation is a natural and
unconcerning process. As long as it is rare enough, however, we can
deal with it by surfacing these reads to a biologist, who figures out
how concerned to be and what next steps make sense.
This is the main approach I’ve been working on lately, trying to get
the false positive rate down.
New:
If we understood what “normal” looked like well enough, then we
could flag anything new for investigation. This is a serious research
project: if you take data from a sewage sample and run it through
basic tooling, it’s common
to have 50% of reads unclassified. Making progress here will
require, among other things, much better tooling (and maybe
algorithms) for metagenomic assembly: I’m not aware of anything that
could efficiently integrate trillions of bases a week into an assembly
graph.
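To make the scale problem concrete, here’s the naive version of novelty detection: remember every k-mer you’ve ever seen and flag reads that are mostly unseen. A plain in-memory set like this collapses long before trillions of bases a week, which is exactly the tooling gap; practical versions would need probabilistic structures (Bloom filters, count-min sketches) or disk-backed indexes. The threshold is an arbitrary placeholder:

```python
# Naive novelty flagger: a read is "new" if most of its k-mers have
# never been observed before. Fine for a demo; hopeless at
# trillions of bases per week, which is the tooling gap above.
K = 31

class NoveltyTracker:
    def __init__(self, novel_fraction: float = 0.8):
        self.seen: set[str] = set()           # every k-mer observed so far
        self.novel_fraction = novel_fraction  # arbitrary threshold

    def process(self, read: str) -> bool:
        """Return True if the read looks novel; record its k-mers."""
        kms = [read[i:i + K] for i in range(len(read) - K + 1)]
        if not kms:
            return False
        unseen = sum(km not in self.seen for km in kms)
        is_novel = unseen / len(kms) >= self.novel_fraction
        self.seen.update(kms)
        return is_novel
```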
Ryan Teo, a first-year
graduate student with Nicole
Wheeler at the University of Birmingham, has started his thesis in
this area, which I’m really excited to see. Lenni Justen, another
first-year graduate student, working with Kevin Esvelt, is also
exploring this area as part of his work with the
NAO. I’d be excited to see more work, however, and if you’re working
on this or interested in working on it but blocked by not having
access to enough metagenomic sequencing data, please get in touch!
Growing:
It may turn out that our samples are deeply complex:
potentially, as you sequence more, the rate of seeing new things falls
off very slowly. If it falls off slowly enough, then you will keep
seeing “new” things that are just so rare that you haven’t happened to
see them before. I am quite unsure how likely this is, and I expect
it varies by sample type (sewage is likely much more complex than,
say, blood), but it seems possible. An approach that’s robust to this
is to flag things based on their growth pattern rather than novelty
alone: first you’ve never seen it, then
you see it once, then you start seeing it more often, then you start
seeing it many times per sample. In theory a new pandemic should begin
with approximately exponential spread, since with few people already
infected the number of new infections should be proportional to
the number of infectious people.
At the NAO we’ve been calling this “exponential
growth detection” (EGD). We worked on this some in 2022, but have
put it on hold until we have a deep enough time-series dataset to work
with.
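A minimal sketch of what EGD could look like: track something’s count across successive samples (a k-mer works for the genome-free version mentioned below) and fit a log-linear model, flagging consistently positive slopes. The pseudocount and thresholds are arbitrary placeholders:

```python
import math

def growth_rate(counts: list[int]) -> float:
    """Least-squares slope of log(count + 1) against sample index;
    under exponential growth this approximates the growth rate.
    The +1 pseudocount tolerates early samples with zero counts."""
    ys = [math.log(c + 1) for c in counts]
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    var = sum((x - x_mean) ** 2 for x in range(n))
    return cov / var

def looks_exponential(counts: list[int], min_slope: float = 0.5) -> bool:
    """Flag series like [0, 0, 1, 3, 9, 25] but not flat noise."""
    return len(counts) >= 4 and growth_rate(counts) >= min_slope

# e.g. looks_exponential([0, 0, 1, 2, 6, 17, 44]) -> True
```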
These approaches can also be combined: if a sequence originally comes
to your attention because it’s chimeric but you’re not sure how
seriously to take it, you could look at the growth pattern of its
components. Or, while you can detect growing things with a
genome-free approach simply by looking for k-mers whose counts are
increasing, the kind
of “thoroughly understand the metagenome” work that I described above
as an approach for identifying new things can also be used to make a
much more sensitive tool that detects growing things.
In terms of prioritization, I’m enthusiastic about work on all of
these, and would like to see them progress in parallel. The
approaches of detecting dangerous and modified sequences require less
scientific progress and should work on amounts of data that are
achievable with philanthropic funding. De novo protein
design is getting more capable and more accessible, however, which
allows creation of pathogens those two methods don’t catch. We will
need approaches that don’t depend on matching known things, which is
where detecting new and/or growing sequences comes in. Those two
methods will require a lot more data, enough that unless sequencing
goes through another round of the kind of massive cost improvement we
saw in 2008-2011 we’re talking about large-scale government-funded
projects. Advances in detection methods make it more likely that
we’ll be able to make the case for these larger projects, and reduce
the risk that detection ability might lag behind infrastructure creation.