There has been surprisingly little written on concrete threat models for how AI leads to existential catastrophes (though you’ve done some great work rectifying this!). Why is this? And what are the most compelling threat models that don’t have good public write-ups? In particular, are there under-appreciated threat models that would lead to very different research priorities within Alignment?
And what are the most compelling threat models that don’t have good public write-ups?
It depends on how you slice and dice the space (and what counts as a “threat model”); I don’t have a good answer for this. In general I feel like a threat model is something everyone can make for themselves, a model of the space of threats, rather than an item on a short list of things you might discover.
We could talk about particular threats that don’t have good public write-ups. I feel like there are various humans-are-fragile-so-weak-AI-takes-over-when-world-falls-apart possibilities and those haven’t been written up very well.
I think Neel is using this in the sense I use the phrase: you carve up the space of threats in some way, and then a “threat model” is one of the pieces you carved out, rather than the way in which you carved it up.
This is meant to be similar to how, in security, there are many possible kinds of risks you might be worried about, but you then choose a particular set of capabilities that an attacker could have and call that a “threat model”. This probably doesn’t capture every setting you care about, but it does capture one particular piece of it.
(Though maybe in security the hope is to choose a threat model that actually contains all the threats you expect in reality, so perhaps this analogy isn’t the best.)
(I think “that’s a thing that people make for themselves” is also a reasonable response for this meaning of “threat model”.)
From that perspective, I guess by default I’d think of a threat as something like “This particular team of hackers with this particular motive” and a threat model as something like “Maybe they have one or two zero days, their goal is DoS or exfiltrating information, they may have an internal collaborator but not one with admin privileges...”. And then the number of possible threat models is vast even compared to the vast space of threats.
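To make that distinction concrete, here is a minimal illustrative sketch in Python (the class names and fields are hypothetical, chosen only to mirror the example above, not anything from the discussion): a threat is one specific attacker, while a threat model is a set of assumed capabilities that either does or does not cover a given threat.

```python
from dataclasses import dataclass

# Hypothetical sketch: a "threat" is one concrete attacker, while a
# "threat model" is a set of assumed attacker capabilities that many
# different concrete attackers may or may not fall under.

@dataclass
class ThreatModel:
    """Assumed attacker capabilities, not a specific attacker."""
    max_zero_days: int = 2                  # "one or two zero days"
    goals: tuple = ("DoS", "exfiltration")  # what the attacker is after
    insider_access: str = "non-admin"       # collaborator, but no admin privileges

@dataclass
class Threat:
    """One concrete attacker with a particular motive."""
    actor: str
    motive: str
    zero_days: int
    goal: str
    insider_access: str  # "none", "non-admin", or "admin"

def covered_by(threat: Threat, model: ThreatModel) -> bool:
    """Does this concrete threat fall inside the threat model?"""
    return (
        threat.zero_days <= model.max_zero_days
        and threat.goal in model.goals
        and threat.insider_access in ("none", model.insider_access)
    )

# Example: one threat that the model above does cover.
t = Threat(actor="some hacking team", motive="profit",
           zero_days=1, goal="exfiltration", insider_access="none")
assert covered_by(t, ThreatModel())
```

The point of the sketch is just that the space of possible constraint sets like `ThreatModel` is vast, arguably even more so than the space of concrete `Threat` instances.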
There has been surprisingly little written on concrete threat models for how AI leads to existential catastrophes (though you’ve done some great work rectifying this!). Why is this?
I’m not sure why there isn’t more work on concrete descriptions of possible futures and how they go wrong. Some guesses:
Anything concrete is almost certainly wrong. People are less convinced that it’s useful given that it will be wrong, and so they write vaguer / more abstract stories that might describe reality, at the expense of having less detail.
It’s not exactly clear what you do with such a story or what the upside is; it’s kind of a vague theory of change, and most people have some specific theory of change they are more excited about (even if this kind of story is a bit of a public good that’s useful on a broader variety of perspectives / to people who are skeptical).
Any detailed story produced in a reasonable amount of time will also be obviously wrong to someone who notices the right considerations or has the right background. It’s very demoralizing to write something that someone will recognize as obviously incoherent/wrong (especially if you expect that to get pointed out and taken by some to undermine your view).
It just kind of takes a long time and is hard, and people don’t do that many hard things that take a long time.
A lot of the people most interested in futurism are into very fast-takeoff models, where there isn’t as much to say, and they maybe feel like it’s mostly been said.
(I think that “threat models” is somewhat broader / different from concrete stories, and it’s a bit less clear to me exactly how much people have done or what counts.)
It’s not exactly clear what you do with such a story or what the upside is; it’s kind of a vague theory of change, and most people have some specific theory of change they are more excited about (even if this kind of story is a bit of a public good that’s useful on a broader variety of perspectives / to people who are skeptical).
Ah, interesting! I’m surprised to hear that. I was under the impression that while many researchers had a specific theory of change, it was often motivated by an underlying threat model, and that different threat models lead to different research interests.
Eg, someone worried about a future where AIs control the world but are not human-comprehensible feels very different from someone worried about a world where we produce an expected utility maximiser with a subtly incorrect objective, resulting in bad convergent instrumental goals.
Do you think this is a bad model of how researchers think? Or are you, eg, arguing that having a detailed, concrete story isn’t important here, just the vague intuition for how AI goes wrong?
I think most people have expectations regarding e.g. how explicitly systems will represent their preferences, how much they will have preferences at all, how that will relate to the optimization objectives used in ML training, how well they will be understood by humans, etc.
Then there are a bunch of different things you might want: articulations of particular views on some of those questions, stories that (in virtue of being concrete) show a whole set of guesses and how they can lead to a bad or good outcome, etc. My bullet points were mostly about the exercise of fleshing out a particular story (which is therefore most likely to be wrong), rather than e.g. thinking about particular questions about the future.