There has been surprisingly little written on concrete threat models for how AI leads to existential catastrophes (though you’ve done some great work rectifying this!). Why is this?
I’m not sure why there isn’t more work on concrete descriptions of possible futures and how they go wrong. Some guesses:
- Anything concrete is almost certainly wrong. People are less convinced it's useful given that it will be wrong, and so opt for vaguer / more abstract stories that might actually describe reality, at the expense of having less detail.
- It's not exactly clear what you do with such a story or what the upside is; it's kind of a vague theory of change, and most people have some specific theory of change they are more excited about (even though this kind of story is a bit of a public good that's useful across a broader variety of perspectives / to people who are skeptical).
- Any detailed story produced in a reasonable amount of time will also be obviously wrong to someone who notices the right considerations or has the right background. It's very demoralizing to write something that someone is going to recognize as obviously incoherent/wrong (especially if you expect that to get pointed out and taken by some to undermine your view).
- It just kind of takes a long time and is hard, and people don't do that many hard things that take a long time.
- A lot of people most interested in futurism are into very fast-takeoff models, where there isn't as much to say and they maybe feel like what there is has mostly been said.
(I think that "threat models" is somewhat broader than / different from concrete stories, and it's a bit less clear to me exactly how much people have done or what counts.)
> It's not exactly clear what you do with such a story or what the upside is; it's kind of a vague theory of change, and most people have some specific theory of change they are more excited about (even though this kind of story is a bit of a public good that's useful across a broader variety of perspectives / to people who are skeptical).
Ah, interesting! I'm surprised to hear that. I was under the impression that while many researchers had a specific theory of change, it was often motivated by an underlying threat model, and that different threat models led to different research interests.
E.g., someone worried about a future where AIs control the world but are not human-comprehensible feels very different from someone worried about a world where we produce an expected utility maximiser with a subtly incorrect objective, resulting in bad convergent instrumental goals.
Do you think this is a bad model of how researchers think? Or are you, e.g., arguing that having a detailed, concrete story isn't important here, just the vague intuition for how AI goes wrong?
I think most people have expectations regarding, e.g., how explicitly systems will represent their preferences, how much they will have preferences at all, how that will relate to the optimization objectives used in ML training, how well they will be understood by humans, etc.
Then there are a bunch of different things you might want: articulations of particular views on some of those questions, stories that (in virtue of being concrete) show a whole set of guesses and how they can lead to a bad or good outcome, etc. My bullet points were mostly regarding the exercise of fleshing out a particular story (which is therefore most likely to be wrong), rather than e.g. thinking about particular questions about the future.