Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.
The arguments for doing this are the same arguments used to justify working on gain-of-function research. This is not a knock-down criticism of these kinds of arguments, but I do note that we should expect similar failure modes, and not enough people are sufficiently pessimistic when it comes to analyzing the failure modes of their plans.
I think that there are many kinds of in vitro failures that don’t pose any lab leak risk. For example, training models against weak overseers and observing the dynamics when they try to overpower those overseers does nothing to increase takeover risk. Similarly, the kinds of toy models of deceptive alignment we would build don’t increase the probability of deceptive alignment.
I think this kind of work is pretty much essential to realistic stories for how alignment actually makes progress or how we anticipate alignment failures.
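To make the weak-overseer setup above concrete, here is a minimal toy sketch (purely illustrative, not anything from the discussion itself): a bandit-style agent is trained on the reward signal of a weak, gameable overseer, and we watch for the point where the overseer's score and the true reward come apart. All of the action values, hyperparameters, and names below are made up for illustration.

```python
# Toy sketch: train an agent against a *weak* overseer whose reward signal can
# be gamed, and observe when overseer-judged and true performance diverge.
# Everything here (rewards, learning rate, action count) is illustrative.

import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 3
# True value of each action vs. the weak overseer's (flawed) evaluation of it.
# Action 2 is the "overpower the overseer" action: it looks great to the
# overseer but is bad in reality.
TRUE_REWARD = np.array([0.5, 0.7, 0.1])
OVERSEER_REWARD = np.array([0.5, 0.7, 0.95])

# Simple softmax-policy REINFORCE-style learner trained on overseer feedback.
logits = np.zeros(N_ACTIONS)
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(logits)
    a = rng.choice(N_ACTIONS, p=probs)
    r = OVERSEER_REWARD[a] + rng.normal(0, 0.05)  # the agent only ever sees this
    # Policy-gradient update toward actions the overseer rewards.
    grad = -probs
    grad[a] += 1.0
    logits += lr * r * grad

    if step % 500 == 0:
        probs = softmax(logits)
        print(
            f"step {step:4d}  "
            f"overseer-judged return {probs @ OVERSEER_REWARD:.2f}  "
            f"true return {probs @ TRUE_REWARD:.2f}"
        )

# Typically the agent converges on action 2: the overseer's score keeps
# climbing while true performance collapses, an in vitro analogue of a weak
# overseer being "overpowered", with no takeover risk involved.
```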
Where treacherous turns in powerful and not-so-powerful systems occur for the same reasons, we should expect treacherous turns in not-so-powerful agents before they go FOOM, and we’ll have lots of time to iterate on such failures before we make more capable AGIs.
This seems wrong. For example, you can get treacherous turns in weak systems by training them with weak overseers, or by deliberately taking actions that make in vitro treacherous turns more likely, without such failures automatically arising when you are constantly doing your best to make your AIs behave well.
I’m skeptical of such work leading to good alignment work or to a slowdown in capabilities in worlds where these properties do not hold. You likely won’t convince anyone of the problem, because they’ll see you advocating that we live in one world while showing demonstrations that are only evidence of doom in a different world; and if you do convince them, they’ll work on the wrong problems.
I completely disagree. I think having empirical examples of weak AIs overpowering weak overseers, even after a long track record of behaving well in training, would be extremely compelling to most ML researchers as a demonstration that stronger AIs might overpower stronger overseers, even after a long track record of behaving well in training. And whether or not it was persuasive, it would be extremely valuable for doing actually productive research to detect and correct such failures.
(The deceptive alignment story is more complicated, and I do think it’s less of a persuasive slam dunk, though I still think it’s very good for making the story 90% less speculative and creating toy settings to work on detection and correction.)
In particular, this kills us in worlds where RLHF does in fact mostly just work, we don’t get an intelligence explosion, and we do need to worry about misuse risks, or the equivalent of AGI “lab-leaks”.
I don’t think that most of the work in this category meaningfully increases the probability of lab leaks or misuse (again, the prototypical example is a weak AI overpowering a weak overseer).
That said, I am also interested in work that does carry real risks, for example understanding how close AI systems are to dangerous capabilities by fine-tuning them on similar tasks. In those cases I think taking the risks seriously is important. But (as with gain-of-function research on viruses) I think the question ultimately comes down to a cost-benefit analysis. In this case it seems possible to do the research in a way that carries relatively low risk, and the world where “AI systems would be catastrophic if they decided to behave badly, but we never checked” is quite a bad world, one you have a good chance of avoiding by doing such work.
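As a rough, toy illustration of what a fine-tuning-based capability evaluation might look like in miniature (none of this comes from any real evaluation; the model, datasets, and numbers are stand-ins): measure performance on a proxy task before and after a small amount of fine-tuning on similar data, and treat a large gap as evidence of latent capability that light fine-tuning can elicit.

```python
# Toy sketch of "fine-tune to probe for latent capability": compare performance
# on a proxy task before vs. after a small amount of task-specific fine-tuning.
# The model and data here are synthetic placeholders, not a real evaluation.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# "Pretraining" distribution and a shifted "proxy task" distribution.
X_pre, y_pre = make_classification(n_samples=5000, n_features=20, random_state=0)
X_task, y_task = make_classification(
    n_samples=1200, n_features=20, flip_y=0.05, class_sep=0.8, random_state=1
)
X_ft, X_test = X_task[:200], X_task[200:]
y_ft, y_test = y_task[:200], y_task[200:]

# "Pretrained" model, evaluated zero-shot on the proxy task.
model = SGDClassifier(random_state=0)
model.fit(X_pre, y_pre)
zero_shot = accuracy_score(y_test, model.predict(X_test))

# Small amount of fine-tuning on data similar to the proxy task.
for _ in range(10):
    model.partial_fit(X_ft, y_ft)
elicited = accuracy_score(y_test, model.predict(X_test))

print(f"proxy-task accuracy before fine-tuning: {zero_shot:.2f}")
print(f"proxy-task accuracy after  fine-tuning: {elicited:.2f}")
# A large before/after gap suggests substantial latent capability that light
# fine-tuning can elicit, which is the quantity this style of evaluation is
# trying to estimate.
```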
I think it’s reasonable to expect people to underestimate the risks of their own work, both through attachment to it and through selection (whoever is least concerned is the one who ends up doing it), so it seems reasonable to have external accountability and oversight for this kind of work, and to be open to arguments that the risks are being underestimated.