RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line. RLHF is increasingly important as time goes on, but it also becomes increasingly overdetermined that people would have done it. In general I think your expectation should be that incidental capabilities progress from safety research is a small part of total progress, given that it’s a small fraction of people, very much not focused on accelerating things effectively, in a domain with diminishing returns to simultaneous human effort. This can be overturned by looking at details in particular cases, but I think safety people making this argument mostly aren’t engaging with details in a realistic way.
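(For concreteness, here is a minimal sketch of the recipe the rest of the thread debates: a reward model fit to pairwise human preference comparisons, which is then used as the objective when fine-tuning the policy. Everything below is a toy PyTorch stand-in, with invented names and tensors, not any lab's actual implementation.)

```python
# Toy sketch of the reward-modelling step in a generic RLHF recipe.
# All names and tensors here are invented stand-ins, not any lab's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Scores a fixed-size embedding of a response with a single scalar."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(rm, chosen, rejected):
    # Bradley-Terry style objective: the human-preferred response should score higher.
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

rm = ToyRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-2)
chosen = torch.randn(64, 16)    # stand-in embeddings of preferred responses
rejected = torch.randn(64, 16)  # stand-in embeddings of dispreferred responses

for _ in range(200):
    opt.zero_grad()
    preference_loss(rm, chosen, rejected).backward()
    opt.step()

# The fitted reward model is then used as the objective for fine-tuning the
# policy (e.g. with PPO), which is the step whose externalities the thread debates.
```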
I think this argument, if true, mostly says that your work on RLHF must have been net-neutral, because people would have done RLHF even if nobody did it for the purposes of alignment. If false, then RLHF was net-negative because of its capabilities externalities. I also don’t buy your argument about relative numbers of people working on capabilities versus alignment. Yes, more people are in the ML field than the alignment field, but the vast majority of the ML field is not so concerned about AGI, and more concerned about making local progress. It is also far easier to make progress on capabilities than alignment, especially when you’re not trying to make progress on alignment’s core problems, and instead trying to get very pretty lines on graphs so you can justify your existence to your employer. It also, empirically, just seems weird that GPT and RLHF were both developed as alignment strategies, yet have so many uses in capabilities.
I also note that strategies like
Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.
rest on the same arguments used to justify working on gain-of-function research. This is not a knock-down criticism of these kinds of arguments, but I do note we should expect similar failure-modes, and not enough people are sufficiently pessimistic when it comes to analyzing failure-modes of their plans. In particular, this kills us in worlds where RLHF does in fact mostly just work, we don’t get an intelligence explosion, and we do need to worry about misuse risks, or the equivalent of AGI “lab-leaks”. I think such worlds are unlikely, but I also think most of the benefits of such work only occur in such worlds. Where treacherous turns in powerful and not so powerful systems occur for the same reasons, we should expect treacherous turns in not so powerful agents before they go FOOM, and we’ll have lots of time to iterate on such failures before we make more capable AGIs. I’m skeptical of such work leading to good alignment work or a slow-down in capabilities in worlds where these properties do not hold. You likely won’t convince anyone of the problem because they’ll see you advocating for us living in one world yet showing demonstrations which are only evidence of doom in a different world, and if you do they’ll work on the wrong problems.
I think this argument, if true, mostly says that your work on RLHF must have been net-neutral, because people would have done RLHF even if nobody did it for the purposes of alignment.
Doing things sooner and in a different way matters.
This argument is like saying that scaling up language models is net-neutral for AGI, because people would have done it anyway for non-AGI purposes. Doing things sooner matters a lot. I think in most of science and engineering that’s the main kind of effect that anything has.
If false, then RLHF was net-negative because of its capabilities externalities.
No, if false then it has a negative effect which must be quantitatively compared against positive effects.
Most things have some negative effects (e.g. LW itself).
It is also far easier to make progress on capabilities than alignment
This doesn’t seem relevant—we were asking how large an accelerating effect alignment researchers have relative to capabilities researchers (since that determines how many days of speed-up they cause), so if capabilities progress is easier then that seems to increase both numerator and denominator.
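(A toy calculation with made-up numbers may make the ratio point concrete: if capabilities progress becomes uniformly easier, both the incidental capabilities output of alignment researchers and the field's output per day scale together, so the implied days of speed-up are unchanged.)

```python
# Made-up numbers, just to illustrate the numerator/denominator point.
def days_of_speedup(alignment_output, field_output_per_day):
    # days of acceleration ~ incidental capabilities output of alignment researchers
    #                        divided by the whole field's capabilities output per day
    return alignment_output / field_output_per_day

base = days_of_speedup(alignment_output=10.0, field_output_per_day=100.0)

# If capabilities progress becomes 3x "easier", both quantities scale together,
# so the implied number of days of speed-up is unchanged.
easier = days_of_speedup(alignment_output=3 * 10.0, field_output_per_day=3 * 100.0)

assert abs(base - easier) < 1e-12   # both come out to 0.1 in days-of-speedup units
```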
especially when you’re not trying to make progress on alignment’s core problems, and instead trying to get very pretty lines on graphs so you can justify your existence to your employer.
To the extent this is a claim about my motivations, I think it’s false. (I don’t think it should look especially plausible from the outside given the overall history of my life.)
As a claim about what matters to alignment and what is “core” it’s merely totally unjustified.
It also, empirically, just seems weird that GPT and RLHF were both developed as alignment strategies, yet have so many uses in capabilities.
This is false, so it makes sense it would seem weird!
Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.
rest on the same arguments used to justify working on gain-of-function research. This is not a knock-down criticism of these kinds of arguments, but I do note we should expect similar failure-modes, and not enough people are sufficiently pessimistic when it comes to analyzing failure-modes of their plans.
I think that there are many kinds of in vitro failures that don’t pose any lab leak risk. For example, training models against weak overseers and observing the dynamics when they try to overpower those overseers doesn’t do anything to increase takeover risk. Similarly, the kinds of toy models of deceptive alignment we would build don’t increase the probability of deceptive alignment.
I think this kind of work is pretty much essential to realistic stories for how alignment actually makes progress or how we anticipate alignment failures.
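(To make that kind of setup concrete, here is a minimal toy sketch with invented numbers, not a description of any real experiment: a softmax policy trained by REINFORCE purely on the reward a weak, foolable overseer assigns tends to learn the "overpower the overseer" action, while a held-out "true" reward shows that behaviour is catastrophic. The action names and reward values are all hypothetical.)

```python
# Toy "weak overseer" setup, illustrative only: actions and rewards are invented.
# A softmax policy is trained by REINFORCE on the reward a weak overseer assigns;
# one action games that overseer and scores well by its lights, even though the
# true reward for it is terrible.
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["comply", "slack off", "overpower overseer"]
overseer_reward = np.array([1.0, 0.2, 1.5])   # the weak overseer is fooled by action 2
true_reward = np.array([1.0, 0.2, -10.0])     # what we actually care about

logits = np.zeros(3)
lr = 0.1
for _ in range(3000):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(3, p=probs)
    grad_logp = -probs
    grad_logp[a] += 1.0                             # gradient of log pi(a) w.r.t. logits
    logits += lr * overseer_reward[a] * grad_logp   # training signal is the overseer's

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(dict(zip(ACTIONS, probs.round(3))))
print("expected overseer-assigned reward:", round(float(probs @ overseer_reward), 3))
print("expected true reward:", round(float(probs @ true_reward), 3))
# The policy is pushed toward "overpower overseer": high reward as judged by the
# weak overseer, disastrous true reward. The failure comes entirely from the
# exploitable proxy signal, which is the in vitro dynamic described above.
```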
Where treacherous turns in powerful and not so powerful systems occur for the same reasons, we should expect treacherous turns in not so powerful agents before they go FOOM, and we’ll have lots of time to iterate on such failures before we make more capable AGIs
This seems wrong. For example, you can get treacherous turns in weak systems if you train them with weak overseers, or if you deliberately take actions that make in vitro treacherous turns more likely, without automatically getting such failures if you are constantly doing your best to make your AIs behave well.
I’m skeptical of such work leading to good alignment work or a slow-down in capabilities in worlds where these properties do not hold. You likely won’t convince anyone of the problem because they’ll see you advocating for us living in one world yet showing demonstrations which are only evidence of doom in a different world, and if you do they’ll work on the wrong problems.
I completely disagree. I think having empirical examples of weak AIs overpowering weak overseers, even after a long track record of behaving well in training, would be extremely compelling to most ML researchers as a demonstration that stronger AIs might overpower stronger overseers, even after a long track record of behaving well in training. And whether or not it was persuasive, it would be extremely valuable for doing actually productive research to detect and correct such failures.
(The deceptive alignment story is more complicated, and I do think it’s less of a persuasive slam dunk, though I still think it’s very good for making the story 90% less speculative and creating toy settings to work on detection and correction.)
In particular, this kills us in worlds where RLHF does in fact mostly just work, we don’t get an intelligence explosion, and we do need to worry about misuse risks, or the equivalent of AGI “lab-leaks”.
I don’t think that most of the work in this category meaningfully increases the probability of lab leaks or misuse (again, the prototypical example is a weak AI overpowering a weak overseer).
That said, I am also interested in work that does have real risks, for example understanding how close AI systems are to dangerous capabilities by fine-tuning them for similar tasks. In these cases I think taking risks seriously is important. But (as with gain-of-function research on viruses) I think the question ultimately comes down to a cost-benefit analysis. In this case it seems possible to do the research in a way with relatively low risk, and the world where “AI systems would be catastrophic if they decided to behave badly, but we never checked” is quite a bad world that you had a good chance of avoiding by doing such work.
I think it’s reasonable to expect people to underestimate risks of their own work via attachment to it and via selection (whoever is least concerned does it), so it seems reasonable to have external accountability and oversight for this and to be open to people making arguments that risks are underestimated.
Really? How so?
I don’t know all the details, but the idea was that a thing that mimicked humans and was capable would be safer than a thing that did lots of RL across a range of tasks and was powerful, so the creator of the architecture worked on improving text generation.
I don’t think this is true. Transformers were introduced by normal NLP researchers at Google. Generative pre-training is a natural thing to do with them, introduced at OpenAI by Alec Radford (blog post here) with no relationship to alignment.
I just looked into it, and it turns out you’re right. I think I was given a misimpression of the motivations here due to much OpenAI research at the time being vaguely motivated by “let’s make AGI, and let’s make it good”, but it was actually largely divorced from modern alignment considerations.
And this is actually pretty reasonable as a strategy, given their general myopia by default and their simulator nature playing well with alignment ideas like HCH. If we could avoid a second optimizer arising, then this scaled up would be nearly ideal for automated research on, say, alignment. But RLHF ruined it, and this was IMO a good example of an alignment strategy that looked good but wasn’t actually good.
I’m not quite clear on what you are saying here. If conditioning generative models is a reasonably efficient way to get work out of an AI, we can still do that. Unfortunately it’s probably not an effective way to build an AI, and so people will do other things. You can convince them that other things are less safe and then maybe they won’t do other things.
Are you saying that maybe no one would have thought of using RL on language models, and so we could have gotten away with a world where we used AI inefficiently because we didn’t think of better ideas? In my view (based e.g. on talking a bunch to people working at OpenAI labs prior to me working on RLHF) that was never a remotely plausible outcome.
ETA: also just to be clear I think that this (the fictional strategy of developing GPT so that future AIs won’t be agents) would be a bad strategy, vulnerable to 10-100x more compelling versions of the legitimate objections being raised in the comments.
Basically, I’m talking about how RLHF removed a very valuable property called myopia. If you had myopia by default, as with, say, the GPT series of simulators, then you just had to apply the appropriate decision theory, like LCDT, and the GPT series of simulators could do something like HCH or IDA in real life. But RLHF removed myopia, and thus deceptive alignment and mesa-optimization are possible, and arguably incentivized, under a non-myopic scheme. This is probably a harder problem to solve than aligning a non-agentic system.
I’ll provide a link below:
https://www.lesswrong.com/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written
Now you do mention that RLHF is more capable, and yeah that is sort of depressing that the most capable models coincide with the most deceptive models.
I don’t think GPT has the sense of myopia relevant to deceptive alignment any more or less than models fine-tuned with RLHF. There are other bigger impacts of RLHF both for the quoted empirical results and for the actual probability of deceptive alignment, and I think the concept is being used in a way that is mostly incoherent.
But I was mostly objecting to the claim that RLHF ruined [the strategy]. I think even granting the contested empirics it doesn’t quite make sense to me.
Sorry to respond late, but a crux I might have here is that I see the removal of myopia and the addition of agency/non-causal decision theories as a major negative for an alignment plan by default; without specific mechanisms for why deceptive alignment/mesa-optimizers can’t arise, I expect non-myopic training to find such things.
In general, the fact that OpenAI chose RLHF made the problem quite a bit harder, and I suspect this is an example of Goodhart’s law in action.
The Recursive Reward Modeling and debate plans could make up for this, assuming we can solve deceptive alignment. But right now, I see trouble ahead and OpenAI is probably going to be bailed out by other alignment groups.
Why should we think of base GPT as myopic, such that “non-myopic training” can remove that property? Training a policy to imitate traces of “non-myopic cognition” in the first place seems like a way to plausibly create a policy that itself has “non-myopic cognition”. But this is exactly how GPT pretraining works.
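(For concreteness in this sub-thread, here is a minimal sketch, with a toy PyTorch model and invented details, of the two training signals being contrasted: the pretraining loss is a sum of per-token prediction terms, each conditioned only on a prefix, while an RLHF-style update in its simplest REINFORCE form scores a whole sampled sequence with a single scalar reward. Whether that difference amounts to the kind of “myopia” that matters for deceptive alignment is exactly what is contested above.)

```python
# Toy model and data; this sketches the shape of the two objectives, not a claim
# about what either objective produces inside a real model.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, BATCH, LEN = 32, 64, 8, 16
embed = nn.Embedding(VOCAB, DIM)
rnn = nn.GRU(DIM, DIM, batch_first=True)
head = nn.Linear(DIM, VOCAB)
params = list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

# (a) Pretraining-style signal: one log-likelihood term per position, each
#     conditioned only on the prefix of the training text.
text = torch.randint(0, VOCAB, (BATCH, LEN))           # stand-in corpus
hidden, _ = rnn(embed(text[:, :-1]))
ce_loss = F.cross_entropy(head(hidden).reshape(-1, VOCAB), text[:, 1:].reshape(-1))

# (b) RLHF-style signal in its simplest (REINFORCE) form: roll out a whole
#     sequence from the model, score the finished sequence with one scalar
#     reward, and weight the log-probability of every token by that reward.
def toy_reward(seq):                                   # hypothetical reward-model stand-in
    return (seq == 7).float().mean(dim=1)

tok = torch.zeros(BATCH, 1, dtype=torch.long)          # fixed start token
h = None
logps, sampled = [], []
for _ in range(LEN):
    out, h = rnn(embed(tok), h)
    dist = torch.distributions.Categorical(logits=head(out[:, -1]))
    tok = dist.sample().unsqueeze(1)
    sampled.append(tok)
    logps.append(dist.log_prob(tok.squeeze(1)))
rollout = torch.cat(sampled, dim=1)
logp = torch.stack(logps, dim=1)
rl_loss = -(toy_reward(rollout).unsqueeze(1) * logp).mean()

opt.zero_grad()
(ce_loss + rl_loss).backward()                         # combined here only for brevity
opt.step()
```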
Huh, I’d not heard that, would be interested in hearing more about the thought process behind its development.
I think they could well turn out to be correct in that having systems with such a strong understanding of human concepts gives us levers we might not have had, though code-writing proficiency is a very unfortunate development.