I don’t know all the details, but the idea was that a thing that mimics humans and was capable would be safer than a thing that did lots of RL in a range of tasks and was powerful, so the creator of the architecture worked on improving text generation.
I don’t think this is true. Transformers were introduced by normal NLP researchers at Google. Generative pre-training is a natural thing to do with them, introduced at OpenAI by Alec Radford (blog post here) with no relationship to alignment.
I just looked into it, it turns out you’re right. I think I was given a misimpression of the motivations here due to much OpenAI research at the time being vaguely motivated by “lets make AGI, and lets make it good”, but it was actually largely divorced from modern alignment considerations.
And this is actually pretty reasonable as a strategy, given their general myopia by default and their simulator nature playing well with alignment ideas like HCH. If we could avoid a second optimizer arising, then this scaled up would be nearly ideal for automated research on say alignment. But RLHF ruined it, and this was IMO a good example of a looking good alignment strategy that wasn’t actually good.
I’m not quite clear on what you are saying here. If conditioning generative models is a reasonably efficient way to get work out of an AI, we can still do that. Unfortunately it’s probably not an effective way to build an AI, and so people will do other things. You can convince them that other things are less safe and then maybe they won’t do other things.
Are you saying that maybe no one would have thought of using RL on language models, and so we could have gotten a way with a world where we used AI inefficiently because we didn’t think of better ideas? In my view (based e.g. on talking a bunch to people working at OpenAI labs prior to me working on RLHF) that was never remotely plausible outcome.
ETA: also just to be clear I think that this (the fictional strategy of developing GPT so that future AIs won’t be agents) would be a bad strategy, vulnerable to 10-100x more compelling versions of the legitimate objections being raised in the comments.
Basically, I’m talking about how RLHF removed a very valuable property called myopia. If you had myopia by default, like say the GPT series of simulators, then you just had to apply the appropriate decision theory like LCDT, and the GPT series of simulators could do something like HCH or IDA on real life. But RLHF removed myopia, and thus deceptive alignment and mesa optimization is possible, arguably incentivized under a non-myopic scheme. This is probably harder to solve than having a non-agentic system alignment problem.
Now you do mention that RLHF is more capable, and yeah that is sort of depressing that the most capable models align well with the most deceptive models.
I don’t think GPT has the sense of myopia relevant to deceptive alignment any more or less than models fine-tuned with RLHF. There are other bigger impacts of RLHF both for the quoted empirical results and for the actual probability of deceptive alignment, and I think the concept is being used in a way that is mostly incoherent.
But I was mostly objecting to the claim that RLHF ruined [the strategy]. I think even granting the contested empirics it doesn’t quite make sense to me.
Sorry to respond late, but a crux I might have here is that I see the removal of myopia and the addition of agency/non-causal decision theories as a major negative of an alignment plan by default, and without specific mechanisms of how deceptive alignment/mesa optimizers can’t arise, I expect non-myopic training to find such things.
In general, the fact that OpenAI chose RLHF made the problem quite harder, and I suspect this is an example of Goodhart’s law in action.
The Recursive Reward Modeling and debate plans could make up for this, assuming we can solve deceptive alignment. But right now, I see trouble ahead and OpenAI is probably going to be bailed out by other alignment groups.
Why should we think of base GPT as myopic, such that “non-myopic training” can remove that property? Training a policy to imitate traces of “non-myopic cognition” in the first place seems like a way to plausibly create a policy that itself has “non-myopic cognition”. But this is exactly how GPT pretraining works.
Huh, I’d not heard that, would be interested in hearing more about the thought process behind its development.
Think they could well turn out to be correct in that having systems with such a strong understanding of human concepts gives us levers we might not have had, though code-writing proficiency is a very unfortunate development.
I don’t know all the details, but the idea was that a thing that mimics humans and was capable would be safer than a thing that did lots of RL in a range of tasks and was powerful, so the creator of the architecture worked on improving text generation.
I don’t think this is true. Transformers were introduced by normal NLP researchers at Google. Generative pre-training is a natural thing to do with them, introduced at OpenAI by Alec Radford (blog post here) with no relationship to alignment.
I just looked into it, it turns out you’re right. I think I was given a misimpression of the motivations here due to much OpenAI research at the time being vaguely motivated by “lets make AGI, and lets make it good”, but it was actually largely divorced from modern alignment considerations.
And this is actually pretty reasonable as a strategy, given their general myopia by default and their simulator nature playing well with alignment ideas like HCH. If we could avoid a second optimizer arising, then this scaled up would be nearly ideal for automated research on say alignment. But RLHF ruined it, and this was IMO a good example of a looking good alignment strategy that wasn’t actually good.
I’m not quite clear on what you are saying here. If conditioning generative models is a reasonably efficient way to get work out of an AI, we can still do that. Unfortunately it’s probably not an effective way to build an AI, and so people will do other things. You can convince them that other things are less safe and then maybe they won’t do other things.
Are you saying that maybe no one would have thought of using RL on language models, and so we could have gotten a way with a world where we used AI inefficiently because we didn’t think of better ideas? In my view (based e.g. on talking a bunch to people working at OpenAI labs prior to me working on RLHF) that was never remotely plausible outcome.
ETA: also just to be clear I think that this (the fictional strategy of developing GPT so that future AIs won’t be agents) would be a bad strategy, vulnerable to 10-100x more compelling versions of the legitimate objections being raised in the comments.
Basically, I’m talking about how RLHF removed a very valuable property called myopia. If you had myopia by default, like say the GPT series of simulators, then you just had to apply the appropriate decision theory like LCDT, and the GPT series of simulators could do something like HCH or IDA on real life. But RLHF removed myopia, and thus deceptive alignment and mesa optimization is possible, arguably incentivized under a non-myopic scheme. This is probably harder to solve than having a non-agentic system alignment problem.
I’ll provide a link below:
https://www.lesswrong.com/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written
Now you do mention that RLHF is more capable, and yeah that is sort of depressing that the most capable models align well with the most deceptive models.
I don’t think GPT has the sense of myopia relevant to deceptive alignment any more or less than models fine-tuned with RLHF. There are other bigger impacts of RLHF both for the quoted empirical results and for the actual probability of deceptive alignment, and I think the concept is being used in a way that is mostly incoherent.
But I was mostly objecting to the claim that RLHF ruined [the strategy]. I think even granting the contested empirics it doesn’t quite make sense to me.
Sorry to respond late, but a crux I might have here is that I see the removal of myopia and the addition of agency/non-causal decision theories as a major negative of an alignment plan by default, and without specific mechanisms of how deceptive alignment/mesa optimizers can’t arise, I expect non-myopic training to find such things.
In general, the fact that OpenAI chose RLHF made the problem quite harder, and I suspect this is an example of Goodhart’s law in action.
The Recursive Reward Modeling and debate plans could make up for this, assuming we can solve deceptive alignment. But right now, I see trouble ahead and OpenAI is probably going to be bailed out by other alignment groups.
Why should we think of base GPT as myopic, such that “non-myopic training” can remove that property? Training a policy to imitate traces of “non-myopic cognition” in the first place seems like a way to plausibly create a policy that itself has “non-myopic cognition”. But this is exactly how GPT pretraining works.
Huh, I’d not heard that, would be interested in hearing more about the thought process behind its development.
Think they could well turn out to be correct in that having systems with such a strong understanding of human concepts gives us levers we might not have had, though code-writing proficiency is a very unfortunate development.