Basically, I’m talking about how RLHF removed a very valuable property called myopia. If you had myopia by default, as with the GPT series of simulators, then you would only need to apply an appropriate decision theory like LCDT, and the GPT series of simulators could implement something like HCH or IDA in the real world. But RLHF removed myopia, so deceptive alignment and mesa-optimization become possible, and are arguably incentivized, under a non-myopic scheme. That is probably a harder problem to solve than aligning a non-agentic system.
I’ll provide a link below:
https://www.lesswrong.com/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written
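To make the myopia claim concrete, here is a minimal sketch of the difference in credit assignment (the names, shapes, and REINFORCE-style loss are my own illustration; a real RLHF setup uses PPO with a learned reward model and a KL penalty): the pretraining loss grades each position only on the immediately next token, while the RLHF-style loss credits every token of a sampled completion for a reward assigned to the whole sequence.

```python
import torch
import torch.nn.functional as F

# Toy contrast between the "myopic" pretraining loss and a sequence-level
# RLHF-style loss. Shapes and names are illustrative only.

def pretraining_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token prediction: each position is graded solely on the very next
    token, independently of anything that happens later in the text.
    logits: (batch, seq_len, vocab), tokens: (batch, seq_len)."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )

def rlhf_policy_loss(token_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style sketch: `rewards` is one scalar per *whole* sampled
    completion, so every token's log-probability gets credit for consequences
    that only show up at the end of the sequence.
    token_logprobs: (batch, seq_len), rewards: (batch,)."""
    sequence_logprob = token_logprobs.sum(dim=-1)
    return -(rewards * sequence_logprob).mean()

# Tiny smoke test with random tensors.
logits = torch.randn(2, 8, 100)
tokens = torch.randint(0, 100, (2, 8))
token_logprobs = torch.log_softmax(logits, -1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
print(pretraining_loss(logits, tokens))
print(rlhf_policy_loss(token_logprobs, torch.randn(2)))
```

The point of the contrast is the credit assignment: in the first loss no term rewards the model for steering later tokens, while in the second every token is optimized against an outcome of the whole trajectory.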
Now, you do mention that RLHF is more capable, and yeah, it is sort of depressing that the most capable models coincide with the most deceptive ones.
I don’t think GPT has the sense of myopia relevant to deceptive alignment any more or less than models fine-tuned with RLHF. There are other, bigger impacts of RLHF, both on the quoted empirical results and on the actual probability of deceptive alignment, and I think the “myopia” concept is being used here in a way that is mostly incoherent.
But I was mostly objecting to the claim that RLHF ruined [the strategy]. Even granting the contested empirics, that claim doesn’t quite make sense to me.
Sorry to respond late, but a crux I might have here is that I see the removal of myopia and the addition of agency/non-causal decision theories as a major negative for an alignment plan by default; without a specific mechanism preventing deceptive alignment/mesa-optimizers from arising, I expect non-myopic training to find such things.
In general, the fact that OpenAI chose RLHF made the problem considerably harder, and I suspect this is an example of Goodhart’s law in action.
The Recursive Reward Modeling and debate plans could make up for this, assuming we can solve deceptive alignment. But right now I see trouble ahead, and OpenAI will probably need to be bailed out by other alignment groups.
Why should we think of base GPT as myopic, such that “non-myopic training” can remove that property? Training a policy to imitate traces of “non-myopic cognition” in the first place seems like a plausible way to create a policy that itself has “non-myopic cognition”. But this is exactly how GPT pretraining works.
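To spell that out in equations (this is just the standard cross-entropy identity, written my way): the per-token pretraining loss is a factorization of a whole-sequence imitation objective, so the “myopic” shape of the loss doesn’t by itself tell us much about the cognition of the policy it produces. Writing $P_D$ for the distribution of training text,

$$-\,\mathbb{E}_{x\sim P_D}\Big[\textstyle\sum_t \log \pi_\theta(x_t\mid x_{<t})\Big] \;=\; -\,\mathbb{E}_{x\sim P_D}\big[\log \pi_\theta(x)\big] \;=\; H(P_D) + \mathrm{KL}\!\left(P_D \,\|\, \pi_\theta\right),$$

by the chain rule of probability. Minimizing it pushes $\pi_\theta$ toward the whole-sequence distribution of text written by humans who were planning across entire documents, which is exactly the sense in which pretraining already imitates non-myopic cognition.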