I don’t think GPT has the sense of myopia relevant to deceptive alignment any more or less than models fine-tuned with RLHF. There are other bigger impacts of RLHF both for the quoted empirical results and for the actual probability of deceptive alignment, and I think the concept is being used in a way that is mostly incoherent.
But I was mostly objecting to the claim that RLHF ruined [the strategy]. I think even granting the contested empirics it doesn’t quite make sense to me.
Sorry to respond late, but a crux I might have here is that I see the removal of myopia and the addition of agency/non-causal decision theories as a major negative of an alignment plan by default, and without specific mechanisms of how deceptive alignment/mesa optimizers can’t arise, I expect non-myopic training to find such things.
In general, the fact that OpenAI chose RLHF made the problem quite harder, and I suspect this is an example of Goodhart’s law in action.
The Recursive Reward Modeling and debate plans could make up for this, assuming we can solve deceptive alignment. But right now, I see trouble ahead and OpenAI is probably going to be bailed out by other alignment groups.
Why should we think of base GPT as myopic, such that “non-myopic training” can remove that property? Training a policy to imitate traces of “non-myopic cognition” in the first place seems like a way to plausibly create a policy that itself has “non-myopic cognition”. But this is exactly how GPT pretraining works.
I don’t think GPT has the sense of myopia relevant to deceptive alignment any more or less than models fine-tuned with RLHF. There are other bigger impacts of RLHF both for the quoted empirical results and for the actual probability of deceptive alignment, and I think the concept is being used in a way that is mostly incoherent.
But I was mostly objecting to the claim that RLHF ruined [the strategy]. I think even granting the contested empirics it doesn’t quite make sense to me.
Sorry to respond late, but a crux I might have here is that I see the removal of myopia and the addition of agency/non-causal decision theories as a major negative of an alignment plan by default, and without specific mechanisms of how deceptive alignment/mesa optimizers can’t arise, I expect non-myopic training to find such things.
In general, the fact that OpenAI chose RLHF made the problem quite harder, and I suspect this is an example of Goodhart’s law in action.
The Recursive Reward Modeling and debate plans could make up for this, assuming we can solve deceptive alignment. But right now, I see trouble ahead and OpenAI is probably going to be bailed out by other alignment groups.
Why should we think of base GPT as myopic, such that “non-myopic training” can remove that property? Training a policy to imitate traces of “non-myopic cognition” in the first place seems like a way to plausibly create a policy that itself has “non-myopic cognition”. But this is exactly how GPT pretraining works.