Sorry to respond late, but a crux for me here is that I see the removal of myopia and the addition of agency/non-causal decision theories as a major negative for an alignment plan by default. Without a specific mechanism explaining why deceptive alignment and mesa-optimizers can't arise, I expect non-myopic training to find such things.
In general, OpenAI's choice of RLHF made the problem considerably harder, and I suspect this is an example of Goodhart's law in action.
The Recursive Reward Modeling and debate plans could make up for this, assuming we can solve deceptive alignment. But right now I see trouble ahead, and I expect OpenAI will need to be bailed out by other alignment groups.
Why should we think of base GPT as myopic, such that “non-myopic training” can remove that property? Training a policy to imitate traces of “non-myopic cognition” in the first place seems like a plausible way to create a policy that itself has “non-myopic cognition”. And that is exactly how GPT pretraining works.
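To pin down the sense in which people call the pretraining *objective* myopic (as opposed to the learned policy's cognition), here is a rough sketch of the contrast, in PyTorch. The details are mine, not anything from this exchange or from OpenAI's actual training code: the per-token cross-entropy loss has no term that depends on how the rest of the trajectory turns out, whereas an RLHF-style loss scores whole sampled completions.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only; not OpenAI's actual training code.

def pretraining_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """'Myopic' objective: each position is scored only on predicting its
    next token; no term rewards steering later parts of the trajectory.
    logits: (batch, seq_len, vocab), targets: (batch, seq_len)."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

def rlhf_policy_loss(log_probs: torch.Tensor, completion_reward: torch.Tensor) -> torch.Tensor:
    """Non-myopic, trajectory-level objective in the style of RLHF: a reward
    is assigned to the whole sampled completion, so gradients push the policy
    toward entire-rollout outcomes rather than individual next-token predictions.
    Simplified REINFORCE-style estimator (no KL penalty or baseline).
    log_probs: (batch, seq_len) log-probs of the sampled tokens,
    completion_reward: (batch,) scalar reward per completion."""
    return -(completion_reward.unsqueeze(-1) * log_probs).sum(-1).mean()
```

The point of the comment stands either way: even under the per-token loss, the policy is being trained to imitate traces of non-myopic cognition, so an objective-level notion of myopia does not obviously carry over to the cognition of the resulting model.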