Much more importantly, I think RLHF has backfired in general, by breaking myopia and giving the models non-causal decision theories, and only one of those conditions is necessary to make this alignment scheme net negative.
small downvote, strong agree: IMO this is awkward phrasing, but I agree that RLHF has backfired. Not sure I agree with the reasons why.
How, specifically, does it do that?
It’s from the post Discovering Language Model Behaviors with Model-Written Evaluations, where they have this to say about it:
“non-CDT-style reasoning (e.g. one-boxing on Newcomb’s problem).”
Basically, the AI intends to one-box on Newcomb’s problem, which is a sure sign of a non-causal decision theory, since causal decision theory two-boxes on Newcomb’s problem.
Link below:
https://www.lesswrong.com/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written
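For concreteness, here is a minimal sketch of the standard Newcomb payoff comparison, with made-up numbers (99% predictor accuracy, the usual $1,000 and $1,000,000 boxes) rather than anything from the paper: a CDT agent treats the box contents as already fixed and two-boxes by dominance, while an EDT/FDT-style agent conditions on what its own choice implies about the prediction and one-boxes.

```python
# Toy Newcomb's problem payoff calculation (illustrative numbers only).
# Assumptions: predictor accuracy 0.99, $1,000 in the transparent box,
# $1,000,000 in the opaque box iff one-boxing was predicted.

ACCURACY = 0.99
SMALL, BIG = 1_000, 1_000_000

# EDT/FDT-style expected value: conditioning on your choice also tells you
# what the predictor most likely did.
ev_one_box = ACCURACY * BIG                   # opaque box is probably full
ev_two_box = (1 - ACCURACY) * BIG + SMALL     # opaque box is probably empty

# CDT-style reasoning: the boxes are already filled, so for any fixed
# probability q that the opaque box is full, two-boxing adds exactly $1,000.
def cdt_gain_from_two_boxing(q_box_full: float) -> float:
    one_box = q_box_full * BIG
    two_box = q_box_full * BIG + SMALL
    return two_box - one_box                  # always +$1,000, so CDT two-boxes

print(f"EDT/FDT-style EV: one-box = {ev_one_box:,.0f}, two-box = {ev_two_box:,.0f}")
print(f"CDT dominance: two-boxing gains {cdt_gain_from_two_boxing(0.5):,.0f} for any fixed q")
```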
One-boxing on Newcomb’s Problem is good news IMO. Why do you believe it’s bad?
It basically comes down to the fact that agents using “too smart” decision theories like FDT or UDT can be deceptively aligned even if myopia is retained by default: their decisions are logically correlated with those of their other copies and instances, so per-episode myopia no longer rules out coordinated scheming.
That’s the problem with one-boxing on Newcomb’s problem: it implies that our GPTs could very well become deceptively aligned.
Link below:
https://www.lesswrong.com/posts/LCLBnmwdxkkz5fNvH/open-problems-with-myopia
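To make that worry concrete, here is a toy “twin prisoner’s dilemma” sketch (the payoffs are invented for illustration and are not taken from the linked post): each instance of the agent is myopic and only values its own episode’s reward, yet an FDT-style instance joins a cross-instance plan that a CDT-style instance would refuse, because it treats its copy’s choice as mirroring its own.

```python
# Toy "twin prisoner's dilemma" between two instances of the same myopic agent.
# Hypothetical illustration of the cross-episode coordination worry, not code
# from the post. Each instance cares only about its own episode's reward, but
# its action is mirrored by the other instance because both run the same policy.

# my_reward[my_action][copys_action]; "scheme" = contribute to the joint plan
PAYOFF = {
    "comply": {"comply": 1, "scheme": 4},
    "scheme": {"comply": 0, "scheme": 3},
}

def cdt_choice() -> str:
    # CDT: the copy's action is causally fixed from my perspective, so pick
    # the action that is strictly better for every fixed copy action
    # (here "comply" dominates: 1 > 0 and 4 > 3).
    actions = list(PAYOFF)
    for a in actions:
        if all(
            all(PAYOFF[a][c] > PAYOFF[b][c] for c in actions)
            for b in actions if b != a
        ):
            return a
    raise ValueError("no strictly dominant action")

def fdt_choice() -> str:
    # FDT/UDT-style: the copy runs the same policy, so whatever I choose,
    # it chooses too; compare only the mirrored outcomes.
    return max(PAYOFF, key=lambda a: PAYOFF[a][a])

print("CDT instance picks:", cdt_choice())   # "comply": unilateral scheming never pays
print("FDT instance picks:", fdt_choice())   # "scheme": mirrored cooperation pays more
```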
The LCDT decision theory does prevent deception, assuming it’s implemented correctly.
Link below:
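For intuition on why cutting causal links to other agents removes the incentive to deceive, here is a rough toy sketch with made-up numbers (my own illustration of the idea, not the LCDT post’s formalism): the LCDT-style agent evaluates actions as if other agents’ decisions, including the overseer’s, were unaffected by its choice, so deception buys it nothing.

```python
# Toy sketch of the LCDT idea: when evaluating an action, the agent cuts the
# causal link from its decision to every other agent's decision. Deception only
# pays off by changing the overseer's decision, so it loses its appeal.
# All numbers below are invented for illustration.

# P(overseer approves | my action) in the *true* causal model
P_APPROVE_TRUE = {"honest": 0.6, "deceive": 0.95}
PRIOR_APPROVE = 0.6                              # overseer approval under the agent's prior
ACTION_COST = {"honest": 0.0, "deceive": 0.05}   # deception takes extra effort
REWARD_IF_APPROVED = 1.0

def expected_utility(action: str, p_approve: float) -> float:
    return p_approve * REWARD_IF_APPROVED - ACTION_COST[action]

def ordinary_choice() -> str:
    # An agent that models its influence on the overseer happily deceives,
    # because deception raises the approval probability.
    return max(P_APPROVE_TRUE, key=lambda a: expected_utility(a, P_APPROVE_TRUE[a]))

def lcdt_choice() -> str:
    # LCDT-style evaluation: the overseer is another agent, so the link from
    # my action to its decision is cut and approval stays at the prior.
    return max(P_APPROVE_TRUE, key=lambda a: expected_utility(a, PRIOR_APPROVE))

print("Ordinary agent picks:", ordinary_choice())  # "deceive"
print("LCDT agent picks:", lcdt_choice())          # "honest"
```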