The proposed mechanism here is fascinating. You use RLHF to try and avoid damaging answers to certain questions. In doing so, you necessarily reinforce against accurate maps and logical consistency in the general case, unless you do something highly bespoke to prevent this from happening. [...] the AI’s maximizing solution involved not exhibiting proper reasoning as often, which kept happening out of distribution.
I don’t think we should conclude that algorithms like RLHF will “necessarily reinforce against accurate maps and logical consistency in the general case”. RLHF is a pretty crude algorithm, since it updates the model in all directions that credit assignment knows about. Even if feedback against damaging answers must destroy some good reasoning circuits, my guess is that RLHF causes way more collateral damage than necessary, and that future techniques will be able to get the results of RLHF with less impact on reasoning quality.
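To make the “crude” point concrete, here is a minimal, hypothetical sketch of a REINFORCE-style update, a simpler relative of the policy-gradient methods used in RLHF. The toy model, prompt, and reward value are stand-ins I made up, not anyone’s actual pipeline; the only point is that a single scalar preference judgment back-propagates into every parameter credit assignment implicates, not just the circuit that produced the damaging behavior.

```python
# Minimal, hypothetical sketch of a REINFORCE-style update (a simpler relative
# of the policy-gradient methods used in RLHF). The toy model, prompt, and
# reward below are illustrative stand-ins, not a production pipeline.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden, prompt_len = 100, 32, 8

# Toy "policy": embed a fixed-length prompt and score every possible next token.
model = nn.Sequential(
    nn.Embedding(vocab_size, hidden),
    nn.Flatten(),
    nn.Linear(hidden * prompt_len, vocab_size),
)

prompt = torch.randint(0, vocab_size, (1, prompt_len))  # hypothetical tokenized prompt
logits = model(prompt)
dist = torch.distributions.Categorical(logits=logits)
answer = dist.sample()                                  # sampled "answer" token

reward = torch.tensor(-1.0)                             # human judgment: damaging answer
loss = -(reward * dist.log_prob(answer)).sum()          # REINFORCE objective
loss.backward()

# One scalar judgment puts a gradient on a large fraction of the weights,
# not only the ones responsible for what made the answer damaging.
touched = sum((p.grad != 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"{touched} / {total} weights nudged by a single preference label")
```

Running something like this, most of the output head’s weights pick up a nonzero gradient from one label, which is the sense in which the update is indiscriminate; a less crude technique would have to localize the change to the circuits actually responsible for the behavior being penalized.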