Thanks, I overlooked this and it makes sense to me. However, I’m not as certain about your last sentence:
“and would also have corrigibility problems that were never present in ChatGPT because ChatGPT was never trying to navigate the real world.”
I agree with the idea of “steering the trajectory,” and it’s a possibility we must consider. Still, if we train the robot to emit the “Shut Down” token whenever it hears “Hi RobotGPT, please shut down,” I don’t see why that wouldn’t work.
It seems to me that we’re comparing a second-order effect with a first-order effect.