“If we had a robot with the same cognitive performance as ChatGPT, it would be easy to fine-tune it to be corrigible.”
This is false, and the reason may be a bit subtle. Basically, “agency” is not a bare property of programs; it’s a property of how programs interact with their environment. ChatGPT is corrigible relative to the environment of the real world, in which it just sits around outputting text. This is easy because it’s not really an agent relative to the real world! However, ChatGPT is an agent relative to the text environment—it’s trying to steer the text in a preferred direction [1].
A robot that literally had the same cognitive performance as ChatGPT would just move its body in a way that encoded text, not in a way that had any skill at navigating the real world. But a robot with cognitive capabilities analogous to ChatGPT’s, except suited to navigating the real world, would be able to navigate the real world quite well, and would also have corrigibility problems that were never present in ChatGPT, because ChatGPT was never trying to navigate the real world.
AFAIK this is only precisely true for the KL-penalty-regularized version of RLHF, where you can think of the fine-tuned model as strategically spending its limited ability to update the base transition function in order to steer the trajectory toward higher reward. For early-stopping-regularized RLHF you probably get something mathematically messier.
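For concreteness, here is a minimal sketch of the KL-penalty objective this refers to; the notation (base policy \(\pi_0\), trajectory reward \(R\), penalty coefficient \(\beta\)) is my own and not from the original comment:

\[
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\big[R(\tau)\big] \;-\; \beta \, D_{\mathrm{KL}}\!\big(\pi(\tau) \,\|\, \pi_0(\tau)\big),
\qquad
\pi^{*}(\tau) \;\propto\; \pi_0(\tau)\,\exp\!\big(R(\tau)/\beta\big).
\]

The optimum is the base model’s trajectory distribution reweighted by exponentiated reward, with \(\beta\) setting how far the fine-tuned model can move away from the base transition function—one way to read the “limited ability to update” framing above.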
Thanks, I overlooked this and it makes sense to me. However, I’m not as certain about your last sentence:
“and would also have corrigibility problems that were never present in ChatGPT because ChatGPT was never trying to navigate the real world.”
I agree with the idea of “steering the trajectory,” and this is a possibility we must consider. Still, if we train the robot to emit the “Shut Down” token when it hears “Hi RobotGPT, please shut down,” I don’t see why that wouldn’t work.
It seems to me that we’re comparing a second-order effect with a first-order effect.