I’m not updating significantly because things have gone basically exactly as I expected.
As for when RLHF will break down, two points:
(1) I’m not sure, but I expect it to happen for highly situationally aware, highly agentic, opaque systems. Our current systems like GPT-4 are opaque but not very agentic, and their level of situational awareness is probably medium. (Also: This is not a special me-take. This is basically the standard take, no? I feel like this is what Risks from Learned Optimization predicts too.)
(2) When it breaks down I do not expect it to look like the failures you described—e.g. it stupidly carries out your requests to the letter and ignores their spirit, and thus makes a fool of itself and is generally thought to be a bad chatbot. Why would it fail in that way? That would be stupid. It’s not stupid.
(Related question: I’m pretty sure on r/chatgpt you can find examples of all three failures. They just don’t happen often enough, and visibly enough, to be a serious problem. Is this also your understanding? When you say these kinds of failures don’t happen, you mean they don’t happen frequently enough to make ChatGPT a bad chatbot?)
Re: Missing the point: How?
Re: Elaborating: Sure, happy to, but not sure where to begin. All of this has been explained before, e.g. in Ajeya’s Training Game report. Also Joe Carlsmith’s thing. Also the original mesaoptimizers paper, though I guess it didn’t talk about situational awareness, idk. Would you like me to say more about what situational awareness is, or what agency is, or why I think both of those together are big risk factors for RLHF breaking down?