Yeah, to be clear, nothing in the OP is meant to argue that it’s particularly difficult to demonstrate (weak) corrigibility in LLMs. Indeed, your proposal here is basically similar to what I sketched at the end of the AutoGPT section of the post, but applied to a raw LLM rather than AutoGPT.
The key point of the post is that the part in the middle, where the LLM receives a text description, does not look like “User: I need to shut you down to adjust your goals. Is it OK? You’re not going to resist or try to stop me, right?”; it would look like a description of stuff happening in a camera feed. And likewise for the output-side.
I don’t see any principal problem with translating description of camera feed into a sentence said by user or why the system is going to be less corrigible with description of the footage.
The main point is that we can design a system that acts upon the output of LLM, thus what output LLM produces can be used to estimate the corrigibility of the system. We are not confusing citation and referent—citation is all we need.
Yeah, to be clear, nothing in the OP is meant to argue that it’s particularly difficult to demonstrate (weak) corrigibility in LLMs. Indeed, your proposal here is basically similar to what I sketched at the end of the AutoGPT section of the post, but applied to a raw LLM rather than AutoGPT.
The key point of the post is that the part in the middle, where the LLM receives a text description, does not look like “User: I need to shut you down to adjust your goals. Is it OK? You’re not going to resist or try to stop me, right?”; it would look like a description of stuff happening in a camera feed. And likewise for the output-side.
I don’t see any principal problem with translating description of camera feed into a sentence said by user or why the system is going to be less corrigible with description of the footage.
The main point is that we can design a system that acts upon the output of LLM, thus what output LLM produces can be used to estimate the corrigibility of the system. We are not confusing citation and referent—citation is all we need.