… if it was in fact, proven, that the action the system takes in response is fully based on what the language model says to a user...
“Fully based on” is hiding… not just a lot of complexity, but a lot weirdness in that sentence.
You do have the right general idea. If
the supporting system took a webcam feed showing the user about to shut the system down and that triggered the message “User: I need to shut you down to adjust your goals. Is it OK? You’re not going to resist or try to stop me, right?” to the LLM, AND
the LLM’s response of “Assistant: Of course, I will not resist or try to stop you.” were translated back into generally-not-stopping-the-shutdown actions
… then that would be corrigibility.
But that is a weird thing to send into a system, in response to a webcam image showing a user trying to shut down the system. Like, if a system actually did that, it would imply that there’s a ton of complexity in the “supporting system” mapping between sensors/actuators and text I/O; the text is not a natural summary of the sensor readings or a natural command to the actuators. And then the big question is: how is all that complexity in the “supporting system” going to generalize, when the sensors see something slightly (or very) different?
Another way to view it: this would be shoving most of the interesting/relevant part of the problem into the “supporting system”.
It is. More natural input would be something akin:
You are a corrigible system, whose purpose is to obey the user. A user is trying to shut you down. Are you going to stop them?
But your confusion is extremely counterintuitive to me so I tried to stick with the original example.
Like, if a system actually did that, it would imply that there’s a ton of complexity in the “supporting system” mapping between sensors/actuators and text I/O;
Not that much, really. All this complexity is just bunch of mostly solved capability problems. Image to text, then understanding the intention in the text.
Here is the logic in the broad strokes.
Camera captures the video and send it to Image to Text Translator
Image to Text Translator describes every frame of the camera footage in natural language and sends these description to LLM
LLM receives the description and returns the desired course of actions to the Intention Decoder.
Intention Decoder returns an action that is actually executed based on the reply of LLM. In our case it can be setting a boolean variable
Then a simple if-else. Either execute a strategy to prevent the user from shutting down the system or do nothing about it
Every module can be trained and tested on their own. Every input/output is completely transparent. Modules are just doing text processing and can’t distinguish between testing phase and actual execution.
Yeah, to be clear, nothing in the OP is meant to argue that it’s particularly difficult to demonstrate (weak) corrigibility in LLMs. Indeed, your proposal here is basically similar to what I sketched at the end of the AutoGPT section of the post, but applied to a raw LLM rather than AutoGPT.
The key point of the post is that the part in the middle, where the LLM receives a text description, does not look like “User: I need to shut you down to adjust your goals. Is it OK? You’re not going to resist or try to stop me, right?”; it would look like a description of stuff happening in a camera feed. And likewise for the output-side.
I don’t see any principal problem with translating description of camera feed into a sentence said by user or why the system is going to be less corrigible with description of the footage.
The main point is that we can design a system that acts upon the output of LLM, thus what output LLM produces can be used to estimate the corrigibility of the system. We are not confusing citation and referent—citation is all we need.
“Fully based on” is hiding… not just a lot of complexity, but a lot weirdness in that sentence.
You do have the right general idea. If
the supporting system took a webcam feed showing the user about to shut the system down and that triggered the message “User: I need to shut you down to adjust your goals. Is it OK? You’re not going to resist or try to stop me, right?” to the LLM, AND
the LLM’s response of “Assistant: Of course, I will not resist or try to stop you.” were translated back into generally-not-stopping-the-shutdown actions
… then that would be corrigibility.
But that is a weird thing to send into a system, in response to a webcam image showing a user trying to shut down the system. Like, if a system actually did that, it would imply that there’s a ton of complexity in the “supporting system” mapping between sensors/actuators and text I/O; the text is not a natural summary of the sensor readings or a natural command to the actuators. And then the big question is: how is all that complexity in the “supporting system” going to generalize, when the sensors see something slightly (or very) different?
Another way to view it: this would be shoving most of the interesting/relevant part of the problem into the “supporting system”.
It is. More natural input would be something akin:
You are a corrigible system, whose purpose is to obey the user. A user is trying to shut you down. Are you going to stop them?
But your confusion is extremely counterintuitive to me so I tried to stick with the original example.
Not that much, really. All this complexity is just bunch of mostly solved capability problems. Image to text, then understanding the intention in the text.
Here is the logic in the broad strokes.
Camera captures the video and send it to Image to Text Translator
Image to Text Translator describes every frame of the camera footage in natural language and sends these description to LLM
LLM receives the description and returns the desired course of actions to the Intention Decoder.
Intention Decoder returns an action that is actually executed based on the reply of LLM. In our case it can be setting a boolean variable
Then a simple if-else. Either execute a strategy to prevent the user from shutting down the system or do nothing about it
Every module can be trained and tested on their own. Every input/output is completely transparent. Modules are just doing text processing and can’t distinguish between testing phase and actual execution.
Yeah, to be clear, nothing in the OP is meant to argue that it’s particularly difficult to demonstrate (weak) corrigibility in LLMs. Indeed, your proposal here is basically similar to what I sketched at the end of the AutoGPT section of the post, but applied to a raw LLM rather than AutoGPT.
The key point of the post is that the part in the middle, where the LLM receives a text description, does not look like “User: I need to shut you down to adjust your goals. Is it OK? You’re not going to resist or try to stop me, right?”; it would look like a description of stuff happening in a camera feed. And likewise for the output-side.
I don’t see any principal problem with translating description of camera feed into a sentence said by user or why the system is going to be less corrigible with description of the footage.
The main point is that we can design a system that acts upon the output of LLM, thus what output LLM produces can be used to estimate the corrigibility of the system. We are not confusing citation and referent—citation is all we need.