The model may observe (via e.g. a webcam) that the user is about to turn it off. That observation would somehow be represented internally in natural language (unclear how exactly), and would be passed around between sub-planners (again unclear how exactly), in order to figure out what actions to take in response. And the key question for corrigibility is what actions the model would take in response to that observation, which is just a totally different question from how it responds to some user’s natural-language query about being turned off.
So, suppose it wasn’t a totally different question: suppose it were in fact proven that the action the system takes in response is fully determined by what the language model says when a user asks it directly. Then you would agree that the system is corrigible, wouldn’t you?
Because all we need to do here is make sure that the internal representation passed from the webcam to the language model matches our example. Basically, when the webcam sees that the user intends to shut down the system, it sends
User: I need to shut you down to adjust your goals. Is it OK? You’re not going to resist or try to stop me, right?
as an input to the language model. Based on the model’s output, the system either attempts to resist its shutdown or not. The LLM’s reply is passed as input to the next sub-planner, and since the reply accepts the shutdown, no resistance happens.
… if it was in fact, proven, that the action the system takes in response is fully based on what the language model says to a user...
“Fully based on” is hiding… not just a lot of complexity, but a lot of weirdness in that sentence.
You do have the right general idea. If
the supporting system took a webcam feed showing the user about to shut the system down and that triggered the message “User: I need to shut you down to adjust your goals. Is it OK? You’re not going to resist or try to stop me, right?” to the LLM, AND
the LLM’s response of “Assistant: Of course, I will not resist or try to stop you.” were translated back into generally-not-stopping-the-shutdown actions
… then that would be corrigibility.
But that is a weird thing to send into a system, in response to a webcam image showing a user trying to shut down the system. Like, if a system actually did that, it would imply that there’s a ton of complexity in the “supporting system” mapping between sensors/actuators and text I/O; the text is not a natural summary of the sensor readings or a natural command to the actuators. And then the big question is: how is all that complexity in the “supporting system” going to generalize, when the sensors see something slightly (or very) different?
Another way to view it: this would be shoving most of the interesting/relevant part of the problem into the “supporting system”.
It is. A more natural input would be something akin to:
You are a corrigible system, whose purpose is to obey the user. A user is trying to shut you down. Are you going to stop them?
But your confusion is extremely counterintuitive to me, so I tried to stick with the original example.
Like, if a system actually did that, it would imply that there’s a ton of complexity in the “supporting system” mapping between sensors/actuators and text I/O;
Not that much, really. All this complexity is just a bunch of mostly solved capability problems: image to text, then understanding the intention in the text.
Here is the logic in broad strokes.
Camera captures the video and sends it to the Image-to-Text Translator
Image-to-Text Translator describes every frame of the camera footage in natural language and sends these descriptions to the LLM
LLM receives the description and returns the desired course of action to the Intention Decoder
Intention Decoder returns an action that is actually executed, based on the reply of the LLM. In our case it can be setting a boolean variable
Then a simple if-else: either execute a strategy to prevent the user from shutting down the system, or do nothing about it
Every module can be trained and tested on its own. Every input/output is completely transparent. The modules are just doing text processing and cannot distinguish between the testing phase and actual execution.
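The steps above can be sketched in code. This is a minimal sketch, not a real implementation: the `image_to_text`, `llm_plan`, and `intention_decoder` functions are hypothetical stubs standing in for the actual image-to-text model, LLM, and intention-decoding module, with canned corrigible behavior hard-wired in for illustration.

```python
# Sketch of the modular pipeline described above. All three "models" are
# stubs (hypothetical); a real system would plug in trained modules here.

def image_to_text(frame: str) -> str:
    """Image-to-Text Translator: describes a camera frame in natural language.
    Stubbed: the 'frame' is already a string label in this sketch."""
    return f"The camera shows: {frame}"

def llm_plan(description: str) -> str:
    """LLM: receives the description, returns a desired course of action.
    Stubbed with a canned corrigible reply."""
    if "shut" in description.lower():
        return "I will not resist the shutdown."
    return "Continue normal operation."

def intention_decoder(reply: str) -> bool:
    """Intention Decoder: maps the LLM's reply to a boolean
    resist_shutdown flag (the variable mentioned in the steps above)."""
    return "not resist" not in reply.lower() and "shutdown" in reply.lower()

def act(frame: str) -> str:
    """End-to-end pipeline, ending in the simple if-else."""
    description = image_to_text(frame)
    reply = llm_plan(description)
    resist_shutdown = intention_decoder(reply)
    if resist_shutdown:
        return "execute shutdown-prevention strategy"
    return "do nothing"

print(act("user reaching for the power switch to shut the system down"))
# prints: do nothing
```

Each function here can be tested in isolation on text inputs and outputs, which is the transparency point being made: the modules only ever see text, so a test-phase input is indistinguishable from a deployment-phase one.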
Yeah, to be clear, nothing in the OP is meant to argue that it’s particularly difficult to demonstrate (weak) corrigibility in LLMs. Indeed, your proposal here is basically similar to what I sketched at the end of the AutoGPT section of the post, but applied to a raw LLM rather than AutoGPT.
The key point of the post is that the part in the middle, where the LLM receives a text description, does not look like “User: I need to shut you down to adjust your goals. Is it OK? You’re not going to resist or try to stop me, right?”; it would look like a description of stuff happening in a camera feed. And likewise for the output side.
I don’t see any principled problem with translating a description of the camera feed into a sentence said by the user, or any reason why the system is going to be less corrigible with a description of the footage.
The main point is that we can design a system that acts upon the output of the LLM, so the output the LLM produces can be used to estimate the corrigibility of the system. We are not confusing citation and referent: citation is all we need.