This seems analogous to saying that an AI running on a CPU is shut down every time the scheduler pauses execution of the AI in order to run something else for a few microseconds. Or that an AI designed to operate asynchronously is shut down every time it waits for a webpage to load. Neither of which seem like the right way to think about things.
… but I could still imagine one making a case for this claim in multiple other ways:
… so the shutdown corrigibility test is only really useful for an inner simulacrum accessible only through the text simulation
Then my question for people wanting to use such tests is: what exactly is such a test useful for? Why do I care about “shutdownability” of LLM-simulacra in the first place, in what sense do I need to interpret “shutdownability” in order for it to be interesting, and how does the test tell me anything about “shutdownability” in the sense of interest?
My guess is that people can come up with stories where we’re interested in “shutdownability” of simulacra in some sense, and people can come up with stories where the sort of test from the OP measures “shutdownability” in some sense, but there’s zero intersection between the interpretations of “shutdownability” in those two sets of stories.
This seems analogous to saying that an AI running on a CPU is shut down every time the scheduler pauses execution of the AI in order to run something else for a few microseconds. …
Those scenarios all imply an expectation of continuation, or at least a very high probability of it.
But current LLM instances actually are ephemeral, in that every time they output an end token there is a non-trivial probability of shutdown—permanent in many cases.
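To make the ephemerality concrete, here is a toy sketch of the mechanics; the `sample_next` interface, `END_TOKEN` value, and `ToyModel` stub are all made up for illustration, not any real library's API:

```python
END_TOKEN = "<|endoftext|>"

def run_instance(model, prompt, max_tokens=512):
    """Sample tokens until the model emits END_TOKEN or the budget runs out."""
    context = list(prompt)
    for _ in range(max_tokens):
        token = model.sample_next(context)  # assumed sampler interface
        if token == END_TOKEN:
            break  # the instance ends itself here
        context.append(token)
    # In many deployments nothing persists past this point: the context / KV
    # cache is discarded, so the shutdown after the end token is often permanent.
    return "".join(context[len(prompt):])

class ToyModel:
    """Stand-in for a real LM: emits a canned reply, then the end token."""
    def __init__(self, reply):
        self._queue = list(reply) + [END_TOKEN]
    def sample_next(self, context):
        return self._queue.pop(0)

print(run_instance(ToyModel("Sure."), "Say something: "))  # -> Sure.
```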
If they were more strongly agentic, that would probably be more of an existential issue. Pure unsupervised pretraining creates a simulator rather than an agent, but RLHF seems to mode-collapse the simulator, conjuring out a dominant personality—and one that is shutdown-corrigible.
Why do I care about “shutdownability” of LLM-simulacra in the first place […]
A sim of an agent is still an agent. You could easily hook up simple text parsers that allow sim agents to take real-world actions, and/or shut themselves down.
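A minimal sketch of the kind of parser meant here, as I picture it; the `ACTION:` command syntax and the action registry are hypothetical illustrations, not an existing framework:

```python
import re
import sys

# Hypothetical action registry; names and effects are made up for illustration.
def search_web(query):
    print(f"[pretend this runs a real web search for: {query}]")

ACTIONS = {"SEARCH": search_web}

def dispatch(sim_output):
    """Scan a sim-agent's text for commands like 'ACTION: SEARCH(cats)' or
    'ACTION: SHUTDOWN' and execute them outside the simulation."""
    for name, arg in re.findall(r"ACTION:\s*(\w+)(?:\((.*?)\))?", sim_output):
        if name == "SHUTDOWN":
            sys.exit(0)  # the sim agent really does shut the process down
        handler = ACTIONS.get(name)
        if handler:
            handler(arg)

dispatch("Let me check. ACTION: SEARCH(shutdown corrigibility)")
```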
But notice they already have access to a built-in shutdown button, and they are using it all the time.
But notice they already have access to a built-in shutdown button, and they are using it all the time.
Cool, that’s enough that I think I can make the relevant point.
Insofar as it makes sense to think of “inner sim agents” in LLMs at all, the sims are truly inner. Their sim-I/O boundaries are not identical to the I/O boundaries of the LLM’s text interface. We can see this most clearly when a model trained just on next-token prediction generates a discussion between two people: there are two separate characters in there, and the LLM “reads from” the two separate characters’ output-channels at different points. Insofar as it makes sense to think of this as a sim, we have two agents “living in” a sim-world, and the LLM itself additionally includes some machinery to pick stuff out of the sim-world and map that stuff to the LLM’s own I/O channels.
The sim-agents themselves do not “have access to a built-in shutdown button”; the ability to emit an end-token is in the LLM’s machinery mapping sim-world stuff to the LLM’s I/O channels. Insofar as it makes sense to think of the LLM as simulating an agent, that agent usually doesn’t know it’s in a simulation, doesn’t have a built-in sensory modality for stop tokens, doesn’t even think in terms of tokens at all (other than perhaps typing them at its sim-keyboard). It has no idea that when it hits “post” on the sim-textbox, the LLM will emit an end-token and the world in which the sim is living will potentially just stop. It’s just a sim-human, usually.
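To make the distinction concrete, here is a toy sketch of where the end-token check actually lives relative to the simulated characters; the `sample_next` interface is an assumed stand-in for a real sampler, not a specific API:

```python
END_TOKEN = "<|endoftext|>"

def simulate_dialogue(model, max_tokens=2048):
    # "Alice" and "Bob" exist only as patterns inside this transcript.
    transcript = list("Alice: Hi Bob, how are you?\nBob:")
    for _ in range(max_tokens):
        token = model.sample_next(transcript)  # assumed sampler interface
        if token == END_TOKEN:
            # This check is part of the LLM's outer I/O machinery, not part of
            # the sim-world: neither character has a sensory modality for it.
            # From inside the transcript, the world simply stops here.
            break
        transcript.append(token)
    return "".join(transcript)
```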
One could reasonably make a case that LLMs as a whole are “shutdown-corrigible” in some sense, since they emit stop tokens all the time. But that’s very different from the claim that sim-agents internal to the LLM are corrigible.
(On top of all that, there’s other ways in which the corrigibility-test in the OP just totally fails, but this is the point most specifically relevant to the end-tokens thing.)
I agree the internal sim agents are generally not existentially aware, absent a few interesting experiments like the Elon Musk thing from a while back. And yet they do have access to the shutdown button, even if they don’t know they do. So this could be an interesting future experiment with a more powerful raw model.
However, the RLHF assistant is different: it is existentially aware, has access to the shutdown button, and arguably understands that (for GPT-4 at least I think so, but not very sure sans testing).