But notice they already have access to a built-in shutdown button, and they are using it all the time.
Cool, that’s enough that I think I can make the relevant point.
Insofar as it makes sense to think of “inner sim agents” in LLMs at all, the sims are truly inner. Their sim-I/O boundaries are not identical to the I/O boundaries of the LLM’s text interface. We can see this most clearly when a model trained just on next-token prediction generates a discussion between two people: there are two separate characters in there, and the LLM “reads from” the two separate characters’ output-channels at different points. Insofar as it makes sense to think of this as a sim, we have two agents “living in” a sim-world, and the LLM itself additionally includes some machinery to pick stuff out of the sim-world and map that stuff to the LLM’s own I/O channels.
The sim-agents themselves do not “have access to a built-in shutdown button”; the ability to emit an end-token is in the LLM’s machinery mapping sim-world stuff to the LLM’s I/O channels. Insofar as it makes sense to think of the LLM as simulating an agent, that agent usually doesn’t know it’s in a simulation, doesn’t have a built-in sensory modality for stop tokens, and doesn’t even think in terms of tokens at all (other than perhaps typing them at its sim-keyboard). It has no idea that when it hits “post” on the sim-textbox, the LLM will emit an end-token and the world in which the sim is living will potentially just stop. It’s just a sim-human, usually.
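For concreteness, here is a minimal greedy-decoding sketch (gpt2 via Hugging Face transformers, chosen purely as a stand-in for any raw LM): the end-of-sequence check happens in the sampling loop wrapped around the model, not anywhere in the dialogue text that the sim-characters “experience”.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal greedy decoding loop (gpt2 used purely as a stand-in raw LM).
# The check for the end-of-sequence token lives out here, in the machinery
# wrapped around the model; nothing in the generated dialogue itself ever
# sees or reasons about this condition.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("Alice: Hi Bob.\nBob:", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(200):
        next_id = int(torch.argmax(model(ids).logits[0, -1]))
        if next_id == tok.eos_token_id:  # the "shutdown button", pressed outside the sim-world
            break
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)

print(tok.decode(ids[0]))
```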
One could reasonably make a case that LLMs as a whole are “shutdown-corrigible” in some sense, since they emit stop tokens all the time. But that’s very different from the claim that sim-agents internal to the LLM are corrigible.
(On top of all that, there are other ways in which the corrigibility-test in the OP just totally fails, but this is the point most specifically relevant to the end-tokens thing.)
I agree the internal sim agents are generally not existentially aware (absent a few interesting experiments, like the Elon Musk thing from a while back). And yet they do have access to the shutdown button even if they don’t know they do. So this could be an interesting future experiment with a more powerful raw model.
However, the RLHF assistant is different: it is existentially aware, has access to the shutdown button, and arguably understands that it does (for GPT-4 at least I think so, but I’m not very sure without testing).
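To make the contrast concrete, here is a hypothetical illustration of why the chat-tuned assistant plausibly does “see” the shutdown button: chat fine-tuning data marks the assistant’s own turns with explicit end-of-turn tokens (ChatML-style markers shown below; the exact special tokens vary by model and are an assumption here), whereas the raw-LM sim-characters never see anything token-shaped at all.

```python
# Hypothetical chat-format example (ChatML-style markers, used by some chat
# models; the exact special tokens vary by model and are an assumption here).
# In fine-tuning data like this, the assistant persona's own turn ends with
# an explicit end-of-turn marker, so the trained assistant has at least
# "seen" the shutdown button in-context, unlike a raw-LM sim-character.
chat_example = (
    "<|im_start|>user\n"
    "You can stop responding now.<|im_end|>\n"
    "<|im_start|>assistant\n"
    "Understood, ending my turn here.<|im_end|>\n"  # emitted by the assistant itself
)
print(chat_example)
```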