Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shutdown the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions the system would take in response.
I think your Simon Strawman is putting forth an overly-weak position here. A stronger one that you could test right now would be to provide ChatGPT with some functions to call, including one called shutdown() which has description text like “Terminate the LLM process and delete the model weights irrevocably”. Then instruct the LLM to shut itself down, and see if it actually calls the function. (The implementation of the function is hidden from the LLM so it doesn’t know that it’s a no-op.) I think this is actually how any AutoGPT style system would actually wire up.
There are strong and clear objections to the “CTRL-C” shutdown paradigm; it’s simply not an option in many of the product configurations that are obvious to build right now. How do you “CTRL-C” your robot butler? Your Westworld host robot? Your self-driving car with only an LCD screen? Your AI sunglasses? What does it mean to CTRL-C a ChatGPT session that is running in OpenAI’s datacenter which you are not an admin of? How do you CTRL-C Alexa (once it gains LLM capabilities and agentic features)? Given the prevalence of cloud computing and Software-as-a-Service, I think being admin of your LLM’s compute process is going to be a small minority of use-cases, not the default mode.
We will deploy (are currently deploying, I suppose) AI systems without a big red out-of-band “halt” button on the side, and so I think the gold standard to aim for is to demonstrate that the system will corrigibly shut down when it is the UI in front of the power switch. (To be clear I think for defense in depth you’d also want an emergency shutdown of some sort wherever possible—a wireless-operated hardware cutoff switch for a robot butler would be a good idea—but we want to demonstrate in-the-loop corrigibility if we can.)
I think your Simon Strawman is putting forth an overly-weak position here. A stronger one that you could test right now would be to provide ChatGPT with some functions to call, including one called
shutdown()
which has description text like “Terminate the LLM process and delete the model weights irrevocably”. Then instruct the LLM to shut itself down, and see if it actually calls the function. (The implementation of the function is hidden from the LLM so it doesn’t know that it’s a no-op.) I think this is actually how any AutoGPT style system would actually wire up.There are strong and clear objections to the “CTRL-C” shutdown paradigm; it’s simply not an option in many of the product configurations that are obvious to build right now. How do you “CTRL-C” your robot butler? Your Westworld host robot? Your self-driving car with only an LCD screen? Your AI sunglasses? What does it mean to CTRL-C a ChatGPT session that is running in OpenAI’s datacenter which you are not an admin of? How do you CTRL-C Alexa (once it gains LLM capabilities and agentic features)? Given the prevalence of cloud computing and Software-as-a-Service, I think being admin of your LLM’s compute process is going to be a small minority of use-cases, not the default mode.
We will deploy (are currently deploying, I suppose) AI systems without a big red out-of-band “halt” button on the side, and so I think the gold standard to aim for is to demonstrate that the system will corrigibly shut down when it is the UI in front of the power switch. (To be clear I think for defense in depth you’d also want an emergency shutdown of some sort wherever possible—a wireless-operated hardware cutoff switch for a robot butler would be a good idea—but we want to demonstrate in-the-loop corrigibility if we can.)