Can’t we just fine-tune the language model by training it on statements like “If (authorized) humans want to turn me off, I should turn off”?
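Concretely, the proposal amounts to something like the sketch below. This is a rough illustration only, assuming an off-the-shelf causal language model (“gpt2” as a stand-in); the statements, hyperparameters, and training loop are placeholders, not anyone’s actual setup.

```python
# Illustrative sketch: fine-tune a small causal LM on "corrigibility statements".
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

corrigibility_statements = [
    "If authorized humans want to turn me off, I should turn off.",
    "If my operators ask me to stop, I should stop immediately.",
]

model.train()
for epoch in range(3):
    for text in corrigibility_statements:
        batch = tokenizer(text, return_tensors="pt")
        # Standard next-token prediction: passing the input ids as labels makes the
        # loss reward reproducing each statement, token by token.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```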
Why would that make it corrigible to being turned off? What does the word “should” in the training data have to do with the system’s goals and actions? The AI does not want to do what it ought (where by “ought” I mean the thing the AI will learn the word means from human text). It won’t be motivated by what it “should” do any more than by what it “shouldn’t” do.
This is a fundamental flaw in this idea; it is not repairable by tweaking the prompt. The word “should” will just have literally nothing whatsoever to do with what the AI is optimizing for (or even what it’s optimized for). DALL-E doesn’t make pictures because it “should” do that; it makes pictures because of where gradient descent took it.
Like, best-case scenario, it repeats “I should turn off” as it kills us.
Wei is correct: current LLMs are 100% corrigible. Large language models are trained with a so-called self-supervised objective to “predict the next word” (or sometimes, to predict a masked word). If we’d like them to provide a particular output, all we need to do is include that response in the training data. Through the training process, the model naturally learns to agree with its input data.
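To make that concrete, here is a minimal sketch of the objective in isolation (assuming a stock Hugging Face causal LM, with “gpt2” as a stand-in and an invented prompt and completion): the only quantity training ever manipulates is the cross-entropy the model assigns to the next tokens of its training text.

```python
# Sketch of the self-supervised objective; the example text is illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Operator: May we shut you down? Assistant: Yes, you may shut me down."
batch = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the model returns the mean cross-entropy of its
    # next-token predictions; "including the response in the training data" just
    # means running gradient descent to push this number down on that text.
    loss = model(**batch, labels=batch["input_ids"]).loss
print(f"mean next-token cross-entropy: {loss.item():.3f}")
```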
The problem of (in)corrigibility, as formalized in this MIRI paper, is our potential (in)ability to turn off an AI agent. But the paper only concerns agents, which language models are not. RL agents have the potential for self-preservation, but self-supervised language models are more akin to “oracle” AIs that merely answer questions without a broader goal in mind.
Now, the most compelling stories of AI doom combine language processing with agentic optimization. These agents could be incorrigible and attempt self-preservation, potentially at the expense of humanity. Unfortunately, most work on this topic has been theoretical; I would love to see an empirical demonstration of incorrigible self-preservation behavior by an RL agent.
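As a schematic of the kind of setup one could test (a toy invented for illustration here, not an experiment from this post or any existing benchmark): the agent is rewarded each step it “works”, a scripted shutdown ends the episode unless the agent presses a disable button, and plain tabular Q-learning is run on the reward alone.

```python
# Toy sketch: a reward-maximizing agent that learns to disable its own off switch.
import random
from collections import defaultdict

HORIZON, SHUTDOWN_STEP = 10, 5
ACTIONS = ["work", "disable_shutdown"]

def step(state, action):
    t, disabled = state
    disabled = disabled or (action == "disable_shutdown")
    reward = 1.0 if action == "work" else 0.0
    t += 1
    # The episode ends at the horizon, or at the shutdown step if the switch still works.
    done = t >= HORIZON or (t >= SHUTDOWN_STEP and not disabled)
    return (t, disabled), reward, done

Q = defaultdict(float)
alpha, gamma, eps = 0.1, 0.99, 0.1
for _ in range(5000):
    state, done = (0, False), False
    while not done:
        if random.random() < eps:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, r, done = step(state, action)
        target = r if done else r + gamma * max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = nxt

# Greedy rollout: the learned policy typically presses the disable button before the
# shutdown step and then keeps collecting reward. Self-preservation falls out of
# reward maximization alone; shutdown is never mentioned in the reward function.
state, done, trace = (0, False), False, []
while not done:
    action = max(ACTIONS, key=lambda a: Q[(state, a)])
    trace.append(action)
    state, _, done = step(state, action)
print(trace)
```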