Wei is correct: current LLMs are 100% corrigible. Large language models are trained with so-called self-supervised objectives to “predict the next word” (or, in some variants, predict a masked word). If we want them to produce a particular output, all we need to do is include that response in the training data. Through the training process, the model naturally learns to imitate its training data.
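To make the "predict the next word" objective concrete, here is a minimal sketch of the standard next-token cross-entropy loss, assuming a PyTorch-style causal language model; the names `model`, `next_token_loss`, and `token_ids` are illustrative, not from any particular codebase.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Self-supervised next-token objective (illustrative sketch).

    `model` is assumed to be any causal LM that maps a batch of token ids
    of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab).
    """
    # Inputs are every token except the last; targets are shifted by one,
    # so the model is trained to predict each token from its prefix.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time steps
        targets.reshape(-1),                  # matching next-token labels
    )
```

Because the loss simply rewards reproducing whatever text appears in the corpus, including a desired response in the training data is, in this framing, enough to make the model produce it.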
The problem of (in)corrigibility, as formalized in this MIRI paper, is our potential (in)ability to turn off an AI agent. But the paper concerns only agents, which language models are not. RL agents raise the possibility of self-preserving behavior, whereas self-supervised language models are more akin to “oracle” AIs that merely answer questions without any broader goal in mind.
Now, the most compelling stories of AI doom combine language processing with agentic optimization. Such agents could be incorrigible and attempt self-preservation, potentially at the expense of humanity. Unfortunately, most work on this topic has been theoretical; I would love to see an empirical demonstration of incorrigible, self-preserving behavior by an RL agent.