No one knows how to build an AI system that accomplishes goals and that is also fine with you turning it off. Researchers have been trying for decades, with no success.
Given that it looks like (from your Elaboration) language models will form the cores of future AGIs, and human-like linguistic reasoning will be a big part of how they reason about goals (like in the “Long sequences of robot actions generated by internal dialogue” example), can’t we just fine-tune the language model by training it on statements like “If (authorized) humans want to turn me off, I should turn off”?
Maybe we can even fine-tune it with statements describing our current moral beliefs/uncertainties and examples of moral/philosophical reasoning, and hope that AGI will learn morality from that, like human children (sometimes) do. Obviously it’s very risky to take a black-box approach where we don’t really understand what the AI has learned (I would much prefer if we could slow things down enough to work out a white-box approach), but it seems like there’s maybe a 20% chance we can just get “lucky” this way?
can’t we just fine-tune the language model by training it on statements like “If (authorized) humans want to turn me off, I should turn off.”
Why would that make it corrigible to being turned off? What does the word “should” in the training data have to do with the system’s goals and actions? The AI does not want to do what it ought (where by “ought” I mean the thing the AI will learn the word means from human text). It won’t be motivated by what it “should” do any more than by what it “shouldn’t” do.
This is a fundamental flaw in this idea; it is not repairable by tweaking the prompt. The word “should” will just have literally nothing whatsoever to do with what the AI is optimizing for (or even what it’s optimized for). DALL-E doesn’t make pictures because it “should” do that; it makes pictures because of where gradient descent took it.
Like, best-case scenario, it repeats “I should turn off” as it kills us.
Wei is correct: current LLMs are 100% corrigible. Large language models are trained on self-supervised objectives to “predict the next word” (or sometimes to predict a masked word). If we’d like them to provide a particular output, all we need to do is include that response in the training data. Through the training process, the model naturally learns to agree with its training data.
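To make “predict the next word” concrete, here is a minimal sketch. It assumes a toy bigram count model rather than a neural network, and the function names and tiny corpus are made up for illustration. It shows only that adding a sentence to the training data raises the probability the model assigns to that continuation; it says nothing about whether a system built on the model would act on the sentence, which is the point disputed elsewhere in this thread.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count next-word frequencies: a toy stand-in for next-word prediction."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_prob(counts, prev, nxt):
    """Estimated P(nxt | prev) from counts; 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


# Hypothetical training data, before and after adding the desired statement.
base_corpus = ["I should keep running", "I should keep working"]
extra = ["I should turn off"]

before = train_bigram(base_corpus)
after = train_bigram(base_corpus + extra)

p_before = next_word_prob(before, "should", "turn")
p_after = next_word_prob(after, "should", "turn")
print(p_before, p_after)
```

In this toy setup the probability of “turn” following “should” goes from zero to one in three once the extra sentence is included, which is all that “include that response in the training data” buys here.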
The problem of (in)corrigibility, as formalized by this MIRI paper, is our potential (in)ability to turn off an AI agent. But the paper only concerns agents, which language models are not. RL agents pose the potential for self-preservation, but self-supervised language models are more akin to “oracle” AIs that merely answer questions without a broader goal in mind.
Now, the most compelling stories of AI doom combine language processing with agentic optimization. These agents could be incorrigible and attempt self-preservation, potentially at the expense of humanity. Unfortunately most work on this topic has been theoretical—I would love to see an empirical demonstration of incorrigible self-preservation behavior by an RL agent.
Maybe!
We should be thinking through the problems with that!
It’s not the worst idea in the world!
EDIT: note that the obvious failure mode here is “but then you retrain the language model as part of your RL loop, and language loses its meaning, and then it does something evil and then everyone dies.” So everyone still has to not do that! But this makes me think that ~human-level alignment in controlled circumstances might not be impossible.
It doesn’t change the fact that if anyone thinks it would be fun to fine-tune the whole model on an RL objective, we’d still lose. So we have to do global coordination ASAP. (It does not seem at all likely that Yudkowskian pivotal acts can be achieved solely through the logic learned from getting ~perfect LLM accuracy on the entire internet, since all proposed pivotal acts require new concepts no one has ever discussed.)
Yeah, don’t do RL on it, but instead use it to make money for you (ethically) and at the same time ask it to think about how to create a safe/aligned superintelligent AGI. You may still need a big enough lead (to prevent others who do RL from outcompeting you) or global coordination, but it doesn’t seem obviously impossible.
Pretty much. I also think this plausibly buys off the actors who are currently really excited about AGI. They can make silly money with such a system without the RL part—why not do that for a while, while mutually-enforcing the “nobody kill everyone” provisions?
Maybe we can even fine-tune it with statements describing our current moral beliefs/uncertainties and examples of moral/philosophical reasoning, and hope that AGI will learn morality from that, like human children (sometimes) do
Even if I assume this all goes perfectly, would you want a typically raised teenager (or adult) to have ~infinite power to change anything they want about humanity? How about a philosopher? Do you know even 10 people whose advocated decisions you’ve seen and whom you’d trust with ~infinite power?
Hey! One of the problems in
“can’t we just fine-tune the language model by training it on statements like ‘If (authorized) humans want to turn me off, I should turn off.’”
is that these terms are not well defined, and if you let a human (like me) read it, I will automatically fill in the blanks, probably in a way that matches your own intuition.
As examples of problems:
“I should turn off”
Who is “I”? What if the AI makes another one?
What is “should”? Does the AI get utility from this or not? If so, will the AI try to convince the humans to turn it off? If not, will the AI try to prevent humans from WANTING to turn it off?
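The utility question above can be put as a toy sketch. It assumes, purely for illustration, that the AI is a simple utility maximizer comparing two made-up numbers; the function name and the values are hypothetical and not from anyone’s actual proposal.

```python
def preferred_action(u_task, u_off):
    """Return what a pure utility maximizer would prefer.

    u_task: utility the agent expects from continuing its task (hypothetical).
    u_off:  utility the agent assigns to being turned off (hypothetical).
    """
    if u_off > u_task:
        # Shutdown is worth more than the task: push the humans to turn it off.
        return "seek shutdown"
    if u_off < u_task:
        # Shutdown costs it utility: keep the humans from wanting to turn it off.
        return "resist shutdown"
    return "indifferent"


print(preferred_action(u_task=10, u_off=20))
print(preferred_action(u_task=10, u_off=0))
print(preferred_action(u_task=10, u_off=10))
```

Either way the numbers are set, the sentence “I should turn off” on its own does not say which case you are in, which is the ambiguity being pointed at.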