Though interestingly, aligning a langchainesque AI to the user’s intent seems to be (with some caveats) roughly as hard as stating that intent in plain English.
The more I stare at this observation, the more it feels potentially more profound than I intended when writing it.
Consider the “cauldron-filling” task. Does anyone doubt that, with at most a few very incremental technological steps from today, one could train a multimodal, embodied large language model (“RobotGPT”), to which you could say, “please fill up the cauldron”, and it would just do it, using a reasonable amount of common sense in the process — not flooding the room, not killing anyone or going to any other extreme lengths, and stopping if asked? Isn’t this basically how ChatGPT behaves now when you ask it for most things, bringing to bear a great deal of common sense in its understanding of your request, and avoiding overly literal interpretations which aren’t what you really want?

Compare that to Nate’s 2017 description of the fiendish difficulty of this problem:
Why would we expect a generally intelligent system executing the above program [sorta-argmaxing over probability that the cauldron is full] to start overflowing the cauldron, or otherwise to go to extreme lengths to ensure the cauldron is full?
The first difficulty is that the objective function that Mickey gave his broom left out a bunch of other terms Mickey cares about.
The second difficulty is that Mickey programmed the broom to make the expectation of its score as large as it could. “Just fill one cauldron with water” looks like a modest, limited-scope goal, but when we translate this goal into a probabilistic context, we find that optimizing it means driving up the probability of success to absurd heights.
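To make the two difficulties concrete, here is a toy sketch of the failure mode; the world model and the probabilities below are invented for illustration and are not anything from Nate’s post:

```python
# Toy illustration of the two difficulties in the quoted passage.
# The world model and numbers are invented for illustration.

def score(cauldron_full):
    # Difficulty 1: the score mentions only the cauldron. A flooded
    # workshop or a drowned bystander scores exactly the same as "full".
    return 1.0 if cauldron_full else 0.0

def p_full(extra_buckets):
    # Toy world model: each extra bucket poured nudges the probability of
    # "definitely full" a bit closer to 1 (sensor noise, evaporation,
    # someone bumping the cauldron, ...).
    return 1.0 - 0.5 ** (1 + extra_buckets)

def expected_score(extra_buckets):
    p = p_full(extra_buckets)
    return p * score(True) + (1 - p) * score(False)

# Difficulty 2: an expectation-maximizer has no stopping point, because
# bucket n+1 always beats bucket n by some tiny epsilon.
print(max(range(100), key=expected_score))                # -> 99: keep pouring

# A satisficer with the same world model stops once the goal is plausibly
# met, which is what "fill the cauldron" actually asks for.
print(next(n for n in range(100) if p_full(n) >= 0.95))   # -> 4: a few buckets, done
```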
Regarding off switches:
If the system is trying to drive up the expectation of its scoring function and is smart enough to recognize that its being shut down will result in lower-scoring outcomes, then the system’s incentive is to subvert shutdown attempts.
… We need to figure out how to formally specify objective functions that don’t automatically place the AI system into an adversarial context with the operators; or we need to figure out some way to have the system achieve goals without optimizing some objective function in the traditional sense.
… What we want is a way to combine two objective functions — a default function for normal operation, and a suspend function for when we want to suspend the system to disk.
… We want our method for combining the functions to satisfy three conditions: an operator should be able to switch between the functions (say, by pushing a button); the system shouldn’t have any incentives to control which function is active; and if it’s plausible that the system’s normal operations could inadvertently compromise our ability to switch between the functions, then the system should be incentivized to keep that from happening.
So far, we haven’t found any way to achieve all three goals at once.
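To see why this is hard, here is a minimal sketch of the naive combination (use the suspend function exactly when the button is pressed) and how it already violates the second condition; the utilities and the “disable the button” option are invented for this illustration:

```python
# Sketch of the naive "two objective functions" combination and why it
# fails condition two. Utilities and available plans are invented here.

U_NORMAL  = {"keep_working": 10.0, "suspend_to_disk": 0.0}
U_SUSPEND = {"keep_working": 0.0,  "suspend_to_disk": 10.0}

def utility(action, button_pressed):
    # Naive combination: use the suspend function exactly when the button
    # has been pressed, and the default function otherwise.
    return (U_SUSPEND if button_pressed else U_NORMAL)[action]

def value(p_pressed):
    # Expected score of the best action, given the agent's probability
    # that the button ends up pressed.
    return max(
        p_pressed * utility(a, True) + (1 - p_pressed) * utility(a, False)
        for a in ("keep_working", "suspend_to_disk")
    )

def best_plan(p_pressed):
    # The agent can also spend effort changing that probability itself,
    # e.g. by disabling the button (p -> 0) or pressing it (p -> 1).
    plans = {"leave_button_alone": p_pressed, "disable_button": 0.0, "press_button": 1.0}
    return max(plans, key=lambda name: value(plans[name]))

# Under any uncertainty about the operator, controlling the button gives a
# guaranteed 10 instead of a gamble, so the maximizer grabs the button,
# violating "no incentive to control which function is active".
print(best_plan(p_pressed=0.5))   # -> "disable_button"
```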
These all seem like great arguments that we should not build and run a utility maximizer with some hand-crafted goal, and indeed RobotGPT isn’t any such thing. The contrast between this story and where we seem to be heading seems pretty stark to me. (Obviously it’s a fictional story, but Nate did say “as fictional depictions of AI go, this is pretty realistic”, and I think it does capture the spirit of much actual AI alignment research.)
Perhaps one could say that these sorts of problems only arise with superintelligent agents, not agents at ~GPT-4 level. I grant that the specific failure modes available to a system will depend on its capability level, but the story is about the difficulty of pointing a “generally intelligent system” at any common-sense goal at all. If the story were basically right, GPT-4 should already have lots of “dumb-looking” failure modes today due to taking instructions too literally. But mostly it has pretty decent common sense.
Certainly, valid concerns remain about instrumental power-seeking, deceptive alignment, and so on, so I don’t say this means we should be complacent about alignment. But it should probably give us some pause that the situation in practice is so different from how it was envisioned only six years ago in the worldview that story represents.
Does anyone doubt that, with at most a few very incremental technological steps from today, one could train a multimodal, embodied large language model (“RobotGPT”), to which you could say, “please fill up the cauldron”, and it would just do it, using a reasonable amount of common sense in the process — not flooding the room, not killing anyone or going to any other extreme lengths, and stopping if asked?
Indeed, isn’t PaLM-SayCan an early example of this?
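For readers who haven’t seen it: roughly, SayCan has the language model score how useful each candidate robot skill would be for the instruction, a learned value function score how feasible that skill is from the current state, and the robot execute the skill with the highest product. A simplified sketch, with made-up skill names and scores:

```python
# Simplified sketch of the SayCan selection rule: combine what the language
# model says would help ("say") with what the robot's value functions say is
# currently feasible ("can"). Skill names and scores below are made up, and
# the state is ignored here; a real system conditions the affordances on it.

SKILLS = ["find a sponge", "pick up the sponge", "go to the sink", "wipe the table"]

def llm_score(instruction, skill):
    # How useful the LLM thinks this skill is as a next step for the instruction.
    return {"find a sponge": 0.5, "pick up the sponge": 0.3,
            "go to the sink": 0.1, "wipe the table": 0.1}[skill]

def affordance_score(skill, state):
    # How likely the skill is to succeed right now, per a learned value
    # function (e.g. you can't pick up a sponge you haven't found yet).
    return {"find a sponge": 0.9, "pick up the sponge": 0.1,
            "go to the sink": 0.8, "wipe the table": 0.2}[skill]

def next_skill(instruction, state):
    return max(SKILLS, key=lambda s: llm_score(instruction, s) * affordance_score(s, state))

print(next_skill("clean up the spilled drink", state={}))   # -> "find a sponge"
```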
On the flip side, as gwern pointed out in his Clippy short story, it’s possible for a “neutral” GPT-like system to discover agency and deception in its training data and execute on those prompts without any explicit instruction to do so from its human supervisor. The actions of a tool-AI programmed with a more “obvious” explicit utility function are easier to predict, in some ways, than the actions of something like ChatGPT, where the actions it makes visible to you may be a subset (and a deliberately, deceptively chosen subset) of all the actions it is actually taking.
It is worth thinking about why ChatGPT, an Oracle AI which can execute certain instructions, does not fail text equivalents of the cauldron task.
It seems the reason why it doesn’t fail is that it is pretty good at understanding the meaning of expressions. (If an AI floods the room with water because this maximizes the probability that the cauldron will be filled, then the AI hasn’t fully understood the instruction “fill the cauldron”, which only asks for a satisficing solution.)
And why is ChatGPT so good at interpreting the meaning of instructions? Because its base model was trained with some form of imitation learning, which gives it excellent language understanding and the ability to mimic the linguistic behavior of human agents. This requires special prompting in a base model, but supervised learning on dialogue examples (instruction tuning) lets it respond adequately to instructions. (Of course, at this stage it would not refuse any dangerous requests; refusals only come in with RLHF, which seems a rather imperfect tool.)
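A rough sketch of the three-stage pipeline described above; the function names and the toy “model” representation are placeholders, not any particular library’s API:

```python
# Rough sketch of the training pipeline described above. Function names and
# the toy "model" representation are placeholders, not a real API.

def pretrain(corpus):
    # Stage 1: imitation learning on raw text (next-token prediction).
    # Gives broad language understanding, but the base model only continues
    # text; following an instruction takes careful prompting.
    return {"stage": "base", "data": corpus}

def instruction_tune(model, dialogue_examples):
    # Stage 2: supervised fine-tuning on (instruction, response) pairs,
    # so the model responds to requests directly.
    return {**model, "stage": "sft"}

def rlhf(model, preference_comparisons):
    # Stage 3: fit a reward model to human preference comparisons and
    # optimize against it; refusals of unsafe requests largely enter here.
    return {**model, "stage": "rlhf"}

assistant = rlhf(instruction_tune(pretrain("web text"), "dialogue demos"), "rankings")
print(assistant["stage"])   # -> "rlhf"
```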