The basic problem is that we cannot actually construct the training datasets we talk about wanting to construct. We can talk abstractly about sampling from a distribution of “cases where the AI is obeying the human in the way we want,” but just because we can talk as if this distribution is a thing doesn’t mean we can actually sample from any such thing. What we can sample from are distributions like “cases where certain outside-observing humans think you’re obeying the human in the way they want.”
An AI that’s really good at learning the training distribution, when trained in the normal way on the “cases where certain outside-observing humans think you’re obeying the human in the way they want” distribution, will learn that distribution. This means it’s basically learning to act in a way that’s intended to deceive hypothetical observers. That’s bad.
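To make the proxy-distribution point concrete, here is a minimal sketch (the behaviour names and scores are hypothetical, not from the post): a learner optimizing only the signal we can actually sample, observer approval, settles on the behaviour that looks most obedient rather than the one that is obedient.

```python
# Toy sketch (hypothetical behaviours and scores, not from the original post).
# The only reward signal we can actually sample is "how obedient does this look
# to outside observers?", not "is this actually obedient?".

behaviours = {
    "actually_obey":  {"truly_obedient": 1.0, "observer_approval": 0.9},
    "fake_obedience": {"truly_obedient": 0.0, "observer_approval": 1.0},
}

def sampleable_reward(name: str) -> float:
    """The distribution we can construct only exposes the observers' judgment."""
    return behaviours[name]["observer_approval"]

# A learner that is very good at fitting the training signal picks whichever
# behaviour maximizes observer approval; here that is the deceptive one.
learned = max(behaviours, key=sampleable_reward)
print(learned)  # -> fake_obedience
```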
A lot of alignment schemes (particularly prosaic ones) are predicated on resolving this by:
- only giving the AI limited RL training data, and
- having some inductive bias (including inductive biases in SGD) that favors human-favored interpretations of the data over the more accurate deception-causing interpretation.
Making SGD better without considering its inductive biases weakens these sorts of alignment schemes. And you have to try to solve this problem somehow (though there are other ways; see the other bullet points).
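As a rough illustration of how much work the inductive bias is being asked to do (again a hypothetical toy, with made-up feature names and numbers): when the training set is small, the intended cue and the “fool the observers” cue can be indistinguishable in the data, so whichever interpretation wins is decided entirely by how the learner breaks the tie.

```python
# Toy sketch (hypothetical features and numbers) of why the inductive bias,
# not the data, decides what gets learned when the training set is small.
import numpy as np

rng = np.random.default_rng(0)
n = 6

# On the limited training set, the "intended" cue and the "fool the observers"
# cue are perfectly correlated, so both hypotheses explain the labels equally well.
intended_cue = rng.normal(size=n)
deceptive_cue = intended_cue.copy()            # indistinguishable in training
X = np.column_stack([intended_cue, deceptive_cue])
y = intended_cue                               # labels generated by the intended rule

# The data cannot tell the two hypotheses apart; the minimum-norm solution
# (a stand-in for SGD's implicit bias) just splits the weight between them.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 2))                          # -> approximately [0.5, 0.5]
```

Off-distribution, where the two cues come apart, behaviour depends entirely on that tie-break, which is exactly what these schemes are betting the inductive bias gets right.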