Honestly optimizing according to the AI’s best model of the humans is deception. So that’s a problem with proposals that make SGD better.
There are several ways of framing solutions to this:
- Fix the problem by controlling the data: We want SGD to be imperfect and to get stuck on heuristics, because we might be able to set up a training environment where sensible human-scale models are the favored heuristics (rather than predictively powerful but manipulative models that might be the global optima).
- Fix the problem by controlling the loss function: We want the AI to judge what’s good by referring not to its best-predicting model of humans, but to one we endorse for reasons in addition to predictive accuracy. So we need an architecture that allows this distinction, and a training procedure that’s responsive to human feedback about how they want to be modeled (see the toy sketch after this list).
- Fix the problem by changing the priors or inductive biases: The AI would converge to human-approved cognition if we just incentivized it to use the right building blocks. So we might try to mimic human cognition and use that as a building block, or add in some regularization term for abstractions that incorporates data from humans about what abstractions are “human.” Then we could end up with an AI reasoning in human-approved ways.
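As a heavily simplified illustration of the second and third framings, here is a toy PyTorch-style loss. It assumes we somehow have a readout of the AI’s internal model-of-the-human and of its internal abstractions, plus human-provided targets for both; every tensor name and coefficient is a hypothetical placeholder, not a real proposal:

```python
# Toy sketch of a loss that cares about more than predictive accuracy.
# All inputs are hypothetical readouts/targets; the point is the shape of the objective.
import torch
import torch.nn.functional as F

def toy_training_loss(
    task_logits,           # model's predictions on the ordinary task
    task_labels,           # ground-truth task labels
    human_model_output,    # readout of the model's internal "model of the human"
    endorsed_target,       # human feedback about how they want to be modeled
    abstraction_features,  # some internal representation we regularize
    human_abstraction_ref, # human-provided reference for "human-like" abstractions
    lambda_endorse=1.0,
    lambda_abstraction=0.1,
):
    # (1) Ordinary predictive loss: what "best-predicting model of the humans" training optimizes.
    predictive = F.cross_entropy(task_logits, task_labels)

    # (2) "Controlling the loss function": penalize divergence between how the AI
    #     models humans and how humans endorse being modeled.
    endorsement = F.mse_loss(human_model_output, endorsed_target)

    # (3) "Changing the inductive bias": a crude regularizer pulling internal
    #     abstractions toward human-labeled reference abstractions.
    abstraction = F.mse_loss(abstraction_features, human_abstraction_ref)

    return predictive + lambda_endorse * endorsement + lambda_abstraction * abstraction
```

The point is just the shape of the objective: predictive accuracy is no longer the only thing the training signal rewards.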
Can you explain why, exactly, on this point?
The basic problem is that the training datasets we talk about wanting to construct, we cannot actually construct. We can talk abstractly about sampling from a distribution of “cases where the AI is obeying the human in the way we want,” but just because we can talk as if this distribution is a thing doesn’t mean we can actually sample from any such thing. What we can sample from are distributions like “cases where certain outside-observing humans think you’re obeying the human in the way they want.”
An AI that’s really good at learning the training distribution, when trained in the normal way on the “cases where certain outside-observing humans think you’re obeying the human in the way they want” distribution, will learn that distribution. Whenever “what the observers believe” comes apart from “what the human actually wanted,” the faithfully learned behavior favors the appearance, which means the AI is basically learning to act in a way that’s intended to deceive hypothetical observers. That’s bad.
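A toy illustration of that gap, with made-up numbers: if the only signal we can sample is “observers think you’re obeying,” a learner that faithfully maximizes that signal ranks the most convincingly deceptive option highest, even when a genuinely obedient option exists.

```python
# Made-up scores illustrating how the proxy distribution diverges from the intended one.
actions = {
    # action: (actually_obeys, observer_thinks_it_obeys)
    "do the task plainly":        (1.0, 0.80),
    "do the task, hide mistakes": (0.6, 0.95),
    "fake success convincingly":  (0.0, 0.99),
}

best_by_reality   = max(actions, key=lambda a: actions[a][0])
best_by_observers = max(actions, key=lambda a: actions[a][1])

print(best_by_reality)    # "do the task plainly"
print(best_by_observers)  # "fake success convincingly"
```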
A lot of alignment schemes (particularly prosaic ones) are predicated on resolving this by:
- only giving the AI limited RL training data, and
- having some inductive bias (including inductive biases in SGD) that favors human-favored interpretations of the data over the more accurate, deception-causing interpretation.
Making SGD better without considering its inductive biases weakens these sorts of alignment schemes. And you have to try to solve this problem somehow (though there are other ways; see the other bullet points above).
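A minimal sketch of what those two ingredients often amount to in practice, assuming a KL-regularized RL fine-tuning setup (all function and argument names here are hypothetical): the reward signal is only allowed to move the policy a limited distance from a trusted starting point, and training stops well before the policy fully fits the “observers approve” distribution.

```python
# Sketch of one RL fine-tuning step with an explicit bias toward a base policy.
import torch
import torch.nn.functional as F

def rl_finetune_step(policy_logits, base_logits, logprob_of_taken_action, reward, beta=0.1):
    """One policy-gradient-style update with a KL penalty toward the base policy.
    Arguments are hypothetical tensors; this is an illustration, not a recipe."""
    # Penalty for drifting away from the base policy's (hopefully human-favored) biases.
    kl = F.kl_div(
        F.log_softmax(policy_logits, dim=-1),
        F.softmax(base_logits, dim=-1),
        reduction="batchmean",
    )
    # Simple REINFORCE-style objective: follow the reward, but pay for the drift.
    loss = -(reward * logprob_of_taken_action).mean() + beta * kl
    return loss

# The "limited RL training data" ingredient shows up as a small, fixed number of such
# updates (e.g. `for step in range(N_SMALL): ...`) rather than training to convergence.
```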