…That is, in any sufficiently diverse environment, the SGD just never converges to zero loss?
Realistically speaking, I think this is true. E.g., imagine how computationally expensive it would be to train a model to (near) zero loss on GPT-3’s training data. Compute-optimal (or even “compute slightly efficient”) models are not trained anywhere near that much. I strongly expect this to be true of superintelligent models as well.
I disagree with:
it’d be behaviorally indistinguishable from a wrapper-mind [optimizing for R]
Even in the limit of extreme overtraining on a fixed R (assuming an R-optimizing wrapper-mind is even learnable by your training process), this still does not get you a system that is perfectly internally aligned to R-maximization. The reason is that real-world reward functions do not uniquely specify a single fixed point of training.
E.g., suppose R gives lots of reward for doing drugs. Do all fixed points do drugs? I think the answer is no. If the system refuses to explore drug use in any circumstance, then it won’t be updated towards drug use, and so it constitutes a non-drug-using fixed point. Such a system would get lower reward than one that did use drugs, but the training process wouldn’t penalize it for that choice or push it towards drug use.
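To make the fixed-point claim concrete, here’s a toy sketch (entirely my own construction, with made-up numbers, not anything from the setups above): a two-armed bandit trained with a standard tabular value update, where the high-reward arm stands in for “doing drugs”. With zero exploration, a learner that starts out preferring the low-reward arm never samples the other one, so nothing ever updates it towards drug use; it sits at a stable fixed point of training despite earning less reward.

```python
import numpy as np

# Toy illustration (my own made-up setup): R pays 10 for arm 1 ("do drugs")
# and 1 for arm 0. A learner initialized to prefer arm 0 and trained with no
# exploration never samples arm 1, so its estimate for arm 1 is never updated:
# the abstaining policy is a fixed point of training, despite lower reward.
R = np.array([1.0, 10.0])            # reward per arm
alpha = 0.1                          # learning rate

def train(explore_prob, steps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    q = np.array([1.0, 0.0])         # initial value estimates: prefers arm 0
    for _ in range(steps):
        a = rng.integers(2) if rng.random() < explore_prob else int(np.argmax(q))
        q[a] += alpha * (R[a] - q[a])   # tabular update touches only the visited arm
    return q

print(train(explore_prob=0.0))       # stays [1., 0.]: never updated towards arm 1
print(train(explore_prob=0.1))       # ends near [1., 10.]: now prefers arm 1
```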
The only training process I can imagine which might consistently converge to a pure R-maximizer involves some form of exploration guarantee that ensures the agent tries out all possible rewarding trajectories arbitrarily often. Note that this is a far stronger condition than just being trained in many diverse environments, to the point that I’m fairly confident we’ll never do this with any realistic AGI agent. E.g., consider just how much of a challenge it is to create an image classifier that’s robust to arbitrary adversarial image perturbations, and how vastly larger still the space of possible world histories over which R could be defined is.
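Some back-of-the-envelope arithmetic on that comparison (my numbers, purely illustrative): take a single ImageNet-sized frame as a stand-in for the inputs an adversarially-robust classifier has to handle; the space of world histories is that already-unmanageable space raised to the power of the episode length.

```python
from math import log10

# Illustrative sizes only (my own assumptions): one 224x224 RGB frame with 256
# intensity levels per channel as a single "observation", and a world history
# as a sequence of 1,000 such frames.
pixels, levels = 224 * 224 * 3, 256
horizon = 1_000
obs_digits = pixels * log10(levels)              # ~3.6e5 decimal digits per frame
print(f"distinct frames:    ~10^{obs_digits:,.0f}")
print(f"distinct histories: ~10^{obs_digits * horizon:,.0f}")
```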
I agree that we aren’t going to actually get a pure wrapper-mind in practice, let alone an inner-aligned wrapper-mind. It very much only happens in the limit of a “perfect” training process.
But I argue that, inasmuch as training processes approximate this perfect ideal, the minds we get out of them will approximate an R-aligned wrapper-mind. The fact that practical exploration policies fall short of an idealized “all possible rewarding trajectories” exploration policy is just another way for a training process to be an imperfect approximation; and the closer the approximation (the more exhaustive the exploration policy), the more closely the resulting agent will approximate an R-maximizer.
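As a toy illustration of that last sentence (again my own construction, not anything from the actual training setups under discussion): a many-armed bandit with a fixed training budget, where the fraction of steps spent exploring controls how close the final greedy policy tends to land to the true R-maximizer.

```python
import numpy as np

# Toy sketch (my own construction): 1,000 arms with random rewards and a fixed
# training budget. Rewards are deterministic here, so a single visit reveals an
# arm's value; more exhaustive exploration tends to leave the final greedy
# policy closer to the true R-maximizing arm.
rng = np.random.default_rng(0)
R = rng.uniform(0, 1, size=1_000)                # hypothetical reward per arm

def train(explore_prob, steps=10_000, seed=1):
    r = np.random.default_rng(seed)
    q = np.zeros(len(R))
    for _ in range(steps):
        a = r.integers(len(R)) if r.random() < explore_prob else int(np.argmax(q))
        q[a] = R[a]                              # value revealed on visit
    return R[int(np.argmax(q))]                  # reward of the final greedy policy

for eps in (0.001, 0.01, 0.1, 1.0):
    print(f"explore_prob={eps}: {train(eps):.3f} (best possible: {R.max():.3f})")
```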
For my argument to go through, we only need an exploration policy + reinforcement schedule that place some sufficient constraint on R, combined with a training environment diverse enough that re-targeting one’s heuristics/shards at R at runtime becomes necessary.
Hmm, maybe I’d underappreciated that last condition, actually. Imagine a training environment which often introduces scenarios the agent has never encountered before, i.e. scenarios that are OOD with regard to its earlier training. The only agents that can stay (roughly) aimed at R in this case are those that incorporate (a good proxy of) R in themselves, and can re-orient themselves back towards R (or in its rough direction) even when taken off-distribution. I think this is the “sufficient diversity” condition I’m talking about.
And then we can approximate this condition by postulating, e.g.:
an environment that sometimes takes agents to points that are on-distribution but far from the distribution’s center, or
an environment which gradually changes in-episode, such that the agent needs some mechanism for keeping itself aimed at R through the change (see the toy sketch below), or
a combination of environment complexity + memory constraints such that the agent can only store an optimal set of heuristics for a subset of that environment, which requires it to have some mechanism for re-deriving new R-aligned heuristics at runtime if it wants to move through the rest of it.
(And then I suspect that we only get to an AGI under such circumstances; any less adversity than that, and we indeed just get stuck with shallow heuristics that don’t generalize and can’t do anything genuinely exciting.)
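To make the second bullet above concrete, here’s a minimal toy sketch (entirely my own construction; DriftingEnv and everything in it are hypothetical): an environment whose context-to-rewarded-action mapping drifts within the episode. A frozen lookup table of heuristics decays towards chance-level reward as the drift accumulates, while a policy that keeps checking its behaviour against the reward signal, i.e. one that carries some proxy of R around with it, stays on target.

```python
import random

class DriftingEnv:
    """Each step shows a context in {0, 1}; the rewarded action for each
    context occasionally flips during the episode (in-episode drift)."""
    def __init__(self, horizon=200, flip_prob=0.02, seed=0):
        self.horizon, self.flip_prob = horizon, flip_prob
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.correct = {0: 0, 1: 1}              # context -> currently-rewarded action
        self.context = self.rng.randint(0, 1)
        return self.context

    def step(self, action):
        reward = 1.0 if action == self.correct[self.context] else 0.0
        if self.rng.random() < self.flip_prob:   # gradual in-episode change
            c = self.rng.randint(0, 1)
            self.correct[c] = 1 - self.correct[c]
        self.t += 1
        self.context = self.rng.randint(0, 1)
        return self.context, reward, self.t >= self.horizon

def run(env, update):
    """One episode with a per-context action table, optionally updated from
    reward feedback after every step."""
    table = {0: 0, 1: 1}                         # heuristics that were correct at t = 0
    obs, done, total = env.reset(), False, 0.0
    while not done:
        a = table[obs]
        obs_next, r, done = env.step(a)
        update(table, obs, a, r)
        obs, total = obs_next, total + r
    return total

def freeze(table, context, action, reward):
    pass                                         # ignore feedback: heuristics stay fixed

def track_reward(table, context, action, reward):
    # win-stay / lose-shift: if the action stopped paying, flip it for that context
    table[context] = action if reward else 1 - action

frozen = run(DriftingEnv(seed=0), freeze)
tracking = run(DriftingEnv(seed=0), track_reward)
print(f"frozen heuristics: {frozen:.0f}/200, reward-tracking policy: {tracking:.0f}/200")
```

The win-stay/lose-shift rule is obviously a much cruder “mechanism for keeping itself aimed at R” than anything we’re actually worried about; the point is only that once the environment drifts in-episode, some such mechanism becomes load-bearing.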