But existence of such populations and weight settings doesn’t imply net local pressures or gradients in those directions.
How so? This seems like the core disagreement. Above, I think you’re agreeing that under a wide enough distribution on scenarios, the only zero-gradient agent-designs are those that optimize for R directly. Yet that somehow doesn’t imply that training an agent in a sufficiently diverse environment would shape it into an R-optimizer?
Are you just saying that there aren’t any gradients from initialization to an R-optimizer? That is, in any sufficiently diverse environment, the SGD just never converges to zero loss?
Of shard economies, you critique that “there’d be at least one environment where [the shard behavior] decouples from R.” But why? Why not just consider an economy which nails each training scenario (e.g., wins at chess or crosses the room)? Those, too, are fixed points: there is zero policy gradient in such scenarios, where the shard economies form locally training-optimal policies.
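To make the fixed-point intuition concrete, here’s a toy sketch (the scenario names, rewards, and logits are all made up for illustration): for a softmax policy that already takes the best action available in each training scenario, the exact policy gradient only sharpens that same choice, and it shrinks towards zero as the policy commits. Nothing in this signal pushes the economy towards becoming a structurally different, explicitly R-optimizing policy.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Two hypothetical training scenarios; in each, the shard economy already puts
# almost all of its probability mass on the best action that scenario offers.
scenarios = {
    "win_at_chess":   {"rewards": np.array([1.0, 0.2]), "logits": np.array([8.0, 0.0])},
    "cross_the_room": {"rewards": np.array([1.0, 0.0]), "logits": np.array([8.0, 0.0])},
}

for name, s in scenarios.items():
    pi = softmax(s["logits"])
    expected_reward = pi @ s["rewards"]
    # Exact policy gradient for a softmax policy: dJ/d(logit_a) = pi(a) * (r(a) - E[r]).
    grad = pi * (s["rewards"] - expected_reward)
    print(name, "| policy:", pi.round(4), "| gradient:", grad)
```

The printed gradients are on the order of 1e-4 and only push the already-preferred action’s logit up; raise the logits (a more committed economy) and they decay exponentially. In that sense the locally training-optimal shard economy is, for training purposes, a fixed point, regardless of how it would behave in scenarios outside this set.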
Okay, sure. Let’s suppose that we have a shard economy that uniquely identifies R and always points itself in R’s direction. Would it not essentially act as an R-optimizing wrapper-mind? Because if not, it sounds like it’d underperform compared to an R-optimizer. And if it would, then, so long as there exists a series of incremental updates that moves this shard economy towards an R-optimizing wrapper-mind, the SGD would make that series of updates.
Do you disagree that (1) it’d be behaviorally indistinguishable from a wrapper-mind, or that (2) it’d underperform on R compared to an R-optimizer, or that (3) there is such a series of incremental updates?
Edit: Also, see here on what I mean by a “wide enough distribution on scenarios”.
…That is, in any sufficiently diverse environment, the SGD just never converges to zero loss?
Realistically speaking, I think this is true. E.g., imagine how computationally expensive it would be to train a model to (near) zero loss on GPT-3’s training data. Compute-optimal (or even “compute slightly efficient”) models are not trained nearly that much. I strongly expect this to be true of superintelligent models as well.
I disagree with:
it’d be behaviorally indistinguishable from a wrapper-mind [optimizing for R]
Even in the limit of extreme overtraining on a fixed R (assuming an R-optimizing wrapper-mind is even learnable by your training process), this still does not get you a system that is perfectly internally aligned to R-maximization. The reason is that real-world reward functions do not uniquely specify a single fixed point.
E.g., suppose R gives lots of reward for doing drugs. Do all fixed points do drugs? I think the answer is no. If the system refuses to explore drug use in any circumstance, then it won’t be updated towards drug use, and so it can form a non-drug-using fixed point. Such a configuration would get lower reward than one that did use drugs, but the training process wouldn’t penalize it for that choice or push it towards doing drugs.
The only training process I can imagine which might consistently converge to a pure R-maximizer involves some form of exploration guarantee ensuring the agent tries out all possible rewarding trajectories arbitrarily often. Note that this is a far stronger condition than just being trained in many diverse environments, to the point that I’m fairly confident we’ll never do this with any realistic AGI agent. E.g., consider how much of a challenge it is to create an image classifier that’s robust to arbitrary adversarial image perturbations, and then consider how vastly larger the space of possible world histories over which R could be defined is.
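A minimal illustration of both points, with made-up action names, rewards, and hyperparameters: a purely greedy learner that never samples the high-R action sits at a stable non-drug-using fixed point indefinitely, while a learner with even a weak exploration guarantee (ε-greedy) eventually finds the high-R action and locks onto it.

```python
import random

# Hypothetical one-step environment: R pays far more for "drugs" than for "work".
REWARDS = {"work": 1.0, "drugs": 10.0}
ACTIONS = list(REWARDS)

def train(epsilon, steps=5000, lr=0.1, seed=0):
    rng = random.Random(seed)
    q = {"work": 0.5, "drugs": 0.0}  # initial value estimates
    for _ in range(steps):
        if rng.random() < epsilon:        # explore: try a random action
            action = rng.choice(ACTIONS)
        else:                             # exploit: take the currently-best-looking action
            action = max(q, key=q.get)
        # Only the value of the action actually taken ever gets updated.
        q[action] += lr * (REWARDS[action] - q[action])
    return q

print("greedy (no exploration):", train(epsilon=0.0))
# q["drugs"] never moves: the non-drug policy is a fixed point despite the lower reward.
print("epsilon-greedy:         ", train(epsilon=0.1))
# q["drugs"] converges towards 10, and the greedy action flips to "drugs".
```

The greedy run is exactly the “refuses to explore drug use in any circumstance” configuration: it earns less reward, but nothing in the update rule ever pushes it towards drugs. The exploration guarantee I have in mind would need to do for all possible rewarding world-histories what ε-greedy does for these two actions, which is a far stronger condition than environmental diversity alone.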
I agree that we aren’t going to actually get a pure wrapper-mind in practice, let alone an inner-aligned wrapper-mind. It very much only happens in the limit of a “perfect” training process.
But I argue that, inasmuch as training processes approximate this perfect ideal, the minds we get out of them will approximate an R-aligned wrapper-mind. The fact that practical exploration policies fall short of the idealized “all possible rewarding trajectories” exploration policy is just another way for a training process to be an imperfect approximation; and the better the approximation (the more exhaustive the exploration policy), the more closely the resulting agent will approximate an R-maximizer.
For my argument to go through, we only need an exploration policy + reinforcement schedule that jointly put a sufficient constraint on R, while simultaneously making the training environment diverse enough that the agent has to re-target its heuristics/shards at R at runtime.
Hmm, maybe I’d underappreciated that last condition, actually. Imagine a training environment which often introduces scenarios the agent has never encountered before — scenarios that are OOD with respect to its earlier training. The only agents that can stay (roughly) aimed at R in this case are those that incorporate (a good proxy of) R in themselves, and can re-orient themselves back towards R (or in its rough direction) even when taken off-distribution. I think this is the “sufficient diversity” condition I’m talking about.
And then we can approximate this condition by postulating, e.g.:
an environment that sometimes takes agents to points that are on-distribution but far from the distribution’s center, or
an environment which gradually changes in-episode, such that the agent needs some mechanism for keeping itself aimed at R throughout (a toy sketch of this follows the list), or
a combination of environment complexity + memory constraints such that the agent can only store an optimal set of heuristics for a subset of that environment, which requires it to have some mechanism for re-deriving new R-aligned heuristics at runtime if it wants to operate across the whole environment.
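As a toy sketch of the in-episode-change condition (the task, numbers, and agent designs are all invented for illustration): an environment whose reward-relevant target drifts during the episode. An agent running a heuristic frozen at the start of the episode falls behind, while an agent that re-derives where R currently points from its observations keeps scoring.

```python
import random

def run_episode(agent, steps=200, seed=0):
    """Hypothetical 1-D 'track the target' task; per-step reward is -|position - target|."""
    rng = random.Random(seed)
    target, position, total_reward = 0.0, 0.0, 0.0
    for _ in range(steps):
        target += rng.uniform(-0.5, 1.0)     # the environment drifts within the episode
        position += agent(position, target)  # the agent observes the state and moves
        total_reward -= abs(position - target)
    return total_reward

def frozen_heuristic(position, target):
    # Cached rule from the start of the episode: "the target sits near 0"; ignores the drift.
    return (0.0 - position) * 0.5

def re_orienting(position, target):
    # Re-reads where the target currently is and re-aims at it every step.
    return (target - position) * 0.5

print("frozen heuristic:", round(run_episode(frozen_heuristic), 1))
print("re-orienting:    ", round(run_episode(re_orienting), 1))
```

The only point of the toy is that staying aimed at R through in-episode change requires carrying something that lets the agent re-locate R at runtime (here, reading the current target off the observation), rather than a cached policy that merely happened to be R-aligned at the start.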
(And then I suspect that we only get to an AGI under such circumstances; any less adversity than that, and we indeed just get stuck with shallow heuristics that don’t generalize and can’t do anything genuinely exciting.)