I agree that we aren’t going to actually get a pure wrapper-mind in practice, let alone an inner-aligned wrapper-mind. It very much only happens in the limit of a “perfect” training process.
But I argue that, inasmuch as training processes approximate this perfect ideal, the minds we get out of them will approximate an R-aligned wrapper-mind. The fact that practical exploration policies fall short of an idealized “all possible rewarding trajectories” exploration policy is just another way for a training process to be an imperfect approximation; and the closer the approximation (the more exhaustive the exploration policy), the more the agent we get will approximate an R-maximizer.
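To gesture at the “more exhaustive exploration, closer to an R-maximizer” intuition, here’s a minimal toy sketch. It uses tabular Q-learning on a made-up chain MDP (the environment, rewards, and hyperparameters are all invented purely for illustration, and a toy tabular learner is obviously not the mesa-optimization setting in question): with narrow exploration the learned greedy policy latches onto the nearby small reward, while with exhaustive (uniform-random) exploration it converges on the reward-maximizing policy.

```python
import random

# Toy chain MDP: states 0..N-1 with terminal states at both ends.
# Reaching the left end gives a small reward after a couple of steps;
# reaching the right end gives a large reward, but only after a long
# detour. "R" here is just this reward signal.
N = 11
START = 2
SMALL_REWARD, LARGE_REWARD = 1.0, 10.0

def step(state, action):                 # action: 0 = left, 1 = right
    nxt = state - 1 if action == 0 else state + 1
    if nxt == 0:
        return nxt, SMALL_REWARD, True
    if nxt == N - 1:
        return nxt, LARGE_REWARD, True
    return nxt, 0.0, False

def train(epsilon, episodes=5000, alpha=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning; epsilon controls how exhaustive exploration is."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N)]
    for _ in range(episodes):
        s, done = START, False
        while not done:
            a = rng.randrange(2) if rng.random() < epsilon else int(Q[s][1] > Q[s][0])
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

for eps in (0.05, 1.0):                  # narrow vs. (near-)exhaustive exploration
    Q = train(eps)
    direction = ("right, towards the large reward"
                 if Q[START][1] > Q[START][0]
                 else "left, towards the small reward")
    print(f"epsilon={eps}: learned greedy policy at the start state goes {direction}")
```

The off-policy update is what does the work here: once every rewarding trajectory actually gets explored, the values being learned are those of the R-optimal policy, regardless of how clumsy the behavior policy is; restrict exploration, and the policy settles on whatever heuristic the rewards it did see happen to support.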
For my argument to go through, we only need an exploration policy + reinforcement schedule that together put a sufficient constraint on R, while simultaneously making the training environment diverse enough that re-targeting one’s heuristics/shards at R at runtime becomes necessary.
Hmm, maybe I’d underappreciated that last condition, actually. Imagine a training environment which often introduces scenarios the agent has never encountered before — scenarios that are OOD with respect to its earlier training. The only agents that can stay (roughly) aimed at R in this case are those that incorporate (a good proxy of) R into themselves, and can re-orient themselves back towards R (or in its rough direction) even when taken off-distribution. I think this is the “sufficient diversity” condition I’m talking about.
And then we can approximate this condition by postulating, e.g.:

- an environment that sometimes takes agents to points that are on-distribution but far from the distribution’s center, or
- an environment which gradually changes in-episode, such that the agent has to have some mechanism for keeping itself aimed at R throughout (a toy sketch of this is at the end of this comment), or
- a combination of environment complexity + memory constraints such that the agent can only store an optimal set of heuristics for a subset of that environment, which requires it to have some mechanism for re-deriving new R-aligned heuristics at runtime if it wants to move through the rest of that environment.
(And then I suspect that we only get to an AGI under such circumstances; any less adversity than that, and we indeed just get stuck with shallow heuristics that don’t generalize and can’t do anything genuinely exciting.)
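To make the second bullet above more concrete, here’s a toy sketch of the kind of environment I have in mind. Everything in it (the drifting-target setup, the two hand-written policies) is invented purely for illustration, and neither “policy” is trained; they just show what such an environment rewards: a cached heuristic aimed at where the target used to be, versus a policy that keeps consulting (a proxy of) R and re-aiming at it in-episode.

```python
import random

# A 1-D "drifting target" environment, as a cartoon of the second condition:
# the agent is rewarded for standing on the target square, but the target
# slowly drifts during the episode, so a cached "the target is at square 10"
# heuristic stops paying off, while a policy that keeps re-aiming at R does fine.
SIZE, STEPS, DRIFT_P = 20, 200, 0.3

def run_episode(policy, seed):
    rng = random.Random(seed)
    agent, target = 0, SIZE // 2
    initial_target = target
    total_reward = 0.0
    for _ in range(STEPS):
        move = policy(agent, target, initial_target)        # -1, 0, or +1
        agent = max(0, min(SIZE - 1, agent + move))
        total_reward += 1.0 if agent == target else 0.0     # this is "R"
        if rng.random() < DRIFT_P:                          # the world shifts in-episode
            target = max(0, min(SIZE - 1, target + rng.choice([-1, 1])))
    return total_reward

def cached_heuristic(agent, target, initial_target):
    # Shallow heuristic: head to where the target was at the start, then stay put.
    return (initial_target > agent) - (initial_target < agent)

def re_aiming(agent, target, initial_target):
    # Carries (a proxy of) R: checks where the target currently is and moves toward it.
    return (target > agent) - (target < agent)

for name, policy in [("cached heuristic", cached_heuristic), ("re-aims at R", re_aiming)]:
    avg = sum(run_episode(policy, s) for s in range(100)) / 100
    print(f"{name}: average reward per episode = {avg:.1f}")
```

The design choice doing the work is only that reward depends on in-episode state the agent cannot have memorized in advance; any policy that collects R reliably here has to carry something that tracks R at runtime, which is the property the bullet list is pointing at.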