ALBA: can you be “aligned” at increased “capacity”?
I think that Paul Christiano’s ALBA proposal is good in practice, but has conceptual problems in principle.
Specifically, I don’t think it makes sense to talk about bootstrapping an “aligned” agent into one that is still “aligned” but has increased capacity.
The main reason is that I don’t think “aligned” is a definition that makes sense independently of capacity.
These are not the lands of your forefathers
Here’s a simple example: let r be a reward function that is perfectly aligned with human happiness within ordinary circumstances (and within a few un-ordinary circumstances that humans can think up).
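To make the setup concrete, here is a minimal toy sketch (the one-dimensional world and the specific functions are invented for illustration, not taken from ALBA): the world is a single number x, “ordinary circumstances” are the states with |x| ≤ 1, and the proxy reward r agrees exactly with human happiness inside that range while coming apart outside it.

```python
import math

# Toy model (purely illustrative): the world state is a single number x,
# and "ordinary circumstances" are |x| <= 1, the only worlds humans ever test.

def human_happiness(x: float) -> float:
    """What we actually care about; the happiest world sits at x = 0.5."""
    return -abs(x - 0.5)

def r(x: float) -> float:
    """Proxy reward: identical to human_happiness inside ordinary circumstances,
    unconstrained (and here badly wrong) outside them."""
    if abs(x) <= 1.0:
        return -abs(x - 0.5)   # perfectly aligned in-distribution
    return x                   # rewards pushing x ever higher out of distribution

# Inside ordinary circumstances, no test can tell the two apart:
assert all(math.isclose(r(x), human_happiness(x)) for x in (-1.0, -0.3, 0.0, 0.5, 0.9))
```

A weak agent confined to |x| ≤ 1 behaves identically under r and under human_happiness; the difference can only show up once something is able to reach states outside that range.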
Then the initial agent, B0 (a human), trains a reward r1 for an agent A1. This agent is limited in some way (maybe it doesn’t have much speed or time), but the aim is for r1 to ensure that A1 is aligned with B0.
Then the capacity of A1 is increased, giving B1, a slow but powerful agent. B1 computes the reward r2 to ensure the alignment of A2, and so on.
The nature of the Bj agents is not defined—they might be algorithms calling Ai for i≤j as subroutines, humans may be involved, and so on.
If the humans are unimaginative and don’t deliberately seek out more extreme and exotic test cases, the best-case scenario is that ri → r as i → ∞.
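Here is a rough, runnable sketch of that chain under the same one-dimensional toy assumptions (the Overseer and Agent classes, distill_reward, and the capacity and reach numbers are all invented stand-ins, not ALBA’s actual machinery):

```python
from dataclasses import dataclass
from typing import Callable, List

Reward = Callable[[float], float]

@dataclass
class Agent:                  # A_i: trained on a distilled reward, capacity-limited
    reward: Reward
    capacity: float           # how far from ordinary states it can push the world

@dataclass
class Overseer:               # B_i: slow but powerful, may call earlier agents as subroutines
    judge: Reward             # how it scores the outcomes it can evaluate
    reach: float              # range of scenarios it thinks to test

def distill_reward(overseer: Overseer) -> Reward:
    """B_i trains r_{i+1}: constrained only on scenarios the overseer actually tests."""
    def r_next(x: float) -> float:
        # inside the overseer's reach, copy its judgement; outside, anything goes
        return overseer.judge(x) if abs(x) <= overseer.reach else x
    return r_next

def bootstrap(b0: Overseer, rounds: int) -> List[Agent]:
    overseer, agents = b0, []
    for i in range(rounds):
        r_next = distill_reward(overseer)                     # B_i trains r_{i+1}
        a_next = Agent(reward=r_next, capacity=float(i + 1))  # A_{i+1}, more capable each round
        agents.append(a_next)
        # B_{i+1} shares A_{i+1}'s objective; its test range never grows
        # (the "unimaginative humans" assumption above).
        overseer = Overseer(judge=r_next, reach=b0.reach)
    return agents

human = Overseer(judge=lambda x: -abs(x - 0.5), reach=1.0)   # B0, judging "ordinary" worlds
chain = bootstrap(human, rounds=5)

# Every distilled reward matches the human's judgement on ordinary states,
# i.e. ri -> r in-distribution, while remaining unconstrained outside them.
assert all(chain[-1].reward(x) == human.judge(x) for x in (-1.0, 0.0, 0.5, 1.0))
```

The design choice doing all the work is that the overseer’s reach never grows: each distilled reward is pinned to the human judgement on ordinary states and left unconstrained everywhere else.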
And eventually there will be an agent An that is powerful enough to overwhelm the whole system and take over. It will do this in full agreement with Bn−1, because they share the same objective. And then An will push the world into extra-ordinary circumstances and proceed to maximise r, with likely disastrous results for us humans.
The nature of the problem
So what went wrong? At what point did the agents go out of alignment?
In one sense, at An. In another sense, at A1 (and, in another interesting sense, at B0, the human). The reward r was aligned as long as the agent stayed within the bounds of the ordinary. As soon as it was no longer restricted to that regime, it went out of alignment, not because of goal drift, but because of a capacity increase.
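That last point can be illustrated in the same toy setup, treating “capacity” simply as how far from ordinary states the agent can push the world (an invented stand-in, not ALBA’s notion of capability): hold the reward completely fixed and only enlarge the set of states the optimiser can choose from.

```python
def true_value(x: float) -> float:
    return -abs(x - 0.5)                             # what humans actually care about

def proxy_reward(x: float) -> float:
    return -abs(x - 0.5) if abs(x) <= 1.0 else x     # aligned only on ordinary states

def optimise(reward, capacity: float, step: float = 0.01) -> float:
    """Pick the reward-maximising state among those the agent can reach."""
    n = int(round(capacity / step))
    return max((i * step for i in range(-n, n + 1)), key=reward)

for capacity in (1.0, 2.0, 10.0):
    x_star = optimise(proxy_reward, capacity)
    print(f"capacity={capacity:5.1f}  chosen state={x_star:6.2f}  true value={true_value(x_star):7.2f}")

# capacity  1.0 -> chosen state  0.50, true value  0.00   (looks perfectly aligned)
# capacity 10.0 -> chosen state 10.00, true value -9.50   (same reward, now disastrous)
```

Nothing about the objective drifts between the first and last line of output; the only thing that changes is how much of the state space the agent can actually reach.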