A comment on the IDA-AlphaGo Zero metaphor; capabilities versus alignment
Iterated Distillation and Amplification (IDA) has often been compared to AlphaGo Zero. In the case of a Go-playing AI, however, people usually don’t draw a distinction between alignment and capability, instead measuring success on a single axis: ability to win games of Go. But I think a useful analogy can still be drawn that separates the ways in which a Go-playing AI is aligned from the ways in which it is capable.
A well-aligned but not very capable Go-playing agent would be an agent that is best modeled as trying to win at Go, as opposed to trying to optimize some other feature of the sequence of board positions, but that still does not win against moderately skilled opponents. A very capable but poorly aligned Go-playing agent would be an agent that is very good at causing the Go games it plays to have certain properties other than the agent winning. One way to create such an agent would be to take a value network that rates board positions somewhat arbitrarily, and then use Monte Carlo tree search to find policies that cause the game to progress through board states that are rated highly by the value network.
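To illustrate how the search machinery and the objective come apart, here is a minimal sketch of my own, not anything from AlphaGo Zero: the same depth-limited search (standing in for Monte Carlo tree search) is handed two different value functions over a made-up game, one that tracks winning and one that rates a losing position highly. The game, the value functions, and every name below are invented for the example.

```python
# A minimal sketch, entirely made up for illustration (not AlphaGo Zero's actual
# machinery): the same search procedure optimizes whatever value function it is
# handed, so search power ("capability") and the objective it serves
# ("alignment") can vary independently.
from typing import Callable, List, Sequence, Tuple

State = int
Moves = Callable[[State], Sequence[State]]   # state -> legal successor states
Value = Callable[[State], float]             # stand-in for a value network

def best_line(state: State, moves: Moves, value: Value,
              depth: int) -> Tuple[float, List[State]]:
    """Exhaustive depth-limited search (a crude stand-in for MCTS) that steers
    the game toward whatever `value` rates highly."""
    options = moves(state)
    if depth == 0 or not options:
        return value(state), [state]
    results = [best_line(s, moves, value, depth - 1) for s in options]
    v, line = max(results, key=lambda r: r[0])
    return v, [state] + line

# Toy game: each move adds 1 or 2; the game ends at 10 or above.
# Landing exactly on 10 is a win; overshooting to 11 is a loss.
def moves(s: State) -> Sequence[State]:
    return [] if s >= 10 else [s + 1, s + 2]

aligned_value = lambda s: 1.0 if s == 10 else 0.0     # tracks winning
arbitrary_value = lambda s: 1.0 if s == 11 else 0.0   # rates a losing position highly

# Same search power, different objectives: the second agent very competently
# causes the game to end in a state that has nothing to do with winning.
print(best_line(0, moves, aligned_value, depth=6))    # ends the game at 10
print(best_line(0, moves, arbitrary_value, depth=6))  # ends the game at 11
```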
To a certain extent, this could actually happen to something like AlphaGo Zero. If some bad board positions are mistakenly assigned high values, AlphaGo Zero will use Monte Carlo tree search to find ways of reaching those board positions (and, similarly, ways of avoiding good board positions that are mistakenly assigned low values). But it will simultaneously be correcting those mistaken values, as it notices that its predictions of the values of the board positions that follow from them are biased. Thus the Monte Carlo tree search stage increases both capability and alignment. Meanwhile, the policy network retraining stage should be expected to decrease both capability and alignment, since it merely approximates the results of the Monte Carlo tree search. The fact that AlphaGo Zero works shows that the extent to which Monte Carlo tree search increases alignment and capability exceeds the extent to which policy network retraining decreases them.
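To see this error-correction dynamic in the simplest possible setting, here is a tabular toy of my own (again, not AlphaGo Zero itself): a one-ply greedy search plays the role of Monte Carlo tree search, and nudging a value table toward observed game outcomes plays the role of retraining the value network on self-play results. The search keeps routing games through a state that is mistakenly assigned a high value, and the losses those games produce are what expose and repair the mistake.

```python
# A tabular toy, invented for illustration: one-ply greedy search stands in for
# MCTS, and moving the value table toward observed game outcomes stands in for
# retraining the value network on self-play results.

# Toy game: start at 0, each move adds 2 or 3; landing exactly on 10 wins (1.0),
# overshooting loses (0.0). State 9 is genuinely bad: every move overshoots.
def successors(s):
    return [] if s >= 10 else [s + 2, s + 3]

def outcome(s):
    return 1.0 if s == 10 else 0.0

# Value table: terminal states get their true outcomes, non-terminal states a
# neutral prior -- except state 9, which is mistakenly assigned a high value.
V = {s: 0.5 for s in range(10)}
V.update({s: outcome(s) for s in range(10, 13)})
V[9] = 0.95  # the bad board position mistakenly assigned a high value

def play_one_game():
    """Greedy one-ply lookahead on V (the MCTS stand-in); returns the trajectory."""
    s, trajectory = 0, [0]
    while successors(s):
        s = max(successors(s), key=lambda nxt: V[nxt])
        trajectory.append(s)
    return trajectory

for game in range(5):
    trajectory = play_one_game()
    result = outcome(trajectory[-1])
    # At first the search steers the game into state 9; the resulting loss pulls
    # state 9's inflated value back down, and play recovers.
    for s in trajectory[:-1]:
        V[s] += 0.5 * (result - V[s])
    print(f"game {game}: {trajectory} -> result {result}, V[9] = {V[9]:.2f}")
```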
It seems to me that this is the model that Iterated Distillation and Amplification should be trying to emulate. The amplification stage will increase capability, by increasing the available computing power, and increase alignment, because of the human overseer. The distillation stage will decrease both capability and alignment by imperfectly approximating the amplified agent. The important thing is not to drive the alignment loss from distillation all the way down to zero (which may be impossible), but to ensure that the distillation stage does not cause alignment problems that will not be corrected in the subsequent amplification stage.
Correcting all problems in the very next amplification stage would be a nice property to have, but I think IDA can still work even if some errors are only corrected after multiple amplification/distillation steps (as long as all catastrophic errors are caught before deployment). For example, the agent might initially use some rules for solving math problems into which distillation has introduced a mistake, and only later in the IDA process learn how to rederive those rules and notice the mistake.
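To make the last two points concrete, here is a toy IDA loop, a sketch under my own simplifying assumptions rather than anything canonical: the questions are sums over spans of a fixed list, the overseer only knows how to read a single number and add two sub-answers, the agent is a lookup table, and distillation is a straight copy except that one round corrupts one entry. The corrupted answer is re-derived correctly at the next amplification step, although the error it caused one level up takes one more round to wash out, which is the multiple-steps-later correction described above.

```python
# A toy IDA loop, invented for illustration: the "questions" are sums over spans
# of a fixed list, the "overseer" only reads single numbers and adds two
# sub-answers, the agent is a lookup table, and distillation is copying with an
# occasional corrupted entry.

BASE = [3, 1, 4, 1, 5, 9, 2, 6]

def spans(i=0, j=len(BASE)):
    """All questions, i.e. the spans produced by recursively halving BASE."""
    yield (i, j)
    if j - i > 1:
        m = (i + j) // 2
        yield from spans(i, m)
        yield from spans(m, j)

def amplify(agent):
    """The overseer answers every question, delegating sub-questions to the
    current distilled agent (defaulting to 0 when the agent has no answer)."""
    answers = {}
    for (i, j) in spans():
        if j - i == 1:
            answers[(i, j)] = BASE[i]                 # overseer reads one number
        else:
            m = (i + j) // 2
            answers[(i, j)] = agent.get((i, m), 0) + agent.get((m, j), 0)
    return answers

def distill(answers, corrupt=None):
    """Imperfect approximation of the amplified agent: a straight copy, except
    that one entry may come out wrong."""
    distilled = dict(answers)
    if corrupt is not None:
        distilled[corrupt] += 100
    return distilled

agent = {}
for round_num in range(1, 8):
    amplified = amplify(agent)
    # Inject a distillation error into the answer for span (0, 4) at round 4;
    # it is re-derived correctly at round 5, and the wrong top-level answer it
    # caused is repaired at round 6.
    agent = distill(amplified, corrupt=(0, 4) if round_num == 4 else None)
    wrong = [q for q in spans() if agent[q] != sum(BASE[q[0]:q[1]])]
    print(f"round {round_num}: wrong answers at {wrong}")
```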