michaelcohen comments on Formal Solution to the Inner Alignment Problem

michaelcohen 18 Feb 2021 19:39 UTC
LW: 2 AF: 2
This is really misleading...
Edited to clarify. Thank you for this comment.
In the event that any other approach to AI can be superhuman, then imitation learning would be uncompetitive
100% agree. Intelligent agency can be broken into intelligent prediction and intelligent planning. This work introduces a method for intelligent prediction that avoids an inner alignment failure. The original concern about inner alignment was that an idealized prediction algorithm (Bayesian reasoning) could be commandeered by mesa-optimizers. Idealized planning, on the other hand, is an expectimax tree, and I don’t think anyone has claimed mesa-optimizers could be introduced by a perfect planner. I’m not sure what it would even mean. There is nothing internal to the expectimax algorithm that could make the output something other than what the prediction algorithm would agree is the best plan. Expectimax, by definition, produces a policy perfectly aligned with the “goals” of the prediction algorithm.
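To make the expectimax point concrete, here is a minimal sketch of a finite-depth expectimax planner. The names `predict`, `reward`, and the toy two-action environment are all hypothetical, made up for illustration. The point is only that the planner has no state or learned component of its own: the plan it returns is fully determined by what the prediction function says.

```python
def expectimax(state, depth, actions, predict, reward):
    """Return (value, best_action) for a finite-depth expectimax tree.

    predict(state, action) -> dict mapping next_state to probability.
    reward(state, action, next_state) -> float.
    Both are supplied by the prediction side; the planner only averages
    over outcomes and maximizes over actions.
    """
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for action in actions:
        value = 0.0
        # Expectation over the outcomes the predictor considers possible.
        for next_state, prob in predict(state, action).items():
            future, _ = expectimax(next_state, depth - 1, actions, predict, reward)
            value += prob * (reward(state, action, next_state) + future)
        # Maximization over actions.
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


# Hypothetical toy environment: "safe" always pays 1, "risky" pays 3 half the time.
def toy_predict(state, action):
    if action == "safe":
        return {state: 1.0}
    return {state + 1: 0.5, state: 0.5}


def toy_reward(state, action, next_state):
    if action == "safe":
        return 1.0
    return 3.0 if next_state != state else 0.0


value, action = expectimax(0, 1, ["safe", "risky"], toy_predict, toy_reward)
```

If `toy_predict` were replaced by a different predictor, the chosen action could change, but only because the predictions changed. There is no place inside `expectimax` for a separate objective to live.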
Tl;dr: I think that in theory, the inner alignment problem regards prediction, not planning, so that’s the place to test solutions.
If you want to see the inner alignment problem neutralized in the full RL setup, you can see we use a similar approach in this agent’s prediction subroutine. So you can maybe say that work solved the inner alignment problem. But we didn’t prove finite error bounds the way we have here, and I think the RL setup obscures the extent to which rogue predictive models are dismissed, so it’s a little harder to see than it is here.
sampling the top models according to a parameter, and following what the sampled model does
Not exactly. No models are ever sampled. The top models are collected, and they can all contribute to the estimated probabilities of actions. Then an action is sampled according to those probabilities, which sum to less than one, and the imitator queries the demonstrator if the sample comes up empty.
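A rough sketch of that procedure, under loudly stated assumptions: I pick the top models as the `k` with the highest posterior weight, and I combine them by taking the minimum probability any top model assigns to each action. Both choices are placeholders for illustration, not a statement of the exact rule used in the paper. The minimum is just one simple way to get estimates that sum to less than one whenever the top models disagree. All function names here are hypothetical.

```python
import random


def top_models(posterior, k):
    """Collect the k highest-weight models (assumed selection rule)."""
    return sorted(posterior, key=posterior.get, reverse=True)[:k]


def action_estimates(model_predictions, actions):
    """Conservative per-action estimate: the minimum over the top models.

    model_predictions is a list of dicts mapping action -> probability.
    Where the models disagree, the estimates sum to less than one.
    """
    return {a: min(pred[a] for pred in model_predictions) for a in actions}


def act(estimates, rng, query_demonstrator):
    """Sample an action from the estimates; defer to the demonstrator otherwise."""
    u = rng.random()
    cumulative = 0.0
    for action, p in estimates.items():
        cumulative += p
        if u < cumulative:
            return action
    # The sample landed in the leftover probability mass: it "comes up empty".
    return query_demonstrator()


# Hypothetical example with three models, keeping the top two.
posterior = {"m1": 0.6, "m2": 0.3, "m3": 0.1}
predictions = {
    "m1": {"a": 0.7, "b": 0.3},
    "m2": {"a": 0.5, "b": 0.5},
    "m3": {"a": 0.0, "b": 1.0},
}
chosen = top_models(posterior, k=2)
estimates = action_estimates([predictions[m] for m in chosen], ["a", "b"])
result = act(estimates, random.Random(0), lambda: "ask demonstrator")
```

Note that no single model is ever selected and followed. Every top model constrains the estimate for every action, and the leftover mass (here 0.2) is the probability of querying the demonstrator.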
Even if you limit the drift for one step of imitation learning, the model could drift further and further at each distillation step.
Yes, that’s right. The bounds should chain together, I think, but they would definitely grow.