Here’s my short summary after reading the slides and scanning the paper.
Because human demonstrators are safe (in the sense of almost never taking catastrophic actions), a model that imitates the demonstrator closely enough should be safe. The algorithm in this paper does that by keeping multiple models of the demonstrator, sampling the top models according to a parameter, and following what the sampled model does (or querying the demonstrator if the sample is “empty”). The probability that this algorithm takes an action that would be very unlikely for the demonstrator can be bounded above, and driven down.
If this is broadly correct (and if not, please tell me what’s wrong), then I feel like this falls short of solving the inner alignment problem. I agree with most of the reasoning regarding imitation learning and its safety when it stays close enough to the demonstrator. But the big issue with imitation learning by itself is that it cannot do much better than the demonstrator. In the event that any other approach to AI can be superhuman, then imitation learning would be uncompetitive and there would be a massive incentive to ditch it.
Slide 8 actually points towards a way to use imitation learning to hopefully make a competitive AI: IDA. Yet in this case, I’m not sure that your result implies safety. For IDA isn’t a one-shot imitation learning problem; it’s many successive imitation learning problems. Even if you limit the drift for one step of imitation learning, the model could drift further and further at each distillation step. (If you think it wouldn’t, I’m interested in the argument.)
Sorry if this feels a bit rough. Honestly, the result looks exciting in the context of imitation learning, but I feel it is a very bad policy to present research as solving a major AI Alignment problem when it only does so in a very, very limited setting, one that doesn’t feel that relevant to the actual risk.
Almost no mathematical background is required to follow the proofs. We feel our bounds could be made much tighter, and we’d love help investigating that.
This is really misleading for anyone who isn’t used to online learning theory. I guess what you mean is that it doesn’t rely on more uncommon fields of maths like gauge theory or category theory, but you still use ideas like measures and martingales, which are far from trivial for someone with no mathematical background.
Slide 8 actually points towards a way to use imitation learning to hopefully make a competitive AI: IDA. Yet in this case, I’m not sure that your result implies safety. For IDA isn’t a one-shot imitation learning problem; it’s many successive imitation learning problems. Even if you limit the drift for one step of imitation learning, the model could drift further and further at each distillation step.
I don’t think this is a lethal problem. The setting is not one-shot, it’s imitation over some duration of time. IDA just increases the effective duration of time, so you only need to tune how cautious the learning is (which I think is controlled by α in this work) accordingly: there is a cost, but it’s bounded. You also need to deal with non-realizability (after enough amplifications the system is too complex for exact simulation, even if it wasn’t to begin with), but this should be doable using infra-Bayesianism (I already have some notion of how that would work). Another problem with imitation-based IDA is that external unaligned AI might leak into the system, either from the future or from counterfactual scenarios in which such an AI is instantiated. This is not an issue when amplifying by parallelism (like in the presentation), but that comes at the cost of requiring the task to be parallelizable.
I think this looks fine for IDA. The two remaining problems are the practical one of implementing Bayesian reasoning in a complicated world, and the philosophical one that IDA on human imitations probably doesn’t work because humans have bad safety properties.
In the event that any other approach to AI can be superhuman, then imitation learning would be uncompetitive
100% agree. Intelligent agency can be broken into intelligent prediction and intelligent planning. This work introduces a method for intelligent prediction that avoids an inner alignment failure. The original concern about inner alignment was that an idealized prediction algorithm (Bayesian reasoning) could be commandeered by mesa-optimizers. Idealized planning, on the other hand, is an expectimax tree, and I don’t think anyone has claimed mesa-optimizers could be introduced by a perfect planner. I’m not sure what it would even mean. There is nothing internal to the expectimax algorithm that could make the output something other than what the prediction algorithm would agree is the best plan. Expectimax, by definition, produces a policy perfectly aligned with the “goals” of the prediction algorithm.
Tl;dr: I think that in theory, the inner alignment problem regards prediction, not planning, so that’s the place to test solutions.
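To make the “nothing internal” point concrete, here is a toy expectimax sketch (purely illustrative; the interfaces and names are mine, not anything from the paper). The plan it returns is a pure function of the prediction algorithm and the utility function, so there is no separate place inside the planner where a mesa-objective could live.

```python
def expectimax(predict, utility, history, actions, depth):
    """Toy expectimax planner (illustrative sketch, not the paper's setup).

    `predict(history)` is the prediction algorithm: a dict mapping the next
    observation to its probability. `utility(history)` scores a completed
    history. The returned plan is a pure function of these two inputs --
    the planner itself has no objective of its own.
    """
    if depth == 0:
        return utility(history), None
    best_value, best_action = float("-inf"), None
    for action in actions:
        # Expected value of taking `action`, as judged by the prediction algorithm.
        value = sum(
            prob * expectimax(predict, utility, history + [(action, obs)],
                              actions, depth - 1)[0]
            for obs, prob in predict(history + [(action, None)]).items()
        )
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action
```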
If you want to see the inner alignment problem neutralized in the full RL setup, you can see we use a similar approach in this agent’s prediction subroutine. So you can maybe say that work solved the inner alignment problem. But we didn’t prove finite error bounds the way we have here, and I think the RL setup obscures the extent to which rogue predictive models are dismissed, so it’s a little harder to see than it is here.
sampling the top models according to a parameter, and following what the sampled model does
Not exactly. No models are ever sampled. The top models are collected, and they all can contribute to the estimated probabilities of actions. Then an action is sampled according to those probabilities, which sum to less than one, and the algorithm queries the demonstrator if the sample comes up empty.
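In rough pseudocode (an illustrative sketch only; the α-based selection of the top models and the minimum-style aggregation below are simplifications I’m making here, not the exact construction in the paper):

```python
import random

def imitator_step(models, posterior, history, actions, alpha=0.1):
    """Illustrative sketch of one step -- not the paper's exact algorithm.

    Each model maps a history to a distribution over actions; `posterior`
    gives each model's current weight. We keep a 'top' set of models (here,
    for illustration: every model whose weight is at least alpha times the
    largest weight), let them jointly produce conservative action
    probabilities (here: the minimum over the top set, so they sum to <= 1),
    sample from that sub-distribution, and hand the leftover probability
    mass to a demonstrator query.
    """
    top = [m for m in models
           if posterior[m] >= alpha * max(posterior.values())]

    # An action only gets the probability that every retained model is
    # willing to assign it, so the estimates sum to at most one.
    probs = {a: min(m(history)[a] for m in top) for a in actions}

    r = random.random()
    cumulative = 0.0
    for a in actions:
        cumulative += probs[a]
        if r < cumulative:
            return a                   # act autonomously
    return "QUERY_DEMONSTRATOR"        # leftover mass: defer to the demonstrator
```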
Even if you limit the drift for one step of imitation learning, the model could drift further and further at each distillation step.
Yes, that’s right. The bounds should chain together, I think, but they would definitely grow.
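For a rough sense of how they would grow: suppose each distillation round came with a guarantee that the probability of a catastrophic deviation in that round is at most ε, and suppose (this is an assumption about how the per-round guarantees compose, not something we’ve shown) that a union bound across rounds applies:

$$\Pr[\text{any catastrophic deviation across } n \text{ rounds}] \;\le\; n\,\varepsilon.$$

Then keeping the end-to-end probability below some budget δ would require tuning each round to roughly ε ≤ δ/n, so the bound survives chaining, but the per-round cautiousness (and hence the query burden) scales with the number of distillation steps.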
Thanks for sharing this work!
Edited to clarify. Thank you for this comment.