What are the plans for solving the inner alignment problem?
Inner Alignment is the problem of ensuring mesa-optimizers (i.e. when a trained ML system is itself an optimizer) are aligned with the objective function of the training process.
Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?
As an example, evolution is an optimization force that itself ‘designed’ optimizers (humans) to achieve its goals. However, humans do not primarily maximize reproductive success; instead, they use birth control while still attaining the pleasure that evolution meant as a reward for attempts at reproduction. This is a failure of inner alignment.
What are the main proposals for solving the problem? Are there any posts/articles/papers specifically dedicated to addressing this issue?
Evolution was working within tight computational efficiency limits (the human brain burns roughly 1/6 of our total calories), and it used an evolutionary algorithm, which is significantly less efficient than a gradient-descent training scheme. We’re also now running the human brain well outside its training distribution (there were no condoms on the Savannah). Nevertheless, the human population is 8 billion and counting, and we dominate basically every terrestrial ecosystem on the planet. I think some people overplay how much inner alignment failure there is between human instincts and human genetic fitness.
So:
- Use a model large enough to learn what you’re trying to teach it.
- Use stochastic gradient descent.
- Ask your AI to monitor for inner alignment problems (we do know Doritos are bad for us).
- Retrain if you find yourself far enough outside your training distribution that inner alignment issues are becoming a problem (a rough sketch of one such drift check follows this list).
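To make that last retraining trigger concrete, here is a minimal sketch of one way you might flag distribution shift. The feature statistics, the 3-sigma threshold, and the toy data are illustrative assumptions, not a recommendation of any particular drift test.

```python
import numpy as np

def drift_score(train_features: np.ndarray, new_features: np.ndarray) -> float:
    """Largest per-feature mean shift, measured in training-set standard deviations."""
    mu = train_features.mean(axis=0)
    sigma = train_features.std(axis=0) + 1e-8  # avoid division by zero
    return float(np.abs((new_features.mean(axis=0) - mu) / sigma).max())

# Toy data: deployment inputs drawn from a clearly shifted distribution.
rng = np.random.default_rng(0)
train_feats = rng.normal(0.0, 1.0, size=(1000, 8))
deployed_feats = rng.normal(4.0, 1.0, size=(200, 8))

# Placeholder policy: treat a mean shift beyond 3 sigma as "far outside distribution".
if drift_score(train_feats, deployed_feats) > 3.0:
    print("Inputs look far outside the training distribution -- consider retraining.")
```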
Some combination of:
- Interpretability: Just check whether the AI is planning to do bad stuff, by learning how to inspect its internal representations (a toy sketch follows this list).
- Regularization: Evolution got humans who like Doritos more than health food, but evolution didn’t have gradient descent. Use regularization during training to penalize hidden reasoning (see the second sketch after this list).
- Shard / developmental prediction: Model-free RL will predictably use simple heuristics for the reward signal. If we can predict, and maybe control, how this happens, that gives us at least a tamer version of inner misalignment.
- Self-modeling: Make it so that the AI has an accurate model of whether it’s going to do bad stuff, then use that model to get the AI not to do it.
- Control: If inner misalignment is a problem when you use AIs off-distribution and give them unchecked power, then don’t do that.
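As a toy illustration of the interpretability item, the sketch below captures internal activations with a PyTorch forward hook. The tiny stand-in model and the untrained linear probe for “planning to do bad stuff” are illustrative assumptions, not any group’s actual method.

```python
import torch
import torch.nn as nn

# Tiny stand-in model; in practice you would hook layers of a real trained network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

activations = {}

def capture(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register a hook on the hidden layer, run a forward pass, and capture its activations.
handle = model[1].register_forward_hook(capture("hidden"))
_ = model(torch.randn(8, 16))
handle.remove()

# An (untrained, purely illustrative) linear probe over the captured activations is
# one simple way to ask "does this internal state look like planning to do bad stuff?"
probe = nn.Linear(32, 1)
bad_plan_scores = torch.sigmoid(probe(activations["hidden"]))
print(bad_plan_scores.shape)  # torch.Size([8, 1])
```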
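And for the regularization item, here is a minimal sketch of adding a penalty term to the training loss. The activation-magnitude penalty standing in for “hidden reasoning” and the penalty weight are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy model, data, and a single training step with an added penalty term.
model = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
inputs, targets = torch.randn(64, 16), torch.randn(64, 1)
reg_weight = 1e-3  # hypothetical trade-off between task performance and the penalty

hidden = model[:2](inputs)        # intermediate activations
outputs = model[2](hidden)        # task predictions
task_loss = F.mse_loss(outputs, targets)
penalty = hidden.pow(2).mean()    # stand-in proxy for "hidden reasoning"
loss = task_loss + reg_weight * penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The hard open problem, of course, is finding a penalty that actually tracks hidden reasoning rather than just shrinking activations.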
My personal ranking of impact would be regularization first, then AI control (at least for automated alignment schemes), with interpretability a distant third or fourth at best.
I’m pretty certain that we will do a lot better than evolution, but whether that’s good enough is an empirical question for us.
The aisafety.info group has collated some very helpful maps of “who is doing what” in AI safety, including this recent spreadsheet account of technical alignment actors and their problem domains / approaches as of 2024 [they also have an AI policy map, on the off chance you would be interested in that].
I expect “who is working on inner alignment?” to be a highly contested category boundary, so I would encourage you not to take my word for it. Look through the spreadsheet, and probably the collation post, yourself [the post contains short descriptions of what each group is working on, which may be clueful for your purposes], and make your own call as to who does and doesn’t [likely] meet your criteria.
But for my own part, it looks to me like the major actors currently working on inner alignment are the Alignment Research Center [ARC] and Orthogonal.
You probably can’t beat reading old MIRI papers and the Arbital AI alignment page. That material is “outdated”, but it hasn’t actually been definitively improved on.