How does an AGI solve its own alignment problem?
For alignment to work, its theory should not only tell humans how to create an aligned super-human AGI, but also tell the AGI how to self-improve without destroying its own values. A good alignment theory should work across all intelligence levels. Otherwise, how does a paperclip optimizer that is only marginally smarter than a human make sure that its next iteration will still care about paperclips?
Excellent question! MIRI’s entire Vingean reflection paradigm is about the stability of goals under self-improvement and the design of successors.
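To make that concrete, here is a minimal sketch of the Löbian obstacle that Vingean reflection grapples with. The notation (a proof system T, a provability predicate \Box_T, and a predicate Safe(a)) is my own illustration of the standard setup, not quoted from MIRI's papers:

```latex
% Minimal sketch of the Löbian obstacle behind Vingean reflection.
% Notation (assumed for illustration): T is the agent's proof system,
% \Box_T(\varphi) means "T proves \varphi", and Safe(a) means
% "action a preserves the goal (e.g. keeps caring about paperclips)".

% Agent A1 only takes an action a when it can prove it safe:
%     T \vdash Safe(a)
% To delegate to a successor A2 that proves things in the same system T,
% A1 would want the unrestricted reflection schema
%     T \vdash \Box_T(Safe(a)) \rightarrow Safe(a)
% i.e. "if my successor proves the action is safe, it is safe".
% But Löb's theorem blocks this:
\[
  \text{if } T \vdash \Box_T(\varphi) \rightarrow \varphi
  \quad\text{then}\quad
  T \vdash \varphi ,
\]
% so a consistent T cannot endorse that schema for every sentence
% (taking \varphi = \bot would force T \vdash \bot). Vingean reflection
% asks how a self-improving agent can trust an equally strong or stronger
% successor without this kind of unrestricted self-trust.
```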
Just realized that the stability of goals under self-improvement is kinda similar to the stability of goals in mesa-optimizers, so the Vingean reflection paradigm and the mesa-optimization paradigm should fit together.