I agree, this is only a proposal for a solution to the outer alignment problem.
On the optimizer’s curse, value-of-information, and risk-aversion aspects you mention, I think I agree that a sufficiently rational agent should already be thinking like that: any GAI that is somehow still treating the universe like a black-box multi-armed bandit isn’t going to live very long and should be fairly easy to defeat (hand it 1/epsilon opportunities to make a fatal mistake, all labeled with symbols it has never seen before).
Optimizing without allowing for the optimizer’s curse is also treating the universe like a multi-armed bandit, not even with probability epsilon of deliberate exploration: you’re effectively running a cheap all-exploration strategy over your utility-uncertainty estimates, which will have you sequentially pulling the handles on all of your overestimates until you discover the hard way that they were just overestimates. That is not rational behavior for a powerful optimizer, at least when a really bad outcome is possible, so not doing it should be convergent, and we shouldn’t build a near-human AI that is still making that mistake.
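To make the "pulling the handles on all your overestimates" failure mode concrete, here is a minimal simulation sketch of the optimizer’s curse (the option count, noise level, and Gaussian distributions are arbitrary assumptions, not anything from the post): naively picking whichever option has the highest noisy utility estimate systematically selects for overestimates, so the chosen option’s true value reliably falls short of its estimate.

```python
# Minimal sketch of the optimizer's curse: choosing the option with the highest
# *noisy* utility estimate systematically selects for overestimates.
# All numbers here (option count, noise level, trial count) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_options, noise_sd, n_trials = 20, 1.0, 10_000

gaps = []  # estimated utility of the chosen option minus its true utility
for _ in range(n_trials):
    true_utility = rng.normal(0.0, 1.0, n_options)                   # actual payoffs
    estimate = true_utility + rng.normal(0.0, noise_sd, n_options)   # noisy estimates
    chosen = np.argmax(estimate)                                     # naive "optimize the estimate"
    gaps.append(estimate[chosen] - true_utility[chosen])

# The mean gap comes out well above zero: the winning estimate is, on average, an overestimate.
print(f"mean overestimate of the chosen option: {np.mean(gaps):.3f}")
```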
Edit: I expanded this comment into a post, at: https://www.lesswrong.com/posts/ZqTQtEvBQhiGy6y7p/breaking-the-optimizer-s-curse-and-consequences-for-1