The arguments you outline are the sort of arguments that have been considered at CHAI and MIRI quite a bit (at least historically). The main issue I have with this sort of work is that it talks about how an agent should reason, whereas in my view the problem is that even if we knew how an agent should reason we wouldn’t know how to build an agent that efficiently implements that reasoning (particularly in the neural network paradigm). So I personally work more on the latter problem: supposing we know how we want the agent to reason, how do we get it to actually reason in that way.
On your actual proposals, talking just about “how the agent should reason” (and not how we actually get it to reason that way):
1) Yeah I really like this idea—it was the motivation for my work on inferring human preferences from the world state, which eventually turned into my dissertation. (The framing we used was that humans optimized the environment, but we also thought about the fact that humans were optimized to like the environment.) I still basically agree that this is a great way to learn about human preferences (particularly about what things humans prefer you not change), if somehow that ends up being the bottleneck.
2) I think you might be conflating a few different mechanisms here.
First, there’s the optimizer’s curse, where the estimated value of the action with the max estimated value will tend to be an overestimate of its actual value. As you note, one natural solution is to apply a correction based on an estimate of how large the overestimate is. For this to make a difference, your estimates of the overestimates have to differ across actions; I don’t have great ideas on how this should be done. (You mention having different standard deviations + different numbers of statistically-independent variables, but it’s not clear where those come from.)
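To make the effect concrete, here’s a minimal simulation sketch (the Gaussian noise model, the known per-action noise levels, and all the numbers are assumptions invented for illustration): the raw argmax is systematically overestimated, and shrinking each estimate toward the prior in proportion to its noise is one possible correction, though it only changes which action gets picked when the noise levels differ across actions.

```python
# Illustrative simulation of the optimizer's curse. Assumptions (all made up
# for illustration): Gaussian true values, Gaussian estimation noise with a
# known per-action standard deviation, and a known prior over true values.
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 20, 10_000

true_values = rng.normal(0.0, 1.0, size=n_actions)   # unknown true utilities
noise_sd = rng.uniform(0.5, 2.0, size=n_actions)     # per-action estimation noise

naive_gap, corrected_gap = [], []
for _ in range(n_trials):
    estimates = true_values + rng.normal(0.0, noise_sd)

    # Naive: act on the highest raw estimate.
    a = int(np.argmax(estimates))
    naive_gap.append(estimates[a] - true_values[a])

    # One possible correction: shrink each estimate toward the prior mean (0)
    # in proportion to its noise, then act on the shrunken estimates.
    prior_var = 1.0
    shrunk = estimates * prior_var / (prior_var + noise_sd ** 2)
    b = int(np.argmax(shrunk))
    corrected_gap.append(shrunk[b] - true_values[b])

print(f"mean overestimate of chosen action, naive argmax:    {np.mean(naive_gap):+.3f}")
print(f"mean overestimate of chosen action, shrunken argmax: {np.mean(corrected_gap):+.3f}")
```

The shrinkage here is just standard Bayesian updating with a Gaussian prior; the open question above is where the per-action noise estimates would come from in the first place.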
Second, there’s information value, where the agent should ask about utilities in states that it is uncertain about, rather than charging in blindly. You seem to be thinking of this as something we have to program into the AI system, but it actually emerges naturally from reward uncertainty by itself. See this paper for more details and examples—Appendix D also talks about the connection to impact regularization.
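As a toy illustration of that point (the +1/-1/0 payoffs, the query cost, and the two-outcome setup are all invented here, not the model from the paper): an expected-utility maximizer that is uncertain about the reward already prefers to ask before acting whenever the information is worth more than the query cost, with no extra machinery programmed in.

```python
# Toy illustration: information value emerging from plain expected-utility
# maximization under reward uncertainty. Everything here (payoffs of +1/-1/0,
# the query cost, the two-outcome setup) is invented; it is not the model from
# the paper referenced above.

def eu_act_now(p_a_good: float) -> float:
    """Best immediate action: do A if it is more likely good than bad, else do B (worth 0)."""
    eu_a = p_a_good * 1.0 + (1.0 - p_a_good) * (-1.0)
    return max(eu_a, 0.0)

def eu_ask_first(p_a_good: float, query_cost: float) -> float:
    """Ask which outcome the human prefers, then act optimally in each branch."""
    eu_if_good = 1.0   # learn A is good, so do A
    eu_if_bad = 0.0    # learn A is bad, so do B
    return p_a_good * eu_if_good + (1.0 - p_a_good) * eu_if_bad - query_cost

for p in (0.5, 0.7, 0.9, 0.99):
    act, ask = eu_act_now(p), eu_ask_first(p, query_cost=0.05)
    better = "ask first" if ask > act else "act now"
    print(f"P(A is good) = {p:.2f}   act now: {act:+.3f}   ask first: {ask:+.3f}   -> {better}")
```

At high enough confidence the same calculation says to just act, which is also the behavior you’d want.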
Third, there’s risk aversion, where you explicitly program the AI system to be conservative (instead of maximizing expected utility). I tend to think that in principle this shouldn’t be necessary and you can get the same benefits from other mechanisms, but maybe we’d want to do it anyway for safety margins. I don’t think it’s necessary for any of the other claims you’re making, except perhaps quantilization (but I don’t really see how any of these mechanisms lead to acting like a quantilizer except in a loose sense).
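For concreteness, here is a rough sketch of a quantilizer (the base distribution over actions, the noisy utility estimate, and the quantile q are all made up for illustration): instead of taking the argmax of the estimated utility, sample uniformly from the top q fraction of candidate actions, which limits how hard you optimize on a possibly-misspecified estimate.

```python
# Rough sketch of a quantilizer, for concreteness. Assumptions (all invented):
# candidate actions sampled from some base distribution, a noisy utility
# estimate, and a quantile q. Instead of taking the argmax of the estimate,
# sample uniformly from the top q fraction of candidates under the estimate.
import numpy as np

rng = np.random.default_rng(1)

def quantilize(candidates, estimate_utility, q=0.05):
    """Pick uniformly at random among the top q-fraction of candidates by estimated utility."""
    utilities = np.array([estimate_utility(a) for a in candidates])
    cutoff = np.quantile(utilities, 1.0 - q)
    top = [a for a, u in zip(candidates, utilities) if u >= cutoff]
    return top[rng.integers(len(top))]

# Made-up example: actions are real numbers from a base distribution, the true
# utility peaks at 0.5, and the estimate is the true utility plus noise.
candidates = rng.normal(0.0, 1.0, size=1000)
true_utility = lambda a: -abs(a - 0.5)
noisy_estimate = lambda a: true_utility(a) + rng.normal(0.0, 0.5)

greedy = candidates[np.argmax([noisy_estimate(a) for a in candidates])]
mild = quantilize(candidates, noisy_estimate, q=0.05)
print(f"greedy argmax choice: {greedy:+.3f} (true utility {true_utility(greedy):+.3f})")
print(f"quantilized choice:   {mild:+.3f} (true utility {true_utility(mild):+.3f})")
```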
Nice comment! I agree: this is only a proposal for a solution to the outer alignment problem.
On the optimizer’s curse, information value, and risk aversion aspects you mention, I think I agree that a sufficiently rational agent should already be thinking like that: any GAI that is somehow still treating the universe like a black-box multi-armed bandit isn’t going to live very long and should be fairly easy to defeat (hand it 1/epsilon opportunities to make a fatal mistake, all labeled with symbols it has never seen before).
Optimizing without allowing for the optimizer’s curse is also treating the universe like a multi-armed bandit, and not even one with only an epsilon probability of exploring: you’re running a cheap all-exploration strategy on your utility uncertainty estimates, which will cause you to sequentially pull the handles on all your overestimated options until you discover the hard way that they were all just overestimates. This is not rational behavior for a powerful optimizer, at least when a really bad outcome is possible, so avoiding it should be convergent, and we shouldn’t build a near-human AI that is still making that mistake.
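As a toy one-shot version of this (nine ordinary options, one catastrophic option with a wildly uncertain estimate, all numbers invented): taking utility estimates at face value selects the catastrophic option a noticeable fraction of the time, while shrinking estimates by their uncertainty almost never does. Repeat the face-value version each time an overestimate gets exposed and you get exactly the sequential handle-pulling behavior described above.

```python
# Toy one-shot version of the above (all numbers invented): nine ordinary
# options plus one catastrophic option whose utility estimate is wildly
# uncertain. Taking estimates at face value picks the catastrophe whenever its
# noise happens to inflate it above everything else; shrinking estimates by
# their uncertainty almost never does.
import numpy as np

rng = np.random.default_rng(2)
n_trials = 10_000
naive_disasters = corrected_disasters = 0

for _ in range(n_trials):
    true_value = np.append(rng.normal(0.0, 1.0, size=9), -50.0)  # option 9 is fatal
    noise_sd = np.append(np.full(9, 1.0), 60.0)                  # and very uncertain
    estimate = true_value + rng.normal(0.0, noise_sd)

    naive_choice = int(np.argmax(estimate))
    shrunk = estimate / (1.0 + noise_sd ** 2)   # shrink toward the prior mean of 0
    corrected_choice = int(np.argmax(shrunk))

    naive_disasters += naive_choice == 9
    corrected_disasters += corrected_choice == 9

print(f"fatal-choice rate, face-value argmax: {naive_disasters / n_trials:.3f}")
print(f"fatal-choice rate, shrunken argmax:   {corrected_disasters / n_trials:.3f}")
```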
Edit: I expanded this comment into a post, at: https://www.lesswrong.com/posts/ZqTQtEvBQhiGy6y7p/breaking-the-optimizer-s-curse-and-consequences-for-1