Yep. It arises in various forms:
Many people expect that it’s sane to discuss unbounded utility functions (which violate the continuity axiom, stated below for reference; see e.g. this post).
It’s not yet clear that VNM agents can be corrigible, and while the VNM axioms themselves seem pretty sane, there may be some hidden assumption that prevents us from easily designing agents that exhibit corrigible (or other desirable) behavior.
Because intelligent agents with preferences characterized by a utility function seem very prone to what Bostrom calls “perverse instantiations”, some expect that we might want to put effort into figuring out how to build a system that is very tightly constrained in some way such that asking about its preference ordering over universe-histories does not make sense.
Others variously point out that humans are incoherent, or that perfect coherence requires logical omniscience and is generally impractical.
Finally, there are a number of people who reason that if agents tend to converge on preferences characterizable by utility functions, then getting superintelligence right is really hard, and that would be bad, so it would be bad to “use utility functions.”
(Personally, I think the first objection is valid, the second is plausible, the third is dubious, the fourth is missing the point, and the fifth is motivated reasoning, but this is a can of worms that I was hoping to avoid in the research guide.)
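For reference, the continuity axiom at issue in the first objection is standardly stated as follows (this is the textbook formulation, not a quote from the linked post): for any lotteries with $A \succeq B \succeq C$, there is some $p \in [0,1]$ such that $B \sim pA + (1-p)C$. Equivalently, in its Archimedean form: if $A \succ B \succ C$, then there exist $p, q \in (0,1)$ with $pA + (1-p)C \succ B \succ qA + (1-q)C$; roughly, no outcome is so good or so bad that a small enough probability of it cannot be outweighed.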
I agree with you about 3, 4, and 5.
I don’t think that 1 is a valid objection, because if you have non-Archimedean preferences, then it is rational to forget about your “weak” preferences (preferences that can be reversed by mixing in arbitrarily small probabilities of something else) so that you can free up computational resources to optimize for your “strong” preferences, and if you do that, your remaining preferences are Archimedean. See the “Doing without Continuity” section of http://lesswrong.com/lw/fu1/why_you_must_maximize_expected_utility/.
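To make the “weak preference” notion concrete, here is a minimal sketch using the simplest non-Archimedean case, lexicographic preferences over two utility scales $u_1$ and $u_2$ (my toy rendering, not necessarily the exact construction in the linked post): say $L \succ M$ iff $\mathbb{E}_L[u_1] > \mathbb{E}_M[u_1]$, or $\mathbb{E}_L[u_1] = \mathbb{E}_M[u_1]$ and $\mathbb{E}_L[u_2] > \mathbb{E}_M[u_2]$. If $A \succ B$ holds only at the $u_2$ level (so $\mathbb{E}_A[u_1] = \mathbb{E}_B[u_1]$), then for any lottery $D$ with $\mathbb{E}_D[u_1] > \mathbb{E}_B[u_1]$ and every $\varepsilon > 0$ we get $(1-\varepsilon)B + \varepsilon D \succ A$, because the $u_1$ comparison already decides it: $(1-\varepsilon)\mathbb{E}_B[u_1] + \varepsilon\mathbb{E}_D[u_1] > \mathbb{E}_A[u_1]$. So the $u_2$-only preferences are “weak” in the sense above, reversible by arbitrarily small admixtures, and discarding them leaves preferences represented by $\mathbb{E}[u_1]$ alone, which satisfy continuity.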
I suppose it’s conceivable that 2 could be a reason to give up VNM-rationality, but I’m pretty skeptical. One possible alternative to addressing the corrigibility problem is to just get indirect normativity right in the first place, so you don’t need the programmer to be able to tamper with the value system later. Of course, you might not want to rely on that, but I suspect that if a solution to corrigibility involves abandoning VNM-rationality, you end up increasing the chance both that the value system isn’t what you wanted and that the corrigibility solution doesn’t work, since now the framework for the value system is probably an awful kludge instead of a utility function.
(By the way, I like the guide as a whole and I appreciate you putting it together. I feel kind of bad about my original comment consisting entirely of a minor quibble, but I have a habit of only commenting about things I disagree with.)
Thanks! FWIW, I don’t strongly expect that we’ll have to abandon VNM at this point, but I do still have a lot of uncertainty in the area.
I will be much more willing to discard objection 1 after resolving my confusion surrounding Pascal’s mugging, and I’ll be much more willing to discard objection 2 once I have a formalization of “corrigibility.” The fact that I still have some related confusions, combined with the fact that (in my experience) mathematical insights often reveal hidden assumptions in things I thought were watertight, combined with a dash of outside view, leads me to a decently high credence (~15%, with high anticipated variance) that VNM won’t cut it.
What’s the definition of preferences used in boundedly-rational agent research?