If correct incentives were the only desideratum, I don’t see how we’d avoid [post-singularity ‘hell’ (with some probability) for those who’re reckless with AGI].
(some very mild spoilers for yudkowsky’s planecrash glowfic (very mild as in this mostly does not talk about the story, but you could deduce things about where the story goes by the fact that characters in it are discussing this))
“The Negative stance is that everyone just needs to stop calculating how to pessimize anybody else’s utility function, ever, period. That’s a simple guideline for how realness can end up mostly concentrated inside of events that agents want, instead of mostly events that agents hate.”
“If at any point you’re calculating how to pessimize a utility function, you’re doing it wrong. If at any point you’re thinking about how much somebody might get hurt by something, for a purpose other than avoiding doing that, you’re doing it wrong.”
i think this is a pretty solid principle. i’m very much not a fan of anyone’s utility function getting pessimized.
so pessimizing a utility function is a bad idea. but we can still produce correct incentive gradients in other ways! for example, we could say that every moral patient starts with 1 unit of utility function handshake, but if you destroy the world you lose some of your share. maybe if you take actions that cause everyone to die in ⅔ of timelines, you only get ⅓ of a unit of utility function handshake, and the more damage you do the less handshake you get.
your share never goes negative, so we never go out of our way to pessimize anyone’s utility function; but it does get arbitrarily close to 0 as the damage grows.
(this isn’t a scheme i’m committed to; it’s just one idea for providing the correct incentives for not destroying the world, without having to create hells / pessimize utility functions)
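(a minimal sketch of this scheme, assuming the hypothetical rule “your handshake weight is the fraction of timelines that survive your actions, floored at 0”; the function name and numbers are just my illustration:)

```python
def handshake_weight(surviving_fraction: float) -> float:
    """hypothetical rule: your share of the utility function handshake
    equals the fraction of timelines that survive your actions,
    clamped to [0, 1] so it never goes negative (no one's utility
    function ever gets pessimized, the share just shrinks toward 0)."""
    return max(0.0, min(1.0, surviving_fraction))

# actions that kill everyone in 2/3 of timelines leave you 1/3 of a unit
assert abs(handshake_weight(1 / 3) - 1 / 3) < 1e-9
# the share bottoms out at 0 rather than becoming a punishment
assert handshake_weight(0.0) == 0.0
```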
Hmmm, I don’t think that kind of thing is going to give correct world-saving incentives for the selfish part of people (unless failing to save the world counts as destroying it—in which case almost everyone is going to get approximately no influence). More fundamentally, I don’t think it works out in this kind of case due to logical uncertainty.
If I’m uncertain about a particular plan, and my estimate is {80% everyone dies; 20% I save the world}, that’s not {in 80% of timelines everyone dies; in 20% of timelines I save the world}.
It’s closer to [there’s an 80% chance that {in ~99% of timelines everyone dies}; there’s a 20% chance that {in ~99% of timelines I save the world}].
So, conditional on some action of mine saving the world in some timeline, it saves the world in most timelines where I take it, and I’d get a load of influence. This won’t disincentivize risky gambles for selfish/power-hungry people (at least gambles of the form [let’s train this model and see what happens], where most of the danger comes from logical uncertainty).
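To make that concrete, here’s a toy comparison with made-up numbers, reusing the hypothetical weight rule from the sketch above; the two readings give very different shares in the branch a selfish gambler actually cares about:

```python
def handshake_weight(surviving_fraction: float) -> float:
    # same hypothetical rule as in the sketch above
    return max(0.0, min(1.0, surviving_fraction))

# Reading 1: "80% everyone dies; 20% I save the world" as a split over timelines.
# Then ~20% of timelines survive regardless, and the gambler's share is ~0.2.
share_if_split_over_timelines = handshake_weight(0.2)

# Reading 2 (closer to [train this model and see what happens]): the split is over
# a logical fact, so either the plan works in ~99% of timelines or it fails in ~99%.
share_if_plan_works = handshake_weight(0.99)  # ~1.0: almost no penalty
share_if_plan_fails = handshake_weight(0.01)  # ~0.0: but nobody is around to care

# A selfish gambler mostly weighs the worlds where they survive to spend their
# influence, i.e. the branch where the plan works, and there the penalty is tiny.
print(share_if_split_over_timelines, share_if_plan_works, share_if_plan_fails)
```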
I think influence would need to be based on expected social value given the ‘correct’ level of logical uncertainty—probably something like [what (expected value | your action) is justified by your beliefs, and valid arguments you’d make for them based on information you have]. Or at least some subjective perspective seems to be necessary—and something that doesn’t give more points for overconfident people.
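A rough sketch of that alternative (a toy formulation under the assumptions above, not a worked-out proposal): score the action by the expected surviving-timeline fraction under the probabilities your evidence actually justified, rather than by what happened in the branch where the gamble came off. The reckless 80/20 gamble then scores about 0.2 however the logical coin lands, and overconfidence buys nothing, since it’s the justified probabilities rather than the stated ones that enter the expectation:

```python
def justified_weight(justified_beliefs: dict[float, float]) -> float:
    """Hypothetical rule: influence = expected surviving-timeline fraction
    under the probabilities the actor's information and valid arguments
    actually justified, not under the outcome they happened to get.
    Keys are surviving-timeline fractions, values are justified probabilities."""
    return max(0.0, sum(p * surviving for surviving, p in justified_beliefs.items()))

# The reckless gamble: the justified belief was 80% "nearly all timelines die",
# 20% "nearly all timelines survive". The score is the same in either branch.
print(justified_weight({0.01: 0.8, 0.99: 0.2}))  # ~0.21
```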