A couple of thoughts:
I think that (3) does create strong incentives right now—at least for anyone who assumes [without any special prejudice given to existing people] amounts to [and it’s fine to disassemble everyone who currently exists if it’s the u/v/h/g/etc maximising policy]. This seems probable to me, though not entirely clear (I’m not an optimal configuration, and smoothly, consciousness-preservingly transitioning me to something optimal seems likely to take more resources than unceremoniously recycling me).
Incentives now include:
Prevent (3) happening.
To the extent that you expect (3) and are selfish, live for the pre-(3) time interval, for (3) will bring your doom.
On (4), “This solution obviously creates the best incentives for current agents” seems badly mistaken unless I’m misunderstanding you.
Something in this spirit would need to be based on a notion of [expected social value], not on actual contributions, since in the cases where we die we don’t get to award negative points.
For example, suppose my choice is between:
A: {90% chance doom for everyone; 10% I save the world}
B: {85% chance doom for everyone; 15% someone else saves the world}
To the extent that I’m selfish, and willing to risk some chance of death for greater control over the future, I’m going to pick A under (4).
The more selfish, reckless and power-hungry I am, and the more what I want deviates from what most people want, the more likely I am to actively put myself in position to take an A-like action.
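To make the selfish calculation concrete, here's a rough numeric sketch. The specific payoffs (a "world-saver" allocation of 1000 vs. a baseline survivor's allocation of 1) are illustrative assumptions, not part of the proposal:

```python
# Sketch: expected selfish payoff under a scheme that allocates resources
# in proportion to actual contribution. All numbers are illustrative.

P_SURVIVE_A = 0.10    # option A: 10% chance I save the world myself
P_SURVIVE_B = 0.15    # option B: 15% chance someone else saves the world
SAVER_SHARE = 1000.0  # assumed allocation for "saved the world"
BASELINE_SHARE = 1.0  # assumed allocation for an ordinary survivor
DOOM_VALUE = 0.0      # selfish value of the doom branch (dead either way)

ev_A = P_SURVIVE_A * SAVER_SHARE + (1 - P_SURVIVE_A) * DOOM_VALUE
ev_B = P_SURVIVE_B * BASELINE_SHARE + (1 - P_SURVIVE_B) * DOOM_VALUE

print(f"E[selfish payoff | A] = {ev_A}")  # ~100
print(f"E[selfish payoff | B] = {ev_B}")  # ~0.15
# A dominates for the selfish part of the agent, despite 5 percentage
# points more doom for everyone, as long as SAVER_SHARE >> BASELINE_SHARE.
```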
Moreover, if the aim is to get ideal incentives, it seems unavoidable to have symmetry and include punishments rather than only [you don’t get many resources]. Otherwise the incentive is to shoot for huge magnitude of impact, without worrying much about the sign, since no-one can do worse than zero resources.
If correct incentives were the only desideratum, I don’t see how we’d avoid [post-singularity ‘hell’ (with some probability) for those who’re reckless with AGI].
For any nicer approach I think we’d either be incenting huge impact with uncertain sign, or failing to incent large sacrifice in order to save the world.
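The magnitude-over-sign problem can be put numerically too. In the sketch below (illustrative numbers, with reward floored at zero as in the no-punishment scheme), a coin-flip plan with zero expected impact out-earns a reliably positive one:

```python
# Sketch: with reward = max(0, impact), the floor at zero pays for
# magnitude of impact rather than its sign. Illustrative numbers only.

def reward(impact: float) -> float:
    return max(0.0, impact)  # no punishments: the worst outcome is zero resources

# Plan X: coin flip between hugely good and hugely bad impact (E[impact] = 0)
ev_reward_X = 0.5 * reward(+100.0) + 0.5 * reward(-100.0)  # 50.0
# Plan Y: a reliable, modest positive contribution (E[impact] = +30)
ev_reward_Y = 1.0 * reward(+30.0)                          # 30.0

print(ev_reward_X, ev_reward_Y)
# The zero-floored scheme pays more for the reckless coin flip than for
# the dependable contribution.
```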
Perhaps the latter is best??
I.e. cap the max resources for any individual at a fairly low level, so that e.g. [this person was in the top percentile of helpfulness] and [this person saved the world] might get you about the same resource allocation.
It has the upsides both of making ‘hell’ less necessary, and of giving a lower incentive to overconfident people with high-impact schemes. (but still probably incents particularly selfish people to pick A over B)
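A minimal sketch of the capping idea; the cap value and the raw contribution scores are assumptions for illustration:

```python
# Sketch: a capped allocation, under which "top-percentile helpful" and
# "saved the world" end up with roughly the same resources.

CAP = 10.0  # assumed low ceiling on any individual's allocation

def allocation(raw_contribution_score: float) -> float:
    return min(CAP, max(0.0, raw_contribution_score))

print(allocation(12.0))    # top-percentile helper -> 10.0
print(allocation(1000.0))  # world-saver           -> 10.0
# Compressing the top of the scale weakens the pull toward
# grab-the-whole-future schemes, though a sufficiently selfish
# agent may still prefer A.
```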
(some very mild spoilers for yudkowsky’s planecrash glowfic (very mild as in this mostly does not talk about the story, but you could deduce things about where the story goes by the fact that characters in it are discussing this))
[edit: links in spoiler tags are bugged. in the spoiler, “speculates about” should link to here and “have the stance that” to here]
“The Negative stance is that everyone just needs to stop calculating how to pessimize anybody else’s utility function, ever, period. That’s a simple guideline for how realness can end up mostly concentrated inside of events that agents want, instead of mostly events that agents hate.”
“If at any point you’re calculating how to pessimize a utility function, you’re doing it wrong. If at any point you’re thinking about how much somebody might get hurt by something, for a purpose other than avoiding doing that, you’re doing it wrong.)”
i think this is a pretty solid principle. i’m very much not a fan of anyone’s utility function getting pessimized.
so pessimizing a utility function is a bad idea. but we can still produce correct incentive gradients in other ways! for example, we could say that every moral patient starts with 1 unit of utility function handshake, but if you destroy the world you lose some of your share. maybe if you take actions that cause ⅔ of timelines to die, you only get ⅓ of a unit of utility function handshake, and the more damage you do the less handshake you get.
it never goes into the negative, so we never go out of our way to pessimize someone's utility function; but it does get increasingly close to 0.
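a minimal sketch of that weighting, where the linear falloff and the clamp at zero are my assumed formalization of "the more damage you do the less handshake you get":

```python
# sketch of the handshake weighting: start at 1 unit, lose share in
# proportion to the fraction of timelines your actions kill, never
# going below zero. the linear falloff is an assumed formalization.

def handshake_share(fraction_of_timelines_killed: float) -> float:
    return max(0.0, 1.0 - fraction_of_timelines_killed)

print(handshake_share(0.0))    # harmless agent          -> 1.0
print(handshake_share(2 / 3))  # kills 2/3 of timelines  -> ~0.333
print(handshake_share(1.0))    # kills everything        -> 0.0, never negative
```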
(this isn’t necessarily a scheme i’m committed to, it’s just an idea i’ve had for a scheme that provides the correct incentives for not destroying the world, without having to create hells / pessimize utility functions)
Hmmm, I don’t think that kind of thing is going to give correct world-saving incentives for the selfish part of people (unless failing to save the world counts as destroying it—in which case almost everyone is going to get approximately no influence).
More fundamentally, I don’t think it works out in this kind of case due to logical uncertainty.
If I’m uncertain about a particular plan, and my estimate is {80% everyone dies; 20% I save the world}, that’s not {in 80% of timelines everyone dies; in 20% of timelines I save the world}.
It’s closer to [there’s an 80% chance that {in ~99% of timelines everyone dies}; there’s a 20% chance that {in ~99% of timelines I save the world}].
So, conditional on my saving the world in some timeline by taking some action, I saved the world in most timelines where I took that action and would get a load of influence. This won’t disincentivize risky gambles for selfish/power-hungry people. (at least of the form [let’s train this model and see what happens] - most of the danger there being a logical uncertainty thing)
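A rough sketch of why the timeline-fraction penalty barely bites here. The ~99% splits and the specific shares are illustrative assumptions:

```python
# Sketch: the same 80%/20% estimate read two ways. Numbers are illustrative.

def handshake_share(fraction_killed: float) -> float:
    return max(0.0, 1.0 - fraction_killed)

# Indexical reading: my action itself kills 80% of timelines,
# so I keep a 0.2 share in the timelines that survive.
share_indexical = handshake_share(0.80)      # ~0.2

# Logical reading: 80% credence the plan kills ~99% of timelines,
# 20% credence it saves ~99% of them.
share_if_plan_fails = handshake_share(0.99)  # ~0.01, but I'm dead there anyway
share_if_plan_works = handshake_share(0.01)  # 0.99

# A selfish agent cares about the share they hold in worlds where they
# survive; under the logical reading, surviving almost always means the
# plan worked, so the penalty does little to deter the gamble.
print(share_indexical, share_if_plan_fails, share_if_plan_works)
```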
I think influence would need to be based on expected social value given the ‘correct’ level of logical uncertainty—probably something like [what (expected value | your action) is justified by your beliefs, and valid arguments you’d make for them based on information you have].
Or at least some subjective perspective seems to be necessary—and something that doesn’t give more points for overconfident people.