Hmmm, I don’t think that kind of thing is going to give correct world-saving incentives for the selfish part of people’s motivations (unless failing to save the world counts as destroying it, in which case almost everyone is going to get approximately no influence). More fundamentally, I don’t think it works out in this kind of case, due to logical uncertainty.
If I’m uncertain about a particular plan, and my estimate is {80% everyone dies; 20% I save the world}, that’s not {in 80% of timelines everyone dies; in 20% of timelines I save the world}.
It’s closer to [there’s an 80% chance that {in ~99% of timelines everyone dies}; there’s a 20% chance that {in ~99% of timelines I save the world}].
So, conditional on my saving the world in some timeline by taking some action, I saved the world in most timelines where I took that action, and would get a load of influence. This won’t disincentivize risky gambles for selfish/power-hungry people, at least not gambles of the form [let’s train this model and see what happens], where most of the danger comes from logical uncertainty.
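To make that concrete, here’s a toy calculation (a minimal sketch with illustrative numbers of my own, not anything from the original estimate): under the logical-uncertainty picture above, conditioning on having saved the world in this timeline pushes the expected fraction of saved timelines up toward ~99%, far above the naive 20%.

```python
# Toy numbers (my own, purely illustrative). Two hypotheses about the plan:
#   A: it almost certainly fails  -> ~1% of timelines get saved
#   B: it almost certainly works  -> ~99% of timelines get saved
p_hypothesis = {"A": 0.80, "B": 0.20}
frac_saved = {"A": 0.01, "B": 0.99}

# Unconditional expected fraction of saved timelines (close to the naive 20%).
unconditional = sum(p_hypothesis[h] * frac_saved[h] for h in p_hypothesis)

# Posterior over hypotheses, given that *this* timeline turned out saved.
posterior = {h: p_hypothesis[h] * frac_saved[h] / unconditional for h in p_hypothesis}

# Expected fraction of saved timelines, conditional on having saved this one.
conditional = sum(posterior[h] * frac_saved[h] for h in posterior)

print(f"E[fraction of timelines saved]              = {unconditional:.3f}")  # ~0.206
print(f"E[fraction saved | you saved this timeline] = {conditional:.3f}")    # ~0.952
```

So almost all the influence a selfish gambler expects to collect lives in the branch of logical uncertainty where the gamble works nearly everywhere, which is why outcome-based influence doesn’t penalize taking it.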
I think influence would need to be based on expected social value given the ‘correct’ level of logical uncertainty, probably something like [the (expected value | your action) that is justified by your beliefs, and by the valid arguments you’d make for them based on the information you have]. Or at least some subjective perspective seems to be necessary, and one that doesn’t give more points to overconfident people.
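For concreteness, a minimal sketch of the sort of scoring rule I mean (the names, numbers, and structure here are my own assumptions, not a worked-out mechanism): influence is allocated according to the expected value that was justified at decision time, rather than the realized outcome, so neither overconfidence nor a lucky outcome buys anything extra.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    # Probability a review judges was *justified* by the actor's information
    # and arguments at decision time (not the actor's stated confidence).
    p_good: float
    value_if_good: float
    value_if_bad: float

def justified_influence(a: Assessment) -> float:
    """Influence proportional to the justified expected value of the action."""
    return a.p_good * a.value_if_good + (1 - a.p_good) * a.value_if_bad

# A reckless gamble scores on its justified odds even if it happens to pay off;
# a careful plan with the same payoffs scores far better.
reckless = Assessment(p_good=0.2, value_if_good=100.0, value_if_bad=-100.0)
careful = Assessment(p_good=0.9, value_if_good=100.0, value_if_bad=-100.0)
print(justified_influence(reckless))  # -60.0
print(justified_influence(careful))   #  80.0
```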