“If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.”
I’m not saying that this isn’t true, but it sounds like a potentially world-wrecking assumption that can’t be mathematically proved, so it seems worth questioning. Suppose you adopt the policy that the goal function of everyone in the AI alignment / transhumanism movement is basically similar, and that you will let such individuals get that sort of power. You have created a huge incentive for people with other goals to pretend to share that goal, work their way into the community, and then turn around and do something different.
If we treat blog posts as a good indicator of someone’s values, and use them to decide who gets such power, they stop being good indicators, because people will lie. If you don’t hand people that power just because they claim to have good values, then claimed values do remain an indicator of real values. Goodhart’s law in action.
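Here is a minimal toy simulation of that dynamic (my own sketch, not from the quoted discussion; the population sizes and fractions are made up for illustration). Once stated values are used to allocate power, bad actors start stating them too, and the signal degrades:

```python
import random

random.seed(0)

def population(n, liar_fraction):
    """Each person has true values and a public claim about them."""
    people = []
    for _ in range(n):
        true_good = random.random() < 0.5          # actually has "good" values
        liar = random.random() < liar_fraction     # willing to misrepresent
        claims_good = true_good or liar            # liars also claim good values
        people.append((true_good, claims_good))
    return people

def signal_quality(people):
    """P(truly good | claims good): how informative the claim is."""
    claimers = [p for p in people if p[1]]
    return sum(p[0] for p in claimers) / len(claimers)

# Before the policy exists, there is no incentive to lie: the claim is reliable.
print(signal_quality(population(10_000, liar_fraction=0.0)))  # ~1.0

# Once claims decide who gets the keys, pretenders show up and dilute the signal.
print(signal_quality(population(10_000, liar_fraction=0.5)))  # ~0.67
```

The point of the sketch is just that the measure stops being trustworthy precisely because it became the target.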
It’s even possible that no one can be trusted with that power. Suppose Fair Utopia has a utility of 99 to everyone, while person X being in charge has a utility of 100 to person X and 0 to everyone else. Then X, simply maximizing X’s own utility, picks their own rule over Fair Utopia, and everyone else loses 99 utility each.
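Spelled out as a toy calculation (using only the numbers from the paragraph above):

```python
# Utilities from the thought experiment above.
u_fair_utopia_to_X = 99   # Fair Utopia: 99 to everyone, X included
u_xtopia_to_X      = 100  # X in charge: 100 to X...
u_xtopia_to_others = 0    # ...and 0 to everyone else

# X, maximizing X's own utility, prefers X-topia (100 > 99),
# even though it costs everyone else 99 utility each.
assert u_xtopia_to_X > u_fair_utopia_to_X
print("loss per other person:", u_fair_utopia_to_X - u_xtopia_to_others)  # 99
```

A tiny selfish margin (1 utility) is enough to flip the choice, which is why “reasonably aligned” may not be reassuring at these stakes.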
I’m similarly concerned about loose talk of assessing the alignment of specific humans, given that there seem to be no generally agreed-upon precise criteria by which to assess alignment.