Another dumb Alignment idea.
Any one crude heuristic will be Goodharted, but what about a pile of crude heuristics?
A bunch of humans have, say, 1 week in a box to write a crude heuristic for a human value function (bounded on [0, 1]).
Before they start, an AI is switched on, given a bunch of info, and asked to predict a probability distribution over what the humans will write.
Then an AI maximizes the expected value of that heuristic, averaged over the predicted distribution.
The humans in the box know the whole plan. They can do things like flip a quantum coin and use the outcome to decide which part of their value function to write down.
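To make the objective concrete, here is a minimal sketch. The heuristics, probabilities, and action space are invented for illustration, not part of the proposal: the point is just that the maximizer's score for an action is the probability-weighted average of whatever heuristics the predictor assigned mass to, and the quantum coin only makes that predicted distribution a genuine mixture rather than a point estimate.

```python
import numpy as np

# Hypothetical stand-ins for two heuristics the humans might write,
# each mapping an outcome x to a score in [0, 1].
def heuristic_a(x):
    return 1 / (1 + np.exp(-x))        # e.g. "more of quantity x is better"

def heuristic_b(x):
    return np.exp(-(x - 3.0) ** 2)     # e.g. "x should be close to 3"

# The predictor AI's (assumed) distribution over what the humans write.
predicted = [(0.6, heuristic_a), (0.4, heuristic_b)]

def objective(x):
    # Expected heuristic value under the predicted distribution.
    return sum(p * h(x) for p, h in predicted)

# The maximizer AI picks the action (here a scalar x) with the best averaged score.
candidates = np.linspace(-10, 10, 2001)
best = candidates[np.argmax([objective(x) for x in candidates])]
print(best, objective(best))
```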
Do all the mistakes cancel out? Is it too hard to Goodhart all the heuristics in a way that's still bad? Can we write down any small part of our utility function?