“If someone creates “bad” AI we could measure that, and use the measurement for a counter program.”
(I’m just going to address this point in this comment.) The space of potential bad programs is vast—and the opposite of a disastrous values misalignment is almost always a different values misalignment, not alignment.
In two dimensions, think of a misaligned wheel; it’s very unlikely to be exactly 180 degrees (or even 90 degrees) away from proper alignment. Pointing the car in a relatively nice direction is better than pointing it straight at the highway divider wall—but even a slight misalignment will eventually lead to going off-road. And the worry is that we need a general solution before we allow the car to get to 55 MPH, much less 100+. But you argue that we can measure the misalignment. True! If we had a way to measure the angle between the wheel’s current alignment and the correct one, we could ignore the wheel’s absolute angle and simply minimize the misalignment. That means the measure of divergence implicitly contains the correct alignment.
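To make that concrete, here’s a minimal sketch (in Python; `misalignment`, `TARGET_ANGLE`, and the gradient-descent loop are all made up for illustration, not anyone’s actual proposal). The optimization loop never looks at the target directly, yet the target had to be baked into the divergence function before the loop could be written at all:

```python
TARGET_ANGLE = 0.0  # the "correct" alignment -- exactly the thing we lack in the real problem

def misalignment(angle):
    # The divergence measure. It can only be written down because we
    # already know TARGET_ANGLE.
    return (angle - TARGET_ANGLE) ** 2

def align(angle, steps=200, lr=0.1, eps=1e-4):
    # Numerically minimize the divergence. The loop never references
    # TARGET_ANGLE itself, only the misalignment score.
    for _ in range(steps):
        grad = (misalignment(angle + eps) - misalignment(angle - eps)) / (2 * eps)
        angle -= lr * grad
    return angle

print(align(37.0))  # converges to roughly 0.0, i.e. to TARGET_ANGLE
```

The “counter program” idea has the same shape: the minimization step is the easy part; correctly specifying the divergence is where all the difficulty lives.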
For an AI value function, the same is true. If we had a measure of misalignment, we could minimize it. The tricky part is that we don’t have such a metric, and any such metric that was actually correct would be implicitly equivalent to solving the original problem. Perhaps this is a fruitful avenue, since recasting the problem this way can help—and it’s similar to some of the approaches I’ve heard Dario Amodei mention regarding value alignment in machine learning systems. So it’s potentially a good insight, but insufficient on its own.