Human values aren’t a repeller, but they’re a very narrow target to hit.
As optimization pressure is applied, the AI becomes more capable. In particular, it will develop a more detailed model of people and their values. So it seems to me there is actually a basin around schemes like CEV, which course-correct towards true human values.
This of course doesn’t help with corrigibility.
Yes. :) A thing I considered including in my comment (but left out) is ‘capabilities are the Grand Canyon; alignment with human values is a teacup’.
In both cases, something can land partway in the basin and then roll the rest of the way down. But it’s easier to hit the target ‘anywhere inside the Grand Canyon’ than to hit the target ‘anywhere inside the teacup’.
(And, at risk of mixing the metaphors way too much: if something rapidly rolls down a slope in the Grand Canyon, and it’s carrying a teacup, the teacup is likely to break. I.e., insofar as your pre-left-turn system was aligned, the huge changes involved in rolling down the Grand Canyon are likely to break your guarantees. Human values are a narrow target to hit, and the sharp left turn is an extreme change that makes it hard to preserve fragile targets like that by default.)
Corrigibility is like… a mountain with an empty swimming pool at the top? If you can land in the pool, you’ll tend to stay there, and it’s easy to roll from the shallow end of the pool to the deep end. And the pool seems like a much easier target to hit than the teacup. But if you miss the pool, you’ll slide all the way down the mountain.
Also, the swimming pool is lined with explosives that are wired to blow up whenever you travel deeper into the Grand Canyon.
(OK, maybe some metaphors are not meant to be mixed...)
The best resource I have found on why corrigibility is so hard is the Arbital post. Are there other good summaries that I should read?
Not an answer, but I think of “adversarial coherence” (the agent keeps optimizing for the same utility function even under perturbations by weaker optimizing processes, like how humans will fix errors when building a house, or how AlphaZero can win a game of Go even when an opponent tries to disrupt its strategy) as a property that training processes could select for. Adversarial coherence and corrigibility are incompatible.
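(A minimal toy sketch of what I mean, purely my own illustration rather than anything from the comments above; the names `run` and `agent_is_coherent` are made up for this example. An “adversarially coherent” agent keeps steering back toward its original target after each perturbation by a weaker process, while a “corrigible” agent adopts whatever target that process last set.)

```python
# Toy illustration (assumed setup, not a real training process): compare an
# agent that undoes outside perturbations with one that accepts goal changes.
import random

def run(agent_is_coherent, steps=50, seed=0):
    rng = random.Random(seed)
    state, target = 0.0, 10.0          # original goal: drive state to 10
    for _ in range(steps):
        # A weaker optimizing process occasionally intervenes.
        if rng.random() < 0.3:
            if agent_is_coherent:
                state += rng.uniform(-3, 3)      # nudges the state; agent will undo it
            else:
                target = rng.uniform(-10, 10)    # corrigible: the new goal is accepted
        # The agent takes a small step toward whatever its current target is.
        state += 0.5 * (target - state)
    return state, target

print(run(agent_is_coherent=True))   # state ends near the original target of 10
print(run(agent_is_coherent=False))  # state tracks whichever goal was last imposed
```

In the coherent case the perturbations leave no lasting trace, which is exactly the property that makes the agent hard to correct; in the corrigible case the outside process can redirect it, but nothing in the dynamics keeps it pointed at its original target.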