Two points:
The visualization of capabilities improvements as an attractor basin is pretty well accepted and useful, I think. I kind of like the analogous idea of an alignment target as a repeller cone / dome: the true target is approximately infinitely small, and attempts to hit it slide off as optimization pressure is applied. (A toy sketch of this repeller picture follows the second point.) I’m curious if others share this model and if it’s been refined / explored in more detail by others.
The sharpness of the left turn strikes me as a major crux. Some (most?) alignment proposals seem to rely on developing an AI just a bit smarter than humans but not yet dangerous. (An implicit assumption here may be that intelligence continues to develop in straight lines.) The sharp left turn model implies this sweet spot will pass by in the blink of an eye. (An implicit assumption here may be that there are discrete leaps.) It’s interesting to note that Nate explicitly says RSI (recursive self-improvement) is not a core part of his model. I’d like to see more arguments on both sides of this debate.
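To make the repeller picture in the first point concrete, here is a minimal toy sketch (my own illustration, nothing from Nate’s post; the field, step size, and starting points are arbitrary). The one-dimensional field dx/dt = x − x³ has an unstable fixed point (a repeller) at x = 0 and stable fixed points (attractors) at x = ±1. Read x = 0 as the tiny alignment target: a miss of any size gets amplified by further optimization steps, and the system slides into one of the surrounding basins.

```python
# Toy dynamical system (illustrative only): dx/dt = x - x^3 has a repeller
# at x = 0 and attractors at x = +1 and x = -1. Starting exactly on the
# repeller stays put; starting even slightly off slides into a basin.

def step(x, lr=0.05):
    # One discrete optimization step along the toy field.
    return x + lr * (x - x**3)

for start in (0.0, 0.001, -0.001, 0.9):
    x = start
    for _ in range(500):
        x = step(x)
    print(f"start={start:+.3f} -> settles near {x:+.3f}")
```

Hitting x = 0 exactly keeps you there, but a miss of 0.001 ends up at ±1; that is the sense in which the target is ‘approximately infinitely small’ while the surrounding basins are easy to fall into.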
Corrigibility is a repeller. Human values aren’t a repeller, but they’re a very narrow target to hit.
In the sense of moving a system towards many possible goals? But I think in a more appropriate space (where the aiming should take place) it’s again an attractor. Corrigibility is not a goal, a corrigible system doesn’t necessarily have any well-defined goals, and traditional goal-directed agents can’t be corrigible in a robust way. It should also be possible to use corrigibility itself to correct the system towards greater corrigibility (corrigibility towards corrigibility), making this aspect stronger if that’s what the operators work towards.
More generally, non-agentic aspects of behavior can systematically reinforce each other’s non-agentic character, preventing any opposing convergent drives (including the drive towards agency) from manifesting, if the aspects have been set up to do so. A sufficient intelligence/planning advantage pushes this past exploitability hazards, repelling selection theorems, even as some of the non-agentic behaviors might be about maintaining specific forms of exploitability.
As optimization pressure is applied, the AI becomes more capable. In particular, it will develop a more detailed model of people and their values. So it seems to me there is actually a basin around schemes like CEV, which course-correct towards true human values.
This of course doesn’t help with corrigibility.
Yes. :) A thing I considered including in my comment (but left out) is ‘capabilities are the Grand Canyon; alignment with human values is a teacup’.
In both cases, something can land partway in the basin and then roll the rest of the way down. But it’s easier to hit the target ‘anywhere inside the Grand Canyon’ than to hit the target ‘anywhere inside the teacup’.
(And, at risk of mixing the metaphors way too much: if something rapidly rolls down a slope in the Grand Canyon, and it’s carrying a teacup, the teacup is likely to break. I.e., insofar as your pre-left-turn system was aligned, the huge changes involved in rolling down the Grand Canyon are likely to break your guarantees. Human values are a narrow target to hit, and the sharp left turn is an extreme change that makes it hard to preserve fragile targets like that by default.)
Corrigibility is like… a mountain with an empty swimming pool at the top? If you can land in the pool, you’ll tend to stay there, and it’s easy to roll from the shallow end of the pool to the deep end. And the pool seems like a much easier target to hit than the teacup. But if you miss the pool, you’ll slide all the way down the mountain.
Also, the swimming pool is lined with explosives that are wired to blow up whenever you travel deeper into the Grand Canyon.
(OK, maybe some metaphors are not meant to be mixed...)
The best resource that I have found on why corrigibility is so hard is the Arbital post. Are there other good summaries that I should read?
Not an answer, but I think of “adversarial coherence” (the agent keeps optimizing for the same utility function even under perturbations by weaker optimizing processes, the way humans will fix errors when building a house, or AlphaZero will win a game of Go even when an opponent tries to disrupt its strategy) as a property that training processes could select for. Adversarial coherence and corrigibility are incompatible.
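For concreteness, here is a toy contrast (my own sketch, not anyone’s formal definition; the targets, nudge size, and step counts are made up). A weaker corrector nudges the agent’s state toward a corrected target once per round; an adversarially coherent agent then re-optimizes toward its original target, while a corrigible one adopts the correction.

```python
# Toy contrast between an adversarially coherent agent and a corrigible one
# (illustrative only; all targets and step sizes are arbitrary).

def run(keeps_original_goal, rounds=20, agent_steps=10):
    state, original_target, corrected_target = 0.0, 10.0, -10.0
    for _ in range(rounds):
        # A weaker corrector nudges the state slightly toward the corrected target.
        state += 0.5 if corrected_target > state else -0.5
        # The agent then optimizes: toward its original target if it is
        # adversarially coherent, toward the correction if it is corrigible.
        target = original_target if keeps_original_goal else corrected_target
        for _ in range(agent_steps):
            state += 0.2 * (target - state)
    return state

print("adversarially coherent agent ends near", round(run(True), 1))   # ~ +10
print("corrigible agent ends near            ", round(run(False), 1))  # ~ -10
```

The only difference is which target the inner optimization loop uses after each correction; with enough optimization steps per round, the coherent agent simply undoes the weaker corrector’s nudges, which is the tension with corrigibility.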