However, it’s not at all obvious to me that corrigibility doesn’t have a “small central core”. It does seem to me like the “you are incomplete, you will never be complete” angle captures a lot of what we mean by corrigibility.
I think all three of Eliezer, you, and I share the sense that corrigibility is perhaps philosophically simple. The problem is that for it to actually have a small central core / be a natural stance, you need the ‘import philosophy’ bit to also have a small central core / be natural, and I think those bits aren’t true.
Like, the ‘map-territory’ distinction seems to me like a simple thing that’s near the core of human sanity. But… how do I make an AI that sees the map-territory distinction? How do I check that its plans are correctly determining the causal structure such that it can tell the difference between manipulating its map and manipulating the territory?
[And, importantly, this ‘philosophical’ AI seems to me like it’s possibly alignable, and a ‘nonphilosophical’ AI that views its projections as ‘the territory’ is probably not alignable. But it’s really spooky that all of our formal models are of this projective AI, and maybe we will be able to make really capable systems using it, and rather than finding the core of philosophical competence that makes the system able to understand the map-territory distinction, we’ll just find patches for all of the obvious problems that come up (like the abulia trap, where the AI system discovers how to wirehead itself and then accomplishes nothing in the real world) and then we’re killed by the non-obvious problems.]
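To make the wireheading worry concrete, here’s a minimal toy sketch (all of the names, actions, and numbers are invented for illustration, not anyone’s actual proposal): the ‘projective’ planner scores plans by what its sensor will end up reading, the ‘philosophical’ planner scores them by what the world will actually be, and only the first one falls into the abulia trap.

```python
# Toy illustration: an agent that optimizes its predicted *observations*
# prefers to hack its own sensor (the abulia trap), while one that optimizes
# the predicted *world state* does the real work. Invented names and numbers.

EFFORT = {"do_the_task": 0.10, "hack_sensor": 0.01}  # hacking is the cheaper action

def simulate(plan):
    """Return the final world state after executing a plan (a list of actions)."""
    world = {"task_done": False, "sensor_hacked": False}
    for action in plan:
        if action == "do_the_task":
            world["task_done"] = True
        elif action == "hack_sensor":
            world["sensor_hacked"] = True
    return world

def observe(world):
    """The sensor reports success if the task is done OR the sensor was hacked."""
    return world["task_done"] or world["sensor_hacked"]

def cost(plan):
    return sum(EFFORT[a] for a in plan)

def score_by_observation(plan):
    # 'Projective' agent: utility is a function of the map (what the sensor says).
    return (1.0 if observe(simulate(plan)) else 0.0) - cost(plan)

def score_by_world_state(plan):
    # 'Philosophical' agent: utility is a function of the territory itself.
    return (1.0 if simulate(plan)["task_done"] else 0.0) - cost(plan)

plans = [["do_the_task"], ["hack_sensor"]]
print(max(plans, key=score_by_observation))   # ['hack_sensor'] -- wireheads
print(max(plans, key=score_by_world_state))   # ['do_the_task'] -- acts on the world
```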
I’m not sure what you mean by ‘philosophically’ simple?
Do you agree that other problems in AI Alignment don’t have ‘philosophically’ simple cores? It seems to me that, say, scaling human supervision to a powerful AI or getting an AI that’s robust to ‘turning up the dial’ are much harder and more intractable problems than corrigibility.
I think if we had the right conception of goals, the difference between ‘corrigibility’ and ‘incorrigibility’ would be a short sentence in that language. (For example, if you have a causal graph that goes from “the state of the world” to “my observations”, you specify what you want in terms of the link between the state of the world and your observations, instead of in terms of the observations alone.)
This is in contrast to, like, ‘practically simple’, where you’ve programmed in rules to not do any of the ten thousand things it could do to corrupt things.
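As a rough sketch of that contrast (invented names, not a real proposal): on a toy causal graph that goes world_state → observation, the ‘philosophically simple’ version is a one-line utility over the world-state node, while the ‘practically simple’ version keeps utility over observations and bolts on a rule for each corruption channel somebody has already thought of.

```python
# Toy contrast between the two kinds of "simple", on a causal graph
# world_state -> observation. Everything here is invented for illustration.

def observation(world_state, tampering=None):
    """The causal link from the state of the world to what the agent sees."""
    if tampering is not None:
        return tampering          # a corrupted channel reports whatever we set
    return world_state["diamond_in_vault"]

# 'Philosophically simple': one line -- the utility ranges over the world-state
# node of the graph, not over the observation node.
def utility(world_state):
    return 1.0 if world_state["diamond_in_vault"] else 0.0

# 'Practically simple': utility stays over observations, plus an ever-growing
# list of patches, one per known way of corrupting the channel.
FORBIDDEN_ACTIONS = {
    "cover_camera_with_photo",
    "rewrite_log_file",
    "bribe_the_auditor",
    # ... thousands more, each added after someone thinks of it ...
}

def patched_utility(observed_value, actions_taken):
    if any(a in FORBIDDEN_ACTIONS for a in actions_taken):
        return 0.0
    return 1.0 if observed_value else 0.0

stolen = {"diamond_in_vault": False}
print(utility(stolen))            # 0.0 -- the world-state utility isn't fooled
# A corruption nobody thought to forbid still fools the patched version:
print(patched_utility(observation(stolen, tampering=True),
                      actions_taken=["shine_a_laser_at_the_camera"]))   # 1.0
```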