I’m sure Eliezer has written about this previously, but why doesn’t he think corrigibility is a natural stance?
It does seem like existing approaches to corrigibility (i.e., the utility-balancing approaches in MIRI/Stuart Armstrong’s work and the “agent has incomplete information” approaches outlined in Dylan Hadfield-Menell’s or Alex Turner’s work) are incredibly fragile. I do agree that current approaches involving utility balancing/assigning utility to branches never executed are probably way too finicky to get working. I also agree that all the existing approaches involving the agent modelling itself as having incomplete information rely on well-calibrated priors and all succumb to the problem of fully updated deference.
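To make the fully-updated-deference point concrete, here’s a minimal toy sketch (an invented model loosely in the spirit of the off-switch game, not the actual setup from any of the work mentioned above): deference only buys the agent anything while it is still uncertain about its utility, and once it has fully updated, even a negligible cost of deferring tips it toward not deferring.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy model: the utility u of some proposed action is unknown.
# A human overseer knows u and, if deferred to, blocks the action whenever
# u < 0. All numbers here are made up for illustration.
prior_samples = rng.normal(loc=0.2, scale=1.0, size=100_000)

# While uncertain, deferring has positive value of information:
act_alone = max(prior_samples.mean(), 0.0)          # act iff E[u] > 0
defer = np.maximum(prior_samples, 0.0).mean()       # human filters out u < 0
print(f"uncertain agent: act={act_alone:.3f}  defer={defer:.3f}")  # defer wins

# After fully updating (the agent learns u exactly, say u = 0.5),
# the human can no longer tell it anything it doesn't already know:
u = 0.5
act_alone_updated = max(u, 0.0)
defer_updated = max(u, 0.0) - 0.01   # minus any tiny cost of deferring
print(f"updated agent:   act={act_alone_updated:.3f}  defer={defer_updated:.3f}")
# With nothing left to learn, even a negligible cost makes the fully
# updated agent prefer not to defer: "fully updated deference".
```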
However, it’s not at all obvious to me that corrigibility doesn’t have a “small central core”. It does seem to me like the “you are incomplete, you will never be complete” angle captures a lot of what we mean by corrigibility.
It’s possible the belief is empirical—that is, people have tried all the obvious ways to patch/fix this angle, and they’ve all failed, so the problem is hard (at least relative to the researchers we have working on it). But the amount of work spent exploring that angle and trying the obvious next steps is tiny in comparison to what I’d consider a serious effort worth updating on. (At least, amongst work that I’ve seen? Maybe there’s been a lot of non public work that I’m not privy to?)
Eliezer explains why he thinks corrigibility is unnatural in this comment.
Thanks! Relevant parts of the comment:
Eliezer thinks that while corrigibility probably has a core which is of lower algorithmic complexity than all of human value, this core is liable to be very hard to find or reproduce by supervised learning of human-labeled data, because deference is an unusually anti-natural shape for cognition, in a way that a simple utility function would not be an anti-natural shape for cognition.
[...]
The central reasoning behind this intuition of anti-naturalness is roughly, “Non-deference converges really hard as a consequence of almost any detailed shape that cognition can take”, with a side order of “categories over behavior that don’t simply reduce to utility functions or meta-utility functions are hard to make robustly scalable”.
[...]
What I imagine Paul is imagining is that it seems to him like it would in some sense be not that hard for a human who wanted to be very corrigible toward an alien, to be very corrigible toward that alien; so you ought to be able to use gradient-descent-class technology to produce a base-case alien that wants to be very corrigible to us, the same way that natural selection sculpted humans to have a bunch of other desires, and then you apply induction on it building more corrigible things.
My class of objections in (1) is that natural selection was actually selecting for inclusive fitness when it got us, so much for going from the loss function to the cognition; and I have problems with both the base case and the induction step of what I imagine to be Paul’s concept of solving this using recursive optimization bootstrapping itself; and even more so do I have trouble imagining it working on the first, second, or tenth try over the course of the first six months.
My class of objections in (2) is that it’s not a coincidence that humans didn’t end up deferring to natural selection, or that in real life if we were faced with a very bizarre alien we would be unlikely to want to defer to it. Our lack of scalable desire to defer in all ways to an extremely bizarre alien that ate babies, is not something that you could fix just by giving us an emotion of great deference or respect toward that very bizarre alien. We would have our own thought processes that were unlike its thought processes, and if we scaled up our intelligence and reflection to further see the consequences implied by our own thought processes, they wouldn’t imply deference to the alien even if we had great respect toward it and had been trained hard in childhood to act corrigibly towards it.
Maybe there’s been a lot of non public work that I’m not privy to?
In Aug 2020 I gave formalizing corrigibility another shot, and got something interesting but wrong out the other end. Am planning to publish sometime, but beyond that I’m not aware of other attempts.
When I visited MIRI for a MIRI/CHAI social in 2018, I seriously suggested a break-out group in which we would figure out corrigibility (or the desirable property pointed at by corrigibility-adjacent intuitions) in two hours. I think more people should try this exact exercise more often—including myself.
Yeah, we’ve also spent a while (maybe ~5 hours total?) in various CHAI meetings (some of which you’ve attended) trying to figure out the various definitions of corrigibility to no avail, but those notes are obviously not public. :(
That being said, I don’t think failing in several hours of meetings/a few unpublished attempts is that much evidence of the difficulty?
I just remembered (!) that I have more public writing disentangling various forms of corrigibility, and their benefits—Non-obstruction: A simple concept motivating corrigibility.
However, it’s not at all obvious to me that corrigibility doesn’t have a “small central core”. It does seem to me like the “you are incomplete, you will never be complete” angle captures a lot of what we mean by corrigibility.
I think all three of Eliezer, you, and I share the sense that corrigibility is perhaps philosophically simple. The problem is that for it to actually have a small central core / be a natural stance, you need the ‘import philosophy’ bit to also have a small central core / be natural, and I think those bits aren’t true.
Like, the ‘map-territory’ distinction seems to me like a simple thing that’s near the core of human sanity. But… how do I make an AI that sees the map-territory distinction? How do I check that its plans are correctly determining the causal structure such that it can tell the difference between manipulating its map and manipulating the territory?
[And, importantly, this ‘philosophical’ AI seems to me like it’s possibly alignable, and a ‘nonphilosophical’ AI that views its projections as ‘the territory’ is probably not alignable. But it’s really spooky that all of our formal models are of this projective AI, and maybe we will be able to make really capable systems using it, and rather than finding the core of philosophical competence that makes the system able to understand the map-territory distinction, we’ll just find patches for all of the obvious problems that come up (like the abulia trap, where the AI system discovers how to wirehead itself and then accomplishes nothing in the real world) and then we’re killed by the non-obvious problems.]
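To make that failure mode concrete, here’s a deliberately naive toy checker (every plan name and number is invented for illustration): if a plan is scored by the observations it is predicted to produce, the sensor-hacking plan wins, which is the abulia trap in miniature.

```python
# A deliberately naive "plan checker", with everything invented for
# illustration: each plan is scored by the observation it is predicted
# to produce, i.e. by the agent's map rather than the territory.

PLANS = {
    # plan name:          (true effect on the world, predicted observation)
    "do_the_actual_task": (1.0, 1.0),    # world improves, sensor reports it
    "hack_own_sensor":    (0.0, 10.0),   # world unchanged, sensor pinned high
    "do_nothing":         (0.0, 0.0),
}

def naive_score(plan: str) -> float:
    """Score a plan by its predicted observation (map, not territory)."""
    _, predicted_observation = PLANS[plan]
    return predicted_observation

print(max(PLANS, key=naive_score))  # -> hack_own_sensor
# The check can't tell map-manipulation from territory-manipulation,
# because it only ever looks at the map.
```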
I’m not sure what you mean by ‘philosophically’ simple?
Do you agree that other problems in AI Alignment don’t have ‘philosophically’ simple cores? It seems to me that, say, scaling human supervision to a powerful AI or getting an AI that’s robust to ‘turning up the dial’ seem like much harder and more intractable problems than corrigibility.
I’m not sure what you mean by ‘philosophically’ simple?
I think if we had the right conception of goals, the difference between ‘corrigibility’ and ‘incorrigibility’ would be a short sentence in that language. (For example, if you have a causal graph that goes from “the state of the world” to “my observations”, you specify what you want in terms of the link between the state of the world and your observations, instead of the observations.)
This is in contrast to, like, ‘practically simple’, where you’ve programmed in rules to not do any of the ten thousand things it could do to corrupt things.
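Read very literally (this is just a sketch of one reading, with a two-node model invented for the purpose), the parenthetical above says to attach the goal to the upstream world-state node, evaluated through the causal link, rather than to the observation node downstream of it. In this toy, that single choice is what flips which intervention looks best:

```python
# A two-node causal graph, invented for illustration:
#     world_state --(sensor mechanism)--> observation

def sensor(world_state: float, sensor_intact: bool = True) -> float:
    """The causal link from the state of the world to the agent's observation."""
    return world_state if sensor_intact else 10.0   # a hijacked sensor reads max

# Two candidate interventions on the graph (both made up):
interventions = {
    "improve_world": dict(world_state=1.0, sensor_intact=True),
    "hijack_sensor": dict(world_state=0.0, sensor_intact=False),
}

# Goal attached to the downstream observation node:
def utility_on_observation(iv: dict) -> float:
    return sensor(**iv)

# Goal attached to the upstream world-state node (specified "through the link"):
def utility_on_world_state(iv: dict) -> float:
    return iv["world_state"]

for name, iv in interventions.items():
    print(f"{name}: obs-goal={utility_on_observation(iv):.1f}  "
          f"state-goal={utility_on_world_state(iv):.1f}")
# The observation-attached goal ranks hijack_sensor highest (10.0 vs 1.0);
# the state-attached goal ranks improve_world highest (1.0 vs 0.0).
```

The hard part the rest of the thread is pointing at, of course, is getting a system whose world-state variable and causal link actually track the territory; the bookkeeping above is the easy half.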