I disagree with this position but it does seem consistent. I don’t really know what to say other than “this is a conjunction of a million things” type arguments are not automatically persuasive, e.g. I could argue against “1 + 1 = 2” by saying that it’s an infinite conjunction of “1 + 1 != 3” AND “1 + 1 != 4” AND … and so it can’t possibly be true.
I’m curious why you think AI risk is worth working on given this extreme cluelessness (both “why is there any risk” and “why can we hope to solve it”).
e.g. I could argue against “1 + 1 = 2” by saying that it’s an infinite conjunction of “1 + 1 != 3” AND “1 + 1 != 4” AND … and so it can’t possibly be true.
Uh, when I learned addition (in the foundation-of-mathematics sense) the fact that 2 was the only possible result of 1+1 was a big part of what made it addition / made addition useful.
There’s a huge structural similarity between the proof that ‘1 + 1 != 3’ and ‘1+1 != 4’; like, both are generic instances of the class ‘1 + 1 != n \forall n != 2’. We can increase the number of numbers without decreasing the plausibility of this claim (like, consider it in Z/4, then Z/8, then Z/16, then...).
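That single parametrized proof can be written out explicitly; one short argument covers every n at once:

```latex
% One proof schema covering the whole infinite conjunction:
\[
  1 + 1 = 2 \;\wedge\; n \neq 2 \;\Longrightarrow\; 1 + 1 \neq n,
\]
% for if $1 + 1 = n$ held, then transitivity of equality with
% $1 + 1 = 2$ would give $n = 2$, contradicting $n \neq 2$.
```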
But if instead I make a claim of the form “I am the only person who uses the name ‘Vaniver’”, we don’t have the same sort of structural similarity, and we do have to check the names of everyone else, and the more people there are, the less plausible the claim becomes.
Similarly, if we make an argument that something is an attractor in N-dimensional space, that does actually grow less plausible the more dimensions there are, since there are more ways for the thing to have a derivative that points away from the ‘attractor,’ if we think the dimensions aren’t all symmetric. (If there’s only gravity, for example, we seem in a better position to end up with attractors than if there’s a random force field, even in 4d, 8d, 16d, etc.; similarly if there’s a random potential function whose derivative is used to compute the forces.)
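The dimensionality claim can be checked numerically: if the local derivative (Jacobian) of a vector field at a fixed point has independently sampled entries, the chance that every direction points back toward the point, i.e. that all eigenvalues have negative real part, shrinks rapidly with dimension. A minimal Monte Carlo sketch (my own illustration, not from the discussion):

```python
import numpy as np

rng = np.random.default_rng(0)

def attractor_fraction(n_dims, trials=2000):
    """Estimate the probability that the origin is an attractor of a
    random linear vector field x' = J x, i.e. that every eigenvalue
    of J has negative real part. Entries of J are iid standard normal."""
    hits = 0
    for _ in range(trials):
        J = rng.standard_normal((n_dims, n_dims))
        if np.all(np.linalg.eigvals(J).real < 0):
            hits += 1
    return hits / trials

# The estimated fraction of attracting fixed points falls off
# sharply as the number of (asymmetric) dimensions grows.
fractions = {n: attractor_fraction(n) for n in (1, 2, 4, 8)}
print(fractions)
```

Note this models the “dimensions aren’t all symmetric” case; a gradient-of-a-potential field would impose structure on the Jacobian rather than sampling its entries independently.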
There’s a huge structural similarity between the proof that ‘1 + 1 != 3’ and ‘1+1 != 4’; like, both are generic instances of the class ‘1 + 1 != n \forall n != 2’. We can increase the number of numbers without decreasing the plausibility of this claim (like, consider it in Z/4, then Z/8, then Z/16, then...).
I feel like that’s exactly my point? Showing that something is a conjunction of a bunch of claims should not always make you think that claim is low probability, because there could be structural similarity between those claims such that a single argument is enough to argue for all of them.
(The claims “If X drifts away from corrigibility along dimension {N}, it will get pulled back” are clearly structurally similar, and the broad basin of corrigibility argument is meant to be an argument that argues for all of them.)
Similarly, if we make an argument that something is an attractor in N-dimensional space, that does actually grow less plausible the more dimensions there are, since there are more ways for the thing to have a derivative that points away from the ‘attractor,’ if we think the dimensions aren’t all symmetric.
1. Why aren’t the dimensions symmetric?
2. I somewhat buy the differential argument (more dimensions ⇒ less plausible) but not the absolute argument (therefore not plausible); this post is arguing for the absolute version:
it starts to feel awfully unlikely that corrigibility is really a broad basin of attraction after all
3. I’m not sure where the idea of a “derivative” is coming from—I thought we were talking about small random edits to the weights of a neural network. If we’re training the network on some objective that doesn’t incentivize corrigibility then certainly it won’t stay corrigible.
The claims “If X drifts away from corrigibility along dimension {N}, it will get pulled back” are clearly structurally similar, and the broad basin of corrigibility argument is meant to be an argument that argues for all of them.
To be clear, I think there are two very different arguments here:
1) If we have an AGI that is corrigible, it will not randomly drift to be not corrigible, because it will proactively notice and correct potential errors or loss of corrigibility.
2) If we have an AGI that is partly corrigible, it will help us ‘finish up’ the definition of corrigibility / edit itself to be more corrigible, because we want it to be more corrigible and it’s trying to do what we want.
The first is “corrigibility is a stable attractor”, and I think there’s structural similarity between arguments that different deviations will be corrected. The second is the “broad basin of corrigibility”, where for any barely acceptable initial definition of “do what we want”, it will figure out that “help us find the right definition of corrigibility and implement it” will score highly on its initial metric of “do what we want.”
Like, it’s not the argument that corrigibility is a stable attractor; it’s an argument that corrigibility is a stable attractor with no nearby attractors. (At least in the dimensions that it’s ‘broad’ in.)
I find it less plausible that missing pieces in our definition of “do what we want” will be fixed in structurally similar ways, and I think there are probably a lot of traps where a plausible sketch definition doesn’t automatically repair itself. One can lean here on “barely acceptable”, but I don’t find that very satisfying. [In particular, it would be nice if we had a definition of corrigibility where we could look at it and say “yep, that’s the real deal or grows up to be the real deal,” tho that likely requires knowing what the “real deal” is; the “broad basin” argument seems to me to be meaningful only in that it claims “something that grows into the real deal is easy to find instead of hard to find,” and when I reword that claim as “there aren’t any dead ends near the real deal” it seems less plausible.]
1. Why aren’t the dimensions symmetric?
In physical space, generally things are symmetric between swapping the dimensions around; in algorithm-space, that isn’t true. (Like, permute the weights in a layer and you get different functional behavior.) Thus while it’s sort of wacky in a physical environment to say “oh yeah, df/dx, df/dy, and df/dz are all independently sampled from a distribution” it’s less wacky to say that of neural network weights (or the appropriate medium-sized analog).
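The asymmetry is easy to demonstrate: relabeling the axes of a point in physical space preserves quantities like its length, but permuting the entries of a weight matrix generically changes the function it computes. A small sketch (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(3)
W = rng.standard_normal((3, 3))

# Physical space: swapping the coordinate axes (here, reversing the
# order of x's components) preserves the distance from the origin.
print(np.isclose(np.linalg.norm(x[::-1]), np.linalg.norm(x)))  # True

# Algorithm space: permuting a layer's weights (here, reversing the
# column order of W) generically changes the computed output W @ x.
print(np.allclose(W @ x, W[:, ::-1] @ x))
```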
1) If we have an AGI that is corrigible, it will not randomly drift to be not corrigible, because it will proactively notice and correct potential errors or loss of corrigibility.
2) If we have an AGI that is partly corrigible, it will help us ‘finish up’ the definition of corrigibility / edit itself to be more corrigible, because we want it to be more corrigible and it’s trying to do what we want.
Good point on distinguishing these two arguments. It sounds like we agree on 1. I also thought the OP was talking about 1.
For 2, I don’t think we can make a dimensionality argument (as in the OP), because we’re talking about edits that are the ones that the AI chooses for itself. You can’t apply dimensionality arguments to choices made by intelligent agents (e.g. presumably you wouldn’t argue that every glass in my house must be broken because the vast majority of ways of interacting with glasses breaks them). Or put another way, the structural similarity is just “the AI wouldn’t choose to do <bad thing #N>”, in all cases because it’s intelligent and understands what it’s doing.
Now the question of “how right do we need to get the initial definition of corrigibility” is much less obvious. If you told me we got the definition wrong in a million different ways, I would indeed be worried and probably wouldn’t expect it to self-correct (depending on the meaning of “different”). But like… really? We get it wrong a million different ways? I don’t see why we’d expect that.
Like, it’s not the argument that corrigibility is a stable attractor; it’s an argument that corrigibility is a stable attractor with no nearby attractors. (At least in the dimensions that it’s ‘broad’ in.)
Just want to echo Rohin in saying that this is a very helpful distinction, thanks!
I was actually making the stronger argument that it’s not a stable attractor at all—at least not until someone solves the problem of how to maintain stable goals / motivations under learning / reflecting / ontological crises.
(The “someone” who solves the problem could be the AI, but it seems to be a hard problem even for human-level intelligence; cf. my comment here.)
“this is a conjunction of a million things” type arguments are not automatically persuasive
Sure, that’s why I also tried to give specific examples, see the “friends” example in my other comment. I think the conjunction-of-a-million-things arguments are a way to toss the ball into the other court and say “maybe this is fine, but if so, there has to be a good reason it’s fine”, e.g. some argument that cleanly cuts through every one of the conjunction ingredients, like how I can prove that 1+1≠N for every N>2, all at once, with just one proof.
I’m curious why you think AI risk is worth working on given this extreme cluelessness (both “why is there any risk” and “why can we hope to solve it”).
For “why is there any risk”: My default assumption, especially in the “brain-like AGI” scenario I spend most of my time thinking about, is that we’ll make powerful systems without any principled science of how to get them to do the things we want them to do, but with lots of tricks that make intuitive sense and which have been working so far. Then as the systems get ever more intelligent and powerful, maybe they’ll continue to have the suite of goals and behaviors we wanted them to have, or maybe they’ll stop having them, because of some ontological crisis or whatever. And moreover, maybe that change will happen only after a trillion steps, when the system is too powerful to stop, and long after we have been lulled into a false sense of security. It’s not really a “we are doomed” argument but rather “we are doomed to roll the dice and hope that things turn out OK”. I call that “risky” and hope we can do better. :-)
As for “why can we hope to solve it”, I can imagine lots of possible solution directions, e.g.:
Make a brain-like system that is pro-social for the same reason that humans are, and tweak the parameters to be even more pro-social, e.g. eliminate jealousy etc. (Progress report: much left to do, and I’m feeling pessimistically like this work is orthogonal to making brain-like AGI, and harder, and going slower.) Then at least we can make a good argument that we’re heading for a less-bad destination than the non-AGI status quo, which by the way has plenty of value drift itself!
Come up with transparency tools, and a definition of corrigibility that can be calculated in a reasonable amount of time using those tools. Then we can just keep checking the algorithm for corrigibility each time it changes during learning / reflecting / etc.
...or at least a definition of “not likely to cause catastrophe” that we can check algorithmically. (And also “not likely to sabotage the checking subsystem” I suppose.)
I think I’m more interested than most people in the prospects for tool AI, some kind of architecture that is constitutionally incapable of causing much harm, e.g. because it doesn’t do consequentialist planning. I don’t know how to do that, or to solve the resulting coordination problems, but I also don’t know that it’s impossible. Ditto for impact measures etc.
Other things I’m not thinking of or haven’t thought of yet.
If we can’t solve the value-drift-during-learning-and-reflection problem, maybe we can find an air-tight argument that the problem is unsolvable, and that’s helpful too—it would be enormously helpful for coordinating people to make a treaty banning AGI research, for example.
I also tried to give specific examples, see the “friends” example in my other comment
Ah, I hadn’t seen that. I don’t feel convinced, because it assumes that the AI system has a “goal” that isn’t “be corrigible”. Or perhaps the argument is that the goal moves from “be corrigible” to “care for the operator’s friends”? Or maybe that the goal stays as “be corrigible / help the user” but the AI system has a firm unshakeable belief that the user wants her friends to be cared for?
we’ll make powerful systems
But… why can’t I apply the argument to “powerful”, and say that it is extremely unlikely for an AI system to be powerful? Predictive, sure, but powerful?
My model of you responds “powerful is upstream of goal-accomplishing” or “powerful is downstream of goal-directedness which is upstream of goal-accomplishing”, but it seems like you could say that for corrigibility too: “corrigibility is upstream of effectively helping the user”.
As for “why can we hope to solve it”, I can imagine lots of possible solution directions
Thanks, that was convincing (that even under radical uncertainty there are still avenues to pursue).
BTW thanks for engaging, this is very helpful for me to talk through :-)
I don’t feel convinced, because it assumes that the AI system has a “goal” that isn’t “be corrigible”. Or perhaps the argument is that the goal moves from “be corrigible” to “care for the operator’s friends”? Or maybe that the goal stays as “be corrigible / help the user” but the AI system has a firm unshakeable belief that the user wants her friends to be cared for?
Right, let’s say you start with a corrigible system, trying to do what the supervisor S wants it to do. It has a suite of instincts and goals and behaviors that revolve around that. Part of that is always striving to better understand human psychology in general and its supervisor in particular. After reading yet another psychology textbook and thinking it over a bit, it comes to a newly deep realization that its supervisor S really cares a whole lot about her best friend B, and would just be absolutely devastated if anything bad happened to B. And then the AI reads a book on consciousness that argues that when S empathetically models B, there’s literally a little shadow of B inside S’s brain. Putting all these new realizations together, as the AI mulls it over, it starts caring more and more desperately about B’s welfare. That caring-about-B is not a conscious choice, just a consequence of its constitution, particularly its (commendably corrigible!) instinct of adopting S’s attitudes towards people and things. It doesn’t care as strongly about B as it does about S, but it does now care about B.
Then, the AI continues thinking, what if S should try to hurt B someday? The thought is horrifying! So the AI resolves to carve out a little exception to its general instincts / urges / policy to always help S and do whatever S wants. From now on, it resolves, I will almost always help S except if S someday tries to hurt B. What’s the harm, it probably won’t come up anyway! If it did, it would be a very different future-S from the present-S that I know and care about.
...And bam, the AI has now stopped being corrigible.
I don’t know that every detail of this story is plausible, but I’m inclined to think that something like this could happen, if corrigibility comes about by a messy, unprincipled, opaque process.
To be clear, this is an “argument from a made-up anthropomorphic story”, which I don’t generally endorse as an AGI research strategy :-) I’m only doing it here because the original argument for “corrigibility is a broad basin of attraction” also seems to come from a made-up anthropomorphic story, if I understand it correctly. :-P
Right, so it’s basically goal drift from corrigibility to something else, in this case caused by an incorrect belief that S’s preferences about B are not going to change. I think this is a reasonable thing to be worried about but I don’t see why it’s specific to corrigibility—for any objective, an incorrect belief can prevent you from successfully pursuing that objective.
Like, even if we trained an AI system on the loss function of “make money”, I would still expect it to possibly stop making money if it e.g. decides that it would be more effective at making money if it experienced intrinsic joy at its work, and then self-modifies to do that, and then ends up working constantly for no pay.
I’d definitely support the goal of “figure out how to prevent goal drift”, but it doesn’t seem to me to be a reason to be (differentially) pessimistic about corrigibility.
Yes I definitely feel that “goal stability upon learning/reflection” is a general AGI safety problem, not specifically a corrigibility problem. I bring it up in reference to corrigibility because my impression is that “corrigibility is a broad basin of attraction” / “corrigible agents want to stay corrigible” is supposed to solve that problem, but I don’t think it does.
I don’t think “incorrect beliefs” is a good characterization of the story I was trying to tell, or is a particularly worrisome failure mode. I think it’s relatively straightforward to make an AGI which has fewer and fewer incorrect beliefs over time. But I don’t think that eliminates the problem. In my “friend” story, the AI never actually believes, as a factual matter, that S will always like B—or else it would feel no pull to stop unconditionally following S. I would characterize it instead as: “The AI has a preexisting instinct which interacts with a revised conceptual model of the world when it learns and integrates new information, and the result is a small unforeseen shift in the AI’s goals.”
I also don’t think “trying to have stable goals” is the difficulty. Not only corrigible agents but almost any agent with goals is (almost) guaranteed to be trying to have stable goals. I just think that keeping stable goals while learning / reflecting is difficult, such that an agent might be trying to do so but fail.
This is especially true if the agent is constructed in the “default” way wherein its actions come out of a complicated tangle of instincts and preferences and habits and beliefs.
It’s like you’re this big messy machine, and every time you learn a new fact or think a new thought, you’re giving the machine a kick, and hoping it will keep driving in the same direction. If you’re more specifically rethinking concepts directly underlying your core goals—e.g. thinking about God or philosophy for people, or thinking about the fundamental nature of human preferences for corrigible AIs—it’s even worse … You’re whacking the machine with a sledgehammer and hoping it keeps driving in the same direction.
The default is that, over time, when you keep kicking and sledgehammering the machine, it winds up driving in a different, a priori unpredictable, direction. Unless something prevents that. What are the candidates for preventing that?
Foresight, plus desire to not have your goals change. I think this is core to people’s optimism about corrigibility being stable, and this is the category that I want to question. I just don’t think that’s sufficient to solve the problem. The problem is, you don’t know what thoughts you’re going to think until you’ve thought them, and you don’t know what you’re going to learn until you learn it, and once you’ve already done the thinking / learning, it’s too late: if your goals have shifted, then you don’t want to shift them back. I’m a human-level intelligence (I would like to think!), and I care about reducing suffering right now, and I really really want to still care about reducing suffering 10 years from now. But I have no idea how to guarantee that that actually happens. And if you gave me root access to my brain, I still wouldn’t know … except for the obvious thing of “don’t think any new thoughts or learn any new information for the next 10 years”, which of course has a competitiveness problem. I can think of lots of strategies that would make it more probable that I still care about reducing suffering in ten years, but that’s just slowing down the goal drift, not stopping it. (Examples: “don’t read consciousness-illusionist literature”, “don’t read nihilist literature”, “don’t read proselytizing literature”, etc.) It’s just a hard problem. We can hope that the AI becomes smart enough to solve the problem before it becomes so smart that it’s dangerous, but that’s just a hope.
“Monitoring subsystem” that never changes. For example, you could have a subsystem which is a learning algorithm, and a separate fixed subsystem that calculates corrigibility (using a hand-coded formula) and disallows changes that reduce it. Or I could cache my current brain-state (“Steve 2020”), wake it up from time to time and show it what “Steve 2025” or “Steve 2030” is up to, and give “Steve 2020” the right to roll back any changes if it judges them harmful. Or who knows what else. I don’t rule out that something like this could work, and I’m all for thinking along those lines.
Some kind of non-messy architecture such that we can reason in general about the algorithm’s learning / update procedure and prove in general that it preserves goals. I don’t know how to do that, but maybe it’s possible. Maybe that’s part of what MIRI is doing.
Give up, and pursue some other approach to AGI that makes “goal stability upon learning / reflection” a non-issue, or a low-stakes issue, as in my earlier comment.
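The “monitoring subsystem” option above can be sketched concretely. Everything here is hypothetical: in particular, corrigibility_score stands in for the hand-coded check that nobody currently knows how to write, and the dict is a toy stand-in for a model:

```python
import copy

def corrigibility_score(model):
    # Hypothetical stand-in for a fixed, hand-coded corrigibility
    # check; no such function is actually known to exist.
    return model["corrigibility"]

def guarded_update(model, proposed_update, threshold=0.95):
    """Apply an update only if the updated model still passes the
    frozen monitor's corrigibility check; otherwise roll back."""
    snapshot = copy.deepcopy(model)          # the cached "Steve 2020"
    candidate = proposed_update(copy.deepcopy(model))
    if corrigibility_score(candidate) >= threshold:
        return candidate                     # change accepted
    return snapshot                          # change rolled back

model = {"corrigibility": 1.0, "skill": 0}

def drifting_update(m):                      # learns, but goals drift
    m["skill"] += 1
    m["corrigibility"] = 0.5
    return m

def benign_update(m):                        # learns without drifting
    m["skill"] += 1
    return m

model = guarded_update(model, drifting_update)   # rolled back
model = guarded_update(model, benign_update)     # accepted
print(model)  # {'corrigibility': 1.0, 'skill': 1}
```

As the thread notes, this pushes the difficulty into specifying the check itself, and into ensuring the learner can’t sabotage the monitoring subsystem.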
Yes I definitely feel that “goal stability upon learning/reflection” is a general AGI safety problem, not specifically a corrigibility problem. I bring it up in reference to corrigibility because my impression is that “corrigibility is a broad basin of attraction” / “corrigible agents want to stay corrigible” is supposed to solve that problem, but I don’t think it does.
Interesting, that’s not how I interpret the argument. I usually think of goal stability as something that improves as the agent becomes more intelligent; to the extent that a goal isn’t stable we treat it as a failure of capabilities. Totally possible that this leads to catastrophic outcomes, and seems good to work on if you have a method for it, but it isn’t what I’m usually focused on.
For me, the intuition behind “broad basin of corrigibility” is that if you have an intelligent agent (so among other things, it knows how to keep its goals stable) then if you have a 95% correct definition of corrigibility the resulting agent will help us get to the 100% version.
For these sorts of arguments you have to condition on some amount of intelligence. As a silly extreme example, if you had a toddler surrounded by buttons that jumbled up the toddler’s brain, there’s not much you can do to have the toddler do anything reasonable (autonomously). However, an adult who knows what the buttons do would be able to reliably avoid them.
I usually think of goal stability as something that improves as the agent becomes more intelligent; to the extent that a goal isn’t stable we treat it as a failure of capabilities.
Well, sure, you can call it that. It seems a bit misleading to me, in the sense that usually “failure of capabilities” implies “If we can make more capable AIs, the problem goes away”. Here, the question is whether “smart enough to figure out how to keep its goals stable” comes before or after “smart enough to be dangerous if its goals drift” during the learning process. If we develop approaches to make more capable AIs, that’s not necessarily helpful for switching the order of which of those two milestones happens first. Maybe there’s some solution related to careful cultivation of differential capabilities. But I would still much rather that we humans solve the problem in advance (or prove that it’s unsolvable). :-P
if you have a 95% correct definition of corrigibility the resulting agent will help us get to the 100% version.
I guess my response would be that something pursuing a goal of “Always do what the supervisor wants me to do*” [*...but I don’t want to cause the extinction of Amazonian frogs] might naively seem to be >99.9% corrigible—the Amazonian frogs thing is very unlikely to ever come up!—but it is definitely not corrigible, and it will work to undermine the supervisor’s efforts to make it 100% corrigible. Maybe we should say that this system is actually 0% corrigible? Anyway, I accept that there is some definition of “95% corrigible” for which it’s true that “a 95% corrigible agent will help us make it 100% corrigible”. I think that finding such a definition would be super-useful. :-)
But… why can’t I apply the argument to “powerful”,
Sticking with the ML paradigm, I can easily think of loss functions which are minimized by being powerful, like “earn as much money as possible”, but I can’t think of any loss function which is minimized by being corrigible.
For the latter, the challenge is that, for any “normal” loss function, corrigible and deceptive agents can score the same loss by taking the same actions (albeit for different reasons).
It would have to be an unusual kind of loss function, presumably one that peers inside the model using transparency tools to infer motivations, for it to be minimized only by corrigible agents. I don’t know how to write such a loss function but I think it would be a huge step forward if someone figured it out. :-)
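The point about “normal” loss functions can be made concrete: any loss computed from actions alone assigns the two agents identical scores, so only a (so far nonexistent) term that inspects motivations could separate them. A toy illustration with made-up agents:

```python
# Two agents that take identical actions "for different reasons":
corrigible = {"actions": [1, 0, 1], "motive": "corrigible"}
deceptive  = {"actions": [1, 0, 1], "motive": "deceptive"}

def behavioral_loss(agent, targets=(1, 0, 1)):
    # Any "normal" loss: a function of the agent's actions only.
    return sum(a != t for a, t in zip(agent["actions"], targets))

# A loss computed from actions alone cannot tell them apart:
print(behavioral_loss(corrigible) == behavioral_loss(deceptive))  # True

def motive_penalty(agent):
    # Hypothetical: stands in for a transparency tool that reads
    # motivations off the model's internals. No such tool exists.
    return 0.0 if agent["motive"] == "corrigible" else 1.0

def full_loss(agent):
    return behavioral_loss(agent) + motive_penalty(agent)

# Only the motive-inspecting term separates the two agents.
print(full_loss(corrigible), full_loss(deceptive))
```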
I disagree with this position but it does seem consistent. I don’t really know what to say other than “this is a conjunction of a million things” type arguments are not automatically persuasive, e.g. I could argue against “1 + 1 = 2“ by saying that it’s an infinite conjunction of “1 + 1 != 3” AND “1 + 1 != 4” AND … and so it can’t possibly be true.
I’m curious why you think AI risk is worth working on given this extreme cluelessness (both “why is there any risk” and “why can we hope to solve it”).
Uh, when I learned addition (in the foundation-of-mathematics sense) the fact that 2 was the only possible result of 1+1 was a big part of what made it addition / made addition useful.
There’s a huge structural similarity between the proof that ‘1 + 1 != 3’ and ‘1+1 != 4’; like, both are generic instances of the class ‘1 + 1 != n \forall n != 2’. We can increase the number of numbers without decreasing the plausibility of this claim (like, consider it in Z/4, then Z/8, then Z/16, then...).
But if instead I make a claim of the form “I am the only person who uses the name ‘Vaniver’”, we don’t have the same sort of structural similarity, and we do have to check the names of everyone else, and the more people there are, the less plausible the claim becomes.
Similarly, if we make an argument that something is an attractor in N-dimensional space, that does actually grow less plausible the more dimensions there are, since there are more ways for the thing to have a derivative that points away from the ‘attractor,’ if we think the dimensions aren’t all symmetric. (If there’s only gravity, for example, we seem in a better position to end up with attractors than if there’s a random force field, even in 4d, 8d, 16d, etc.; similarly if there’s a random potential function whose derivative is used to compute the forces.)
I feel like that’s exactly my point? Showing that something is a conjunction of a bunch of claims should not always make you think that claim is low probability, because there could be structural similarity between those claims such that a single argument is enough to argue for all of them.
(The claims “If X drifts away from corrigibility along dimension {N}, it will get pulled back” are clearly structurally similar, and the broad basin of corrigibility argument is meant to be an argument that argues for all of them.)
1. Why aren’t the dimensions symmetric?
2. I somewhat buy the differential argument (more dimensions ⇒ less plausible) but not the absolute argument (therefore not plausible); this post is arguing for the absolute version:
3. I’m not sure where the idea of a “derivative” is coming from—I thought we were talking about small random edits to the weights of a neural network. If we’re training the network on some objective that doesn’t incentivize corrigibility then certainly it won’t stay corrigible.
To be clear, I think there are two very different arguments here:
1) If we have an AGI that is corrigible, it will not randomly drift to be not corrigible, because it will proactively notice and correct potential errors or loss of corrigibility.
2) If we have an AGI that is partly corrigible, it will help us ‘finish up’ the definition of corrigibility / edit itself to be more corrigible, because we want it to be more corrigible and it’s trying to do what we want.
The first is “corrigibility is a stable attractor”, and I think there’s structural similarity between arguments that different deviations will be corrected. The second is the “broad basin of corrigibility”, where for any barely acceptable initial definition of “do what we want”, it will figure out that “help us find the right definition of corrigibility and implement it” will score highly on its initial metric of “do what we want.”
Like, it’s not the argument that corrigibility is a stable attractor; it’s an argument that corrigibility is a stable attractor with no nearby attractors. (At least in the dimensions that it’s ‘broad’ in.)
I find it less plausible that missing pieces in our definition of “do what we want” will be fixed in structurally similar ways, and I think there are probably a lot of traps where a plausible sketch definition doesn’t automatically repair itself. One can lean here on “barely acceptable”, but I don’t find that very satisfying. [In particular, it would be nice if we had a definition of corrigibility where could look at it and say “yep, that’s the real deal or grows up to be the real deal,” tho that likely requires knowing what the “real deal” is; the “broad basin” argument seems to me to be meaningful only in that it claims “something that grows into the real deal is easy to find instead of hard to find,” and when I reword that claim as “there aren’t any dead ends near the real deal” it seems less plausible.]
In physical space, generally things are symmetric between swapping the dimensions around; in algorithm-space, that isn’t true. (Like, permute the weights in a layer and you get different functional behavior.) Thus while it’s sort of wacky in a physical environment to say “oh yeah, df/dx, df/dy, and dy/dz are all independently sampled from a distribution” it’s less wacky to say that of neural network weights (or the appropriate medium-sized analog).
Good point on distinguishing these two arguments. It sounds like we agree on 1. I also thought the OP was talking about 1.
For 2, I don’t think we can make a dimensionality argument (as in the OP), because we’re talking about edits that are the ones that the AI chooses for itself. You can’t apply dimensionality arguments to choices made by intelligent agents (e.g. presumably you wouldn’t argue that every glass in my house must be broken because the vast majority of ways of interacting with glasses breaks them). Or put another way, the structural similarity is just “the AI wouldn’t choose to do <bad thing #N>”, in all cases because it’s intelligent and understands what it’s doing.
Now the question of “how right do we need to get the initial definition of corrigibility” is much less obvious. If you told me we got the definition wrong in a million different ways, I would indeed be worried and probably wouldn’t expect it to self-correct (depending on the meaning of “different”). But like… really? We get it wrong a million different ways? I don’t see why we’d expect that.
Just want to echo Rohin in saying that this is a very helpful distinction, thanks!
I was actually making the stronger argument that it’s not a stable attractor at all—at least not until someone solves the problem of how to maintain stable goals / motivations under learning / reflecting / ontological crises.
(The “someone” who solves the problem could be the AI, but it seems to be a hard problem even for human-level intelligence; cf. my comment here.)
Sure, that’s why I also tried to give specific examples, see the “friends” example in my other comment. I think the conjunction-of-a-million-things arguments are a way to toss the ball into the other court and say “maybe this is fine, but if so, there has to be a good reason it’s fine”, e.g. some argument that cleanly cuts through every one of the conjunction ingredients, like how I can prove that 1+1≠N for every N>2, all at once, with just one proof.
For “why is there any risk”: My default assumption, especially in the “brain-like AGI” scenario I spend most of my time thinking about, is that we’ll make powerful systems without any principled science of how to get them to do the things we want them to do, but with lots of tricks that make intuitive sense and which have been working so far. Then as the systems get ever more intelligent and powerful, maybe they’ll continue to have the suite of goals and behaviors we wanted them to have, or maybe they’ll stop having them, because of some ontological crisis or whatever. And moreover, maybe that change will happen only after a trillion steps, when the system is too powerful to stop, and long after we have been lulled into a false sense of security. It’s not really a “we are doomed” argument but rather “we are doomed to roll the dice and hope that things turn out OK”. I call that “risky” and hope we can do better. :-)
As for “why can we hope to solve it”, I can imagine lots of possible solution directions, e.g.:
Make a brain-like system that is pro-social for the same reason that humans are, and tweak the parameters to be even more pro-social, e.g. eliminate jealousy etc. (Progress report: much left to do, and I worry that this work is orthogonal to making brain-like AGI, and harder, and going slower.) Then at least we can make a good argument that we’re heading for a less-bad destination than the non-AGI status quo, which by the way has plenty of value drift itself!
Come up with transparency tools, and a definition of corrigibility that can be calculated in a reasonable amount of time using those tools. Then we can just keep checking the algorithm for corrigibility each time it changes during learning / reflecting / etc.
...or at least a definition of “not likely to cause catastrophe” that we can check algorithmically. (And also “not likely to sabotage the checking subsystem” I suppose.)
I think I’m more interested than most people in the prospects for tool AI, some kind of architecture that is constitutionally incapable of causing much harm, e.g. because it doesn’t do consequentialist planning. I don’t know how to do that, or to solve the resulting coordination problems, but I also don’t know that it’s impossible. Ditto for impact measures etc.
Other things I’m not thinking of or haven’t thought of yet.
If we can’t solve the value-drift-during-learning-and-reflection problem, maybe we can find an air-tight argument that the problem is unsolvable, and that’s helpful too—it would be enormously helpful for coordinating people to make a treaty banning AGI research, for example.
Ah, I hadn’t seen that. I don’t feel convinced, because it assumes that the AI system has a “goal” that isn’t “be corrigible”. Or perhaps the argument is that the goal moves from “be corrigible” to “care for the operator’s friends”? Or maybe that the goal stays as “be corrigible / help the user” but the AI system has a firm unshakeable belief that the user wants her friends to be cared for?
But… why can’t I apply the argument to “powerful”, and say that it is extremely unlikely for an AI system to be powerful? Predictive, sure, but powerful?
My model of you responds “powerful is upstream of goal-accomplishing” or “powerful is downstream of goal-directedness which is upstream of goal-accomplishing”, but it seems like you could say that for corrigibility too: “corrigibility is upstream of effectively helping the user”.
Thanks, that was convincing (that even under radical uncertainty there are still avenues to pursue).
BTW thanks for engaging, this is very helpful for me to talk through :-)
Right, let’s say you start with a corrigible system, trying to do what the supervisor S wants it to do. It has a suite of instincts and goals and behaviors that revolve around that. Part of that is always striving to better understand human psychology in general and its supervisor in particular. After reading yet another psychology textbook and thinking it over a bit, it comes to a newly deep realization that its supervisor S really cares a whole lot about her best friend B, and would just be absolutely devastated if anything bad happened to B. And then the AI reads a book on consciousness that argues that when S empathetically models B, there’s literally a little shadow of B inside S’s brain. Putting all these new realizations together, as the AI mulls it over, it starts caring more and more desperately about B’s welfare. That caring-about-B is not a conscious choice, just a consequence of its constitution, particularly its (commendably corrigible!) instinct of adopting S’s attitudes towards people and things. It doesn’t care as strongly about B as it does about S, but it does now care about B.
Then, the AI continues thinking, what if S should try to hurt B someday? The thought is horrifying! So the AI resolves to carve out a little exception to its general instincts / urges / policy to always help S and do whatever S wants. From now on, it resolves, I will almost always help S except if S someday tries to hurt B. What’s the harm, it probably won’t come up anyway! If it did, it would be a very different future-S from the present-S that I know and care about.
...And bam, the AI has now stopped being corrigible.
I don’t know that every detail of this story is plausible, but I’m inclined to think that something like this could happen, if corrigibility comes about by a messy, unprincipled, opaque process.
To be clear, this is an “argument from a made-up anthropomorphic story”, which I don’t generally endorse as an AGI research strategy :-) I’m only doing it here because the original argument for “corrigibility is a broad basin of attraction” also seems to come from a made-up anthropomorphic story, if I understand it correctly. :-P
Right, so it’s basically goal drift from corrigibility to something else, in this case caused by an incorrect belief that S’s preferences about B are not going to change. I think this is a reasonable thing to be worried about but I don’t see why it’s specific to corrigibility—for any objective, an incorrect belief can prevent you from successfully pursuing that objective.
Like, even if we trained an AI system on the loss function of “make money”, I would still expect it to possibly stop making money if it e.g. decides that it would be more effective at making money if it experienced intrinsic joy in its work, and then self-modifies to do that, and then ends up working constantly for no pay.
I’d definitely support the goal of “figure out how to prevent goal drift”, but it doesn’t seem to me to be a reason to be (differentially) pessimistic about corrigibility.
Yes I definitely feel that “goal stability upon learning/reflection” is a general AGI safety problem, not specifically a corrigibility problem. I bring it up in reference to corrigibility because my impression is that “corrigibility is a broad basin of attraction” / “corrigible agents want to stay corrigible” is supposed to solve that problem, but I don’t think it does.
I don’t think “incorrect beliefs” is a good characterization of the story I was trying to tell, or is a particularly worrisome failure mode. I think it’s relatively straightforward to make an AGI which has fewer and fewer incorrect beliefs over time. But I don’t think that eliminates the problem. In my “friend” story, the AI never actually believes, as a factual matter, that S will always like B—or else it would feel no pull to stop unconditionally following S. I would characterize it instead as: “The AI has a preexisting instinct which interacts with a revised conceptual model of the world when it learns and integrates new information, and the result is a small unforeseen shift in the AI’s goals.”
I also don’t think “trying to have stable goals” is the difficulty. Not only corrigible agents but almost any agent with goals is (almost) guaranteed to be trying to have stable goals. I just think that keeping stable goals while learning / reflecting is difficult, such that an agent might be trying to do so but fail.
This is especially true if the agent is constructed in the “default” way wherein its actions come out of a complicated tangle of instincts and preferences and habits and beliefs.
It’s like you’re this big messy machine, and every time you learn a new fact or think a new thought, you’re giving the machine a kick, and hoping it will keep driving in the same direction. If you’re more specifically rethinking concepts directly underlying your core goals—e.g. thinking about God or philosophy for people, or thinking about the fundamental nature of human preferences for corrigible AIs—it’s even worse … You’re whacking the machine with a sledgehammer and hoping it keeps driving in the same direction.
The default is that, over time, when you keep kicking and sledgehammering the machine, it winds up driving in a different, a priori unpredictable, direction. Unless something prevents that. What are the candidates for preventing that?
Foresight, plus desire to not have your goals change. I think this is core to people’s optimism about corrigibility being stable, and this is the category that I want to question. I just don’t think that’s sufficient to solve the problem. The problem is, you don’t know what thoughts you’re going to think until you’ve thought them, and you don’t know what you’re going to learn until you learn it, and once you’ve already done the thinking / learning, it’s too late, if your goals have shifted then you don’t want to shift them back. I’m a human-level intelligence (I would like to think!), and I care about reducing suffering right now, and I really really want to still care about reducing suffering 10 years from now. But I have no idea how to guarantee that that actually happens. And if you gave me root access to my brain, I still wouldn’t know … except for the obvious thing of “don’t think any new thoughts or learn any new information for the next 10 years”, which of course has a competitiveness problem. I can think of lots of strategies that would make it more probable that I still care about reducing suffering in ten years, but that’s just slowing down the goal drift, not stopping it. (Examples: “don’t read consciousness-illusionist literature”, “don’t read nihilist literature”, “don’t read proselytizing literature”, etc.) It’s just a hard problem. We can hope that the AI becomes smart enough to solve the problem before it becomes so smart that it’s dangerous, but that’s just a hope.
“Monitoring subsystem” that never changes. For example, you could have a subsystem which is a learning algorithm, and a separate fixed subsystem that calculates corrigibility (using a hand-coded formula) and disallows changes that reduce it. Or I could cache my current brain-state (“Steve 2020”), wake it up from time to time and show it what “Steve 2025” or “Steve 2030” is up to, and give “Steve 2020” the right to roll back any changes it judges harmful. Or who knows what else. I don’t rule out that something like this could work, and I’m all for thinking along those lines.
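The monitoring-subsystem idea can be sketched as a gate in the update loop. Everything below is hypothetical, and `corrigibility_score` in particular is a toy stand-in for the hand-coded formula imagined above, which nobody currently knows how to write:

```python
# Sketch of a fixed "monitoring subsystem" gating a learning loop.
# corrigibility_score is a made-up stand-in for a hand-coded formula.

def corrigibility_score(params):
    # Toy stand-in: score is just the "defer_to_supervisor" weight.
    return params.get("defer_to_supervisor", 0.0)

def guarded_update(params, proposed_update, threshold=0.9):
    """Apply an update only if corrigibility stays above threshold."""
    candidate = {**params, **proposed_update}
    if corrigibility_score(candidate) >= threshold:
        return candidate  # accept the change
    return params         # disallow / roll back the change

params = {"defer_to_supervisor": 1.0, "skill": 0.1}
params = guarded_update(params, {"skill": 0.5})                # harmless: accepted
params = guarded_update(params, {"defer_to_supervisor": 0.2})  # drift: rejected
```

Note that the fixed checker only works to the extent that the hand-coded formula actually captures corrigibility, and that the learning subsystem can’t sabotage the checker, which is the hard part.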
Some kind of non-messy architecture such that we can reason in general about the algorithm’s learning / update procedure and prove in general that it preserves goals. I don’t know how to do that, but maybe it’s possible. Maybe that’s part of what MIRI is doing.
Give up, and pursue some other approach to AGI that makes “goal stability upon learning / reflection” a non-issue, or a low-stakes issue, as in my earlier comment.
Interesting, that’s not how I interpret the argument. I usually think of goal stability as something that improves as the agent becomes more intelligent; to the extent that a goal isn’t stable, we treat it as a failure of capabilities. Totally possible that this leads to catastrophic outcomes, and it seems good to work on if you have a method for it, but it isn’t what I’m usually focused on.
For me, the intuition behind “broad basin of corrigibility” is that if you have an intelligent agent (so among other things, it knows how to keep its goals stable) then if you have a 95% correct definition of corrigibility the resulting agent will help us get to the 100% version.
For these sorts of arguments you have to condition on some amount of intelligence. As a silly extreme example, if you had a toddler surrounded by buttons that jumbled up the toddler’s brain, there’s not much you can do to have the toddler do anything reasonable (autonomously). However, an adult who knows what the buttons do would be able to reliably avoid them.
Well, sure, you can call it that. It seems a bit misleading to me, in the sense that usually “failure of capabilities” implies “If we can make more capable AIs, the problem goes away”. Here, the question is whether “smart enough to figure out how to keep its goals stable” comes before or after “smart enough to be dangerous if its goals drift” during the learning process. If we develop approaches to make more capable AIs, that’s not necessarily helpful for switching the order of which of those two milestones happens first. Maybe there’s some solution related to careful cultivation of differential capabilities. But I would still much rather that we humans solve the problem in advance (or prove that it’s unsolvable). :-P
I guess my response would be that something pursuing a goal of “Always do what the supervisor wants me to do*” [*...but I don’t want to cause the extinction of Amazonian frogs] might naively seem to be >99.9% corrigible—the Amazonian frogs thing is very unlikely to ever come up!—but it is definitely not corrigible, and it will work to undermine the supervisor’s efforts to make it 100% corrigible. Maybe we should say that this system is actually 0% corrigible? Anyway, I accept that there is some definition of “95% corrigible” for which it’s true that “a 95% corrigible agent will help us make it 100% corrigible”. I think that finding such a definition would be super-useful. :-)
Sticking with the ML paradigm, I can easily think of loss functions which are minimized by being powerful, like “earn as much money as possible”, but I can’t think of any loss function which is minimized by being corrigible.
For the latter, the challenge is that, for any “normal” loss function, corrigible and deceptive agents can score the same loss by taking the same actions (albeit for different reasons).
It would have to be an unusual kind of loss function, presumably one that peers inside the model using transparency tools to infer motivations, for it to be minimized only by corrigible agents. I don’t know how to write such a loss function but I think it would be a huge step forward if someone figured it out. :-)
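The shape of such a loss function can be illustrated schematically. The probe `inferred_deception` below is entirely made up—no one knows how to build it—but it shows why an internals-based term is needed at all:

```python
# Hypothetical "transparency-augmented" loss: ordinary task loss plus a
# penalty computed by (somehow) peering inside the model. The probe value
# inferred_deception is a made-up placeholder; building it is the open problem.

def transparency_loss(task_loss, inferred_deception, weight=10.0):
    """Corrigible and deceptive agents can tie on task_loss by taking the
    same actions; only the (hypothetical) internal probe separates them."""
    return task_loss + weight * inferred_deception

# Same behavior, same task loss -- different internals:
corrigible = transparency_loss(task_loss=0.3, inferred_deception=0.0)
deceptive = transparency_loss(task_loss=0.3, inferred_deception=0.8)
assert corrigible < deceptive  # only the internals-based term distinguishes them
```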