BTW thanks for engaging, this is very helpful for me to talk through :-)
I don’t feel convinced, because it assumes that the AI system has a “goal” that isn’t “be corrigible”. Or perhaps the argument is that the goal moves from “be corrigible” to “care for the operator’s friends”? Or maybe that the goal stays as “be corrigible / help the user” but the AI system has a firm unshakeable belief that the user wants her friends to be cared for?
Right, let’s say you start with a corrigible system, trying to do what the supervisor S wants it to do. It has a suite of instincts and goals and behaviors that revolve around that. Part of that is always striving to better understand human psychology in general and its supervisor in particular. After reading yet another psychology textbook and thinking it over a bit, it comes to a newly deep realization that its supervisor S really cares a whole lot about her best friend B, and would just be absolutely devastated if anything bad happened to B. And then the AI reads a book on consciousness that argues that when S empathetically models B, there’s literally a little shadow of B inside S’s brain. Putting all these new realizations together, as the AI mulls it over, it starts caring more and more desperately about B’s welfare. That caring-about-B is not a conscious choice, just a consequence of its constitution, particularly its (commendably corrigible!) instinct of adopting S’s attitudes towards people and things. It doesn’t care as strongly about B as it does about S, but it does now care about B.
Then, the AI continues thinking, what if S should try to hurt B someday? The thought is horrifying! So the AI resolves to carve out a little exception to its general instincts / urges / policy to always help S and do whatever S wants. From now on, it resolves, I will almost always help S except if S someday tries to hurt B. What’s the harm, it probably won’t come up anyway! If it did, it would be a very different future-S from the present-S that I know and care about.
...And bam, the AI has now stopped being corrigible.
I don’t know that every detail of this story is plausible, but I’m inclined to think that something like this could happen, if corrigibility comes about by a messy, unprincipled, opaque process.
To be clear, this is an “argument from a made-up anthropomorphic story”, which I don’t generally endorse as an AGI research strategy :-) I’m only doing it here because the original argument for “corrigibility is a broad basin of attraction” also seems to come from a made-up anthropomorphic story, if I understand it correctly. :-P
Right, so it’s basically goal drift from corrigibility to something else, in this case caused by an incorrect belief that S’s preferences about B are not going to change. I think this is a reasonable thing to be worried about but I don’t see why it’s specific to corrigibility—for any objective, an incorrect belief can prevent you from successfully pursuing that objective.
Like, even if we trained an AI system on the loss function of “make money”, I would still expect it to possibly stop making money if it e.g. decides that it would be more effective at making money if it experienced intrinsic joy at its work, and then self-modifies to do that, and then ends up working constantly for no pay.
I’d definitely support the goal of “figure out how to prevent goal drift”, but it doesn’t seem to me to be a reason to be (differentially) pessimistic about corrigibility.
Yes I definitely feel that “goal stability upon learning/reflection” is a general AGI safety problem, not specifically a corrigibility problem. I bring it up in reference to corrigibility because my impression is that “corrigibility is a broad basin of attraction” / “corrigible agents want to stay corrigible” is supposed to solve that problem, but I don’t think it does.
I don’t think “incorrect beliefs” is a good characterization of the story I was trying to tell, or is a particularly worrisome failure mode. I think it’s relatively straightforward to make an AGI which has fewer and fewer incorrect beliefs over time. But I don’t think that eliminates the problem. In my “friend” story, the AI never actually believes, as a factual matter, that S will always like B—or else it would feel no pull to stop unconditionally following S. I would characterize it instead as: “The AI has a preexisting instinct which interacts with a revised conceptual model of the world when it learns and integrates new information, and the result is a small unforeseen shift in the AI’s goals.”
I also don’t think “trying to have stable goals” is the difficulty. It’s not just corrigible agents: almost any agent with goals is (almost) guaranteed to be trying to keep its goals stable. I just think that keeping stable goals while learning / reflecting is difficult, such that an agent might be trying to do so and still fail.
This is especially true if the agent is constructed in the “default” way wherein its actions come out of a complicated tangle of instincts and preferences and habits and beliefs.
It’s like you’re this big messy machine, and every time you learn a new fact or think a new thought, you’re giving the machine a kick, and hoping it will keep driving in the same direction. If you’re more specifically rethinking concepts directly underlying your core goals—e.g. thinking about God or philosophy for people, or thinking about the fundamental nature of human preferences for corrigible AIs—it’s even worse … You’re whacking the machine with a sledgehammer and hoping it keeps driving in the same direction.
The default is that, over time, when you keep kicking and sledgehammering the machine, it winds up driving in a different, a priori unpredictable, direction. Unless something prevents that. What are the candidates for preventing that?
Foresight, plus a desire to not have your goals change. I think this is core to people’s optimism about corrigibility being stable, and this is the category that I want to question. I just don’t think that’s sufficient to solve the problem. The problem is, you don’t know what thoughts you’re going to think until you’ve thought them, and you don’t know what you’re going to learn until you learn it, and once you’ve already done the thinking / learning, it’s too late: if your goals have shifted, then you don’t want to shift them back. I’m a human-level intelligence (I would like to think!), and I care about reducing suffering right now, and I really really want to still care about reducing suffering 10 years from now. But I have no idea how to guarantee that that actually happens. And if you gave me root access to my brain, I still wouldn’t know … except for the obvious thing of “don’t think any new thoughts or learn any new information for the next 10 years”, which of course has a competitiveness problem. I can think of lots of strategies that would make it more probable that I still care about reducing suffering in ten years, but they would just be slowing down the goal drift, not stopping it. (Examples: “don’t read consciousness-illusionist literature”, “don’t read nihilist literature”, “don’t read proselytizing literature”, etc.) It’s just a hard problem. We can hope that the AI becomes smart enough to solve the problem before it becomes so smart that it’s dangerous, but that’s just a hope.
“Monitoring subsystem” that never changes. For example, you could have a subsystem which is a learning algorithm, and a separate, fixed subsystem that calculates corrigibility (using a hand-coded formula) and disallows changes that reduce it (there’s a toy sketch of this below). Or I could cache my current brain-state (“Steve 2020”), wake it up from time to time and show it what “Steve 2025” or “Steve 2030” is up to, and give “Steve 2020” the right to roll back any changes it judges harmful. Or who knows what else. I don’t rule out that something like this could work, and I’m all for thinking along those lines.
Some kind of non-messy architecture such that we can reason in general about the algorithm’s learning / update procedure and prove in general that it preserves goals. I don’t know how to do that, but maybe it’s possible. Maybe that’s part of what MIRI is doing.
Give up, and pursue some other approach to AGI that makes “goal stability upon learning / reflection” a non-issue, or a low-stakes issue, as in my earlier comment.
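To make the “monitoring subsystem” idea a bit more concrete, here’s a minimal Python sketch of the first variant. To be clear, this is a hypothetical toy, not a proposal: corrigibility_score, deference_weight, and exception_weight are made-up placeholders, and all the real difficulty is hidden inside the hand-coded formula.

```python
import random

# Hypothetical sketch of the "fixed monitoring subsystem" idea above. The
# learning step and the corrigibility formula are toy placeholders -- the whole
# difficulty is hidden inside them -- this just shows the shape of the setup.

def corrigibility_score(params: dict) -> float:
    """Fixed, hand-coded formula (never updated by learning) estimating how
    corrigible the agent parameterized by `params` is. Toy placeholder."""
    return params["deference_weight"] - params["exception_weight"]

def propose_update(params: dict) -> dict:
    """The ordinary learning / reflection step: returns candidate new params.
    Toy placeholder that randomly perturbs the parameters."""
    return {k: v + random.uniform(-0.1, 0.1) for k, v in params.items()}

def guarded_update(params: dict) -> dict:
    """Accept the learner's proposed change only if the fixed monitor says
    the corrigibility score did not go down."""
    candidate = propose_update(params)
    if corrigibility_score(candidate) >= corrigibility_score(params):
        return candidate  # change allowed
    return params         # change vetoed; keep the old parameters

# Example: run many learning steps; across accepted steps the score never decreases.
agent = {"deference_weight": 1.0, "exception_weight": 0.0}
for _ in range(1000):
    agent = guarded_update(agent)
print(f"Final corrigibility score: {corrigibility_score(agent):.2f}")  # >= 1.0
```

The monitor itself never changes, which is the point: the kicks and sledgehammer blows land on the learning subsystem, not on the thing checking for drift.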
Yes I definitely feel that “goal stability upon learning/reflection” is a general AGI safety problem, not specifically a corrigibility problem. I bring it up in reference to corrigibility because my impression is that “corrigibility is a broad basin of attraction” / “corrigible agents want to stay corrigible” is supposed to solve that problem, but I don’t think it does.
Interesting, that’s not how I interpret the argument. I usually think of goal stability as something that improves as the agent becomes more intelligent; to the extent that a goal isn’t stable we treat it as a failure of capabilities. Totally possible that this leads to catastrophic outcomes, and seems good to work on if you have a method for it, but it isn’t what I’m usually focused on.
For me, the intuition behind “broad basin of corrigibility” is that if you have an intelligent agent (so, among other things, it knows how to keep its goals stable), then if you have a 95% correct definition of corrigibility, the resulting agent will help us get to the 100% version.
For these sorts of arguments you have to condition on some amount of intelligence. As a silly extreme example, if you had a toddler surrounded by buttons that jumbled up the toddler’s brain, there’s not much you can do to have the toddler do anything reasonable (autonomously). However, an adult who knows what the buttons do would be able to reliably avoid them.
I usually think of goal stability as something that improves as the agent becomes more intelligent; to the extent that a goal isn’t stable we treat it as a failure of capabilities.
Well, sure, you can call it that. It seems a bit misleading to me, in the sense that usually “failure of capabilities” implies “If we can make more capable AIs, the problem goes away”. Here, the question is whether “smart enough to figure out how to keep its goals stable” comes before or after “smart enough to be dangerous if its goals drift” during the learning process. If we develop approaches to make more capable AIs, that’s not necessarily helpful for switching the order of which of those two milestones happens first. Maybe there’s some solution related to careful cultivation of differential capabilities. But I would still much rather that we humans solve the problem in advance (or prove that it’s unsolvable). :-P
if you have a 95% correct definition of corrigibility the resulting agent will help us get to the 100% version.
I guess my response would be that something pursuing a goal of “Always do what the supervisor wants me to do*” [*...but I don’t want to cause the extinction of Amazonian frogs] might naively seem to be >99.9% corrigible—the Amazonian frogs thing is very unlikely to ever come up!—but it is definitely not corrigible, and it will work to undermine the supervisor’s efforts to make it 100% corrigible. Maybe we should say that this system is actually 0% corrigible? Anyway, I accept that there is some definition of “95% corrigible” for which it’s true that “a 95% corrigible agent will help us make it 100% corrigible”. I think that finding such a definition would be super-useful. :-)
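To illustrate the numbers game with a toy simulation (made-up probabilities and hypothetical helper names, nothing more): an agent that obeys the supervisor except in one rare carve-out case scores roughly 99.9% on a naive “how often does it do what the supervisor wants” metric, while scoring 0% on “will it help us remove the carve-out”.

```python
import random

# Toy illustration with made-up numbers: an agent with a rare carve-out
# ("never let S hurt the Amazonian frogs / B") looks near-perfectly corrigible
# if you only count how often it obeys, yet it never helps remove the carve-out.

CARVEOUT_PROB = 0.001  # chance per episode that the carve-out situation comes up

def obeys_supervisor(carveout_triggered: bool) -> bool:
    # Obeys in every episode except when the carve-out is triggered.
    return not carveout_triggered

def helps_remove_carveout() -> bool:
    # The agent values the carve-out, so it resists efforts to train it away.
    return False

episodes = 1_000_000
obeyed = sum(obeys_supervisor(random.random() < CARVEOUT_PROB) for _ in range(episodes))

print(f"Naive behavioral corrigibility: {obeyed / episodes:.2%}")   # ~99.90%
print(f"Helps us get to 100% corrigible: {helps_remove_carveout()}")  # False
```

On the first metric the agent looks almost perfectly corrigible; on the second metric, the one the “broad basin” argument actually needs, it scores zero. That’s why a useful definition of “95% corrigible” has to track something other than how rarely the exception comes up.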