Yes I definitely feel that “goal stability upon learning/reflection” is a general AGI safety problem, not specifically a corrigibility problem. I bring it up in reference to corrigibility because my impression is that “corrigibility is a broad basin of attraction” / “corrigible agents want to stay corrigible” is supposed to solve that problem, but I don’t think it does.
Interesting, that’s not how I interpret the argument. I usually think of goal stability as something that improves as the agent becomes more intelligent; to the extent that a goal isn’t stable, we treat it as a failure of capabilities. Totally possible that this leads to catastrophic outcomes, and it seems good to work on if you have a method for it, but it isn’t what I’m usually focused on.
For me, the intuition behind “broad basin of corrigibility” is that if you have an intelligent agent (so, among other things, it knows how to keep its goals stable), then given a 95% correct definition of corrigibility, the resulting agent will help us get to the 100% version.
For these sorts of arguments you have to condition on some amount of intelligence. As a silly extreme example, if you had a toddler surrounded by buttons that jumbled up the toddler’s brain, there’s not much you can do to have the toddler do anything reasonable (autonomously). However, an adult who knows what the buttons do would be able to reliably avoid them.
I usually think of goal stability as something that improves as the agent becomes more intelligent; to the extent that a goal isn’t stable we treat it as a failure of capabilities.
Well, sure, you can call it that. It seems a bit misleading to me, in the sense that usually “failure of capabilities” implies “If we can make more capable AIs, the problem goes away”. Here, the question is whether “smart enough to figure out how to keep its goals stable” comes before or after “smart enough to be dangerous if its goals drift” during the learning process. If we develop approaches to make more capable AIs, that’s not necessarily helpful for switching the order of which of those two milestones happens first. Maybe there’s some solution related to careful cultivation of differential capabilities. But I would still much rather that we humans solve the problem in advance (or prove that it’s unsolvable). :-P
if you have a 95% correct definition of corrigibility the resulting agent will help us get to the 100% version.
I guess my response would be that something pursuing a goal of “Always do what the supervisor wants me to do*” [*...but I don’t want to cause the extinction of Amazonian frogs] might naively seem to be >99.9% corrigible—the Amazonian frogs thing is very unlikely to ever come up!—but it is definitely not corrigible, and it will work to undermine the supervisor’s efforts to make it 100% corrigible. Maybe we should say that this system is actually 0% corrigible? Anyway, I accept that there is some definition of “95% corrigible” for which it’s true that “a 95% corrigible agent will help us make it 100% corrigible”. I think that finding such a definition would be super-useful. :-)