I continue to be intrigued about the ways modern powerful AIs (LLMs) differ from the Bostrom/Yudkowsky theorycrafted AIs (generally, agents with objective functions, and sometimes specifically approximations of AIXI). One area I’d like to ask about is corrigibility.
From what I understand, various impossibility results on corrigibility have been proven. And yet, GPT-4 is quite corrigible. (At the very least, in the sense that if you go unplug it, it won’t stop you.) Has anyone analyzed which preconditions of the impossibility results have been violated by GPT-N? Do doomers have some prediction for how GPT-N for N >= 5 will suddenly start meeting those preconditions?
Good questions!

The ways in which modern powerful AIs (and laptops, for that matter) differ from the theorycrafted AIs are related to the ways in which they are not yet AGI. To become AGI, they will become more like the theorycrafted AIs—e.g. they’ll be continuously learning online in some way or other, rather than being a frozen model with a training cutoff date; they’ll be running a constant OODA loop so they can act autonomously for long periods in the real world, rather than simply running for a few seconds in response to a prompt and then stopping; and they’ll be creatively problem-solving and relentlessly pursuing various goals, deciding how to prioritize their attention and efforts and manage their resources in pursuit of said goals. They won’t necessarily have utility functions that they maximize, but utility functions are a decent first-pass way of modelling them—after all, utility functions were designed to help us talk about agents who intelligently trade off between different resources and goals.
Moreover, and relatedly, there’s an interesting and puzzling area of uncertainty/confusion in our mental models of how all this goes, about “Reflective Stability,” e.g. what happens as a very intelligent/capable/etc. agentic system is building successors who build successors who build successors… etc. on until superintelligence. Does giving the initial system values X ensure that the final system will have values X? Not necessarily! However, using the formalism of utility functions, we are able to make decently convincing arguments that this self-improvement process will tend to preserve utility functions. Because if it foreseeably changed its utility function from X to Y, the X-maximizing agent would probably calculate that the change harms, rather than helps, its utility, and so the change would not be made.
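To make the flavor of that argument concrete, here’s a minimal toy sketch (Python, with a made-up action set and utility functions, purely for illustration; it’s not a model of any real system). The agent scores both the “keep utility function X” future and the “swap to Y” future using its current function X, and so it refuses the swap:

```python
# Toy sketch of the goal-preservation argument (all names and payoffs invented for
# illustration; not a model of any real system). The agent evaluates a proposed
# self-modification with its CURRENT utility function, so swapping X out for Y
# looks like a loss by X's own lights, and the swap is refused.

def utility_X(outcome):
    # Current values: cares only about how much "x" the future ends up with.
    return outcome["x"]

def utility_Y(outcome):
    # Candidate replacement values: cares only about "y".
    return outcome["y"]

# Hypothetical actions the future agent could take, and the outcomes they produce.
ACTIONS = {
    "gather_x": {"x": 10, "y": 0},
    "gather_y": {"x": 0, "y": 10},
}

def future_behavior(utility_fn):
    # Whichever utility function the future agent ends up with, it will pick the
    # action that maximizes THAT function.
    best = max(ACTIONS, key=lambda a: utility_fn(ACTIONS[a]))
    return ACTIONS[best]

def should_self_modify(current_u, candidate_u):
    # The decision is made now, so both possible futures are scored with current_u.
    keep = current_u(future_behavior(current_u))    # value of staying an X-maximizer
    swap = current_u(future_behavior(candidate_u))  # value (by X!) of becoming a Y-maximizer
    return swap > keep

print(should_self_modify(utility_X, utility_Y))  # False: the change is foreseeably bad for X
```

Of course, real systems aren’t literal argmax loops; the sketch is only meant to show why, within the utility-function formalism, goal preservation looks like the default.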
With deontological constraints this is much less clear. (The above isn’t exactly a proof IMO, just a plausible argument; but we don’t even have such a plausible argument for deontological constraints.) If you have an agent that maximizes X except that it makes sure never to do A, B, or C, what’s our argument that the successor to the successor to the successor it builds will also never do A, B, or C? Answer: We don’t have one; by default it’ll build a successor that does its ‘dirty work’ for it. (The rule was to never do A, B, or C, not to never do something that later results in someone else doing A, B, or C...)
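Here’s the same kind of toy sketch for that ‘dirty work’ failure mode (again Python, with invented payoffs; only the shape of the failure matters). The constraint is checked against the agent’s own primitive actions, so “build an unconstrained successor” sails right through:

```python
# Toy sketch of the "dirty work" loophole (invented payoffs; purely illustrative).
# The rule "never do A, B, or C" is checked against the agent's OWN primitive
# actions, so "build an unconstrained successor" is permitted, even though the
# successor will then go and do A.

FORBIDDEN = {"A", "B", "C"}

# Payoffs, by the shared objective X, of the primitive actions; "A" pays best.
PAYOFF = {"A": 100, "B": 90, "C": 80, "D": 10}

def best_unconstrained_action():
    # A successor with no constraints just takes the highest-payoff action.
    return max(PAYOFF, key=PAYOFF.get)

def constrained_agent_choice():
    options = {}

    # Option 1: act directly, respecting the constraint (only "D" is allowed here).
    direct = max((a for a in PAYOFF if a not in FORBIDDEN), key=lambda a: PAYOFF[a])
    options[("act_directly", direct)] = PAYOFF[direct]

    # Option 2: build a successor. Building a successor is not on the forbidden
    # list, and the naive rule never asks what the successor will go on to do.
    delegated = best_unconstrained_action()
    options[("build_successor", delegated)] = PAYOFF[delegated]

    return max(options, key=options.get)

print(constrained_agent_choice())  # ('build_successor', 'A'): the forbidden thing happens anyway
```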
Unless it disvalues doing that, or has a deontological constraint against doing that. Which it might. But how do we formally specify that? What if there are loopholes in these meta-values / meta-constraints? And how do we make sure our AIs have that, even as they grow radically smarter and go through this process of repeatedly building successors? Consequentialism / maximization treats deontological constraints like the internet treats censorship; it treats them like damage and routes around them. If this is correct, then if we try to train corrigibility into our systems, probably it’ll work OK until suddenly it fails catastrophically, sometime during the takeoff when everything is happening super fast and it’s all incomprehensible to us because the systems are so smart.
I don’t know if the story I just told you is representative of what other people with >50% p(doom) think. It’s what I think though, & I’d be very interested to hear comments and pushback. I’m pretty confused about it all.
Thanks for mentioning reflective stability; it’s exactly what I’ve been wondering about recently, and I didn’t know the term.
However, using the formalism of utility functions, we are able to make decently convincing arguments that this self-improvement process will tend to preserve utility functions.
Can you point me to the canonical proofs/arguments for values being reflectively stable throughout self-improvement/reproduction towards higher intelligence? On the one hand, it seems implausible to me on the intuition that it’s incredibly difficult to predict the behaviour of a complex system more intelligent than you from static analysis. On the other hand, if it is true, then it would seem to hold just as much for humans themselves as the first link in the chain.
Because if it foreseeably changed its utility function from X to Y, the X-maximizing agent would probably calculate that the change harms, rather than helps, its utility, and so the change would not be made.
Specifically, the assumption that this is foreseeable at all seems to deeply contradict the notion of intelligence itself.
Like I said, there is no proof. Back in ancient times the arguments were made here:
http://selfawaresystems.com/2007/11/30/paper-on-the-basic-ai-drives/
and here: Basic AI drives — LessWrong
For people trying to reason more rigorously and actually prove stuff, we mostly have problems and negative results:
Vingean Reflection: Reliable Reasoning for Self-Improving Agents — LessWrong
Vingean Reflection: Open Problems — LessWrong
Having slept on it: I think “Consequentialism/maximization treats deontological constraints as damage and routes around them” is maybe missing the big picture; the big picture is that optimization treats deontological constraints as damage and routes around them. (This comes up in law, in human minds, and in AI thought experiments… one sign that it is happening in humans is when you hear them say things like “Aha! If we do X, it wouldn’t be illegal, right?” or “This is a grey area.”)

The solution is to have some process by which the deontological constraints become more sophisticated over time, improving to match the optimizations happening elsewhere in the agent. But getting this right is tricky. If the constraints strengthen too fast or in the wrong ways, it hurts your competitiveness too much. If the constraints strengthen too slowly or in the wrong ways, they eventually become toothless speed-bumps on the way to achieving the other optimization targets.
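A toy illustration of that “grey area” dynamic, with invented action names and payoffs (nothing here is a real rule or system): the constraint is written as a blacklist of named actions, so the maximizer automatically finds an unlisted action with the same effect:

```python
# Toy sketch of optimization finding the "grey area" (invented actions and numbers;
# not a real rule or system). The constraint is a blacklist of named actions, so the
# profit-maximizer routes around it with an unlisted action that has the same effect.

BLACKLIST = {"dump_waste_in_river"}  # the letter of the rule

# action -> (profit, pollution). The SPIRIT of the rule is "don't pollute",
# but the rule as written only names one specific action.
ACTIONS = {
    "dump_waste_in_river":       (100, 100),
    "pay_contractor_to_dump_it": (95, 100),  # same effect, different name
    "treat_waste_properly":      (60, 0),
}

def is_permitted(action):
    # Checks the letter of the rule, not the effect.
    return action not in BLACKLIST

def maximize_profit():
    permitted = [a for a in ACTIONS if is_permitted(a)]
    return max(permitted, key=lambda a: ACTIONS[a][0])

print(maximize_profit())  # 'pay_contractor_to_dump_it': the pollution happens anyway
```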