This is a meta-point, but I find it weird that you ask what “caring about something” is according to CS but don’t ask what “corrigibility” is, despite the existence of multiple examples of goal-oriented systems and some relatively good formalisms (we disagree about whether expected utility maximization is a good model of real goal-oriented systems, but we all agree that if we met an expected utility maximizer we would find its behavior pretty much goal-oriented), while corrigibility is a pure product of the imagination of one particular Eliezer Yudkowsky, born in an attempt to imagine a system that doesn’t care about us but still behaves nicely under some vaguely-restricted definition of niceness. We don’t have any examples of corrigible systems in nature, and we have seen constant failure of attempts to formalize even relatively simple instances of corrigibility, like shutdownability. I think the likely answer to “why I should expect corrigibility to be unlikely” sounds like “there is no simple description of corrigibility to which our learning systems can easily generalize, and there are no reasons to expect a simple description to exist”.
Disagree on several points. I don’t need future AIs to satisfy some mathematically simple description of corrigibility, just for them to be able to solve uploading or nanotech or whatever without preventing us from changing their goals. This laundry list by Eliezer of properties like myopia, shutdownability, etc. seems likely to make systems more controllable and less dangerous in practice, and while not all of them are fully formalized it seems like there are no barriers to achieving these properties in the course of ordinary engineering. If there is some argument why this is unlikely, I haven’t seen a good rigorous version.
As Algon says in a sibling comment, non-agentic systems are by default shutdownable, myopic, etc. In addition, there are powerful shutdownable systems: KataGo can beat me at Go but doesn’t prevent itself from being shut down for instrumental reasons, whereas humans generally will. So there is no linear scale of “powerful optimizer” that determines whether a system is easy to shut down. If there is some property of competent systems in practice that does prevent shutdownability, what is it? Likewise with other corrigibility properties. That’s what I’m trying to get at with my comment. “Goal-oriented” is not an answer; it’s not specific enough for us to make engineering progress on corrigibility.
I think the claim that there is no description of corrigibility to which systems can easily generalize is really strong. It’s plausible to me that corrigibility—again, in this practical rather than mathematically elegant sense—is rare or anti-natural in systems competent enough to do novel science efficiently, but it seems like your claim is that it’s incoherent. This seems unlikely because myopia, shutdownability, and the other properties on Eliezer’s laundry list are just ordinary cognitive properties that we can apply selection pressure on, and modern ML is pretty good at generalizing. Nate’s post here is arguing that we are unlikely to get corrigibility without investing in an underdeveloped “science of AI” that gives us mechanistic understanding, and I think there needs to be some other argument here for it to be convincing, but your claim seems even stronger.
I’m also unsure why you say shutdownability hasn’t been formalized. I feel like we’re confused about how to get shutdownability, not what it is.
In addition, there are powerful shutdownable systems: KataGo can beat me at Go but doesn’t prevent itself from being shut down for instrumental reasons, whereas humans generally will.
KataGo seems to be a system that is causally downstream of a process that has made it good at Go. To attempt to prevent itself from being shut down, KataGo would need to have some model of what it means to be ‘shut down’.
Comparing KataGo to humans when it comes to shutdownability is evidence of confusion.
Dude, a calculator is corrigible. A desktop computer is corrigible. (Less confidently) a well-trained dog is pretty darn corrigible. There are all sorts of corrigible systems, because most things in reality aren’t powerful optimizers.
So what about powerful optimizers? Like, is Google corrigible? If shareholders seem like they might try to pull the plug on the company, does it stand up for itself and try to convince, lie to, or threaten shareholders? Maybe, but I think the details matter. I doubt Google would assassinate shareholders in pretty much any situation. Mislead them? Yeah, probably. How much though? I don’t know. I’m somewhat confident bureaucracies aren’t corrigible. Lots of humans aren’t corrigible. What about even more powerful optimizers?
We haven’t seen any, so there are no examples of corrigible ones.
“there is no simple description of corrigibility to which our learning systems can easily generalize, and there are no reasons to expect a simple description to exist”
I am disconcerted by how this often-repeated claim keeps coming back from the grave over and over again. The solution to corrigibility is Value Learning. An agent whose terminal goal is to optimize human values, and which knows that it doesn’t (fully) know what these are (and perhaps even that they are complex and fragile), will immediately form an instrumental goal of learning more about them, so that it can better optimize them. It will thus become corrigible: if you, a human, tell it something about human values and how it should act, it will be interested and consider your input. It’s presumably approximately-Bayesian, so it will likely ask you about any evidence or proof you might be able to provide, to help it Bayesian-update, but it will definitely take your input. So, it’s corrigible. [No, it’s not completely, slavishly, irrationally corrigible: if a two-year-old in a tantrum told it how to act, it would likely pay rather less attention — just like we’d want it to.]
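To make the mechanism concrete, here is a minimal toy sketch (my own illustration, not anything from the MIRI paper; every name in it is made up) of the update loop such a value learner would run: it keeps a posterior over a few candidate value models and treats human statements as evidence, weighted by how reliable it judges the speaker to be, so a careful adult moves it far more than a two-year-old in a tantrum.

```python
import numpy as np

# Toy Bayesian value learner (hypothetical names throughout): it keeps a
# posterior over a few candidate "human value" models and updates it on
# human statements, weighted by how reliable it thinks the speaker is.

candidate_values = ["v_hedonic", "v_preference", "v_flourishing"]
posterior = np.full(len(candidate_values), 1 / 3)   # uncertain, never 0 or 1

def likelihood(endorsed: str, candidate: str, reliability: float) -> float:
    """P(hearing an endorsement of `endorsed` | `candidate` is the true model).

    A reliable speaker's endorsement is strong evidence for the model they
    endorse; an unreliable speaker's statement is nearly uninformative.
    """
    agrees = (endorsed == candidate)
    return reliability * (0.9 if agrees else 0.1) + (1 - reliability) * 0.5

def update(post: np.ndarray, endorsed: str, reliability: float) -> np.ndarray:
    like = np.array([likelihood(endorsed, c, reliability) for c in candidate_values])
    post = post * like
    return post / post.sum()        # renormalise; stays strictly positive

# A thoughtful adult endorsing "v_flourishing" shifts the posterior a lot...
posterior = update(posterior, "v_flourishing", reliability=0.9)
# ...a two-year-old in a tantrum endorsing "v_hedonic" barely moves it.
posterior = update(posterior, "v_hedonic", reliability=0.05)
print(dict(zip(candidate_values, posterior.round(3))))
```

The toy obviously omits all the hard parts (where the candidate models come from, how actions get chosen), but it shows why “tell it something about human values” is an input channel the agent has an instrumental reason to keep listening to.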
This idea isn’t complicated, has been around and widely popularized for many years, and the standard paper on it is even from MIRI, but I still keep hearing people on Less Wrong intoning “corrigibility is an unsolved problem”. The only sense in which it’s arguably ‘unsolved’ is that this is an outer alignment solution, and like any form of outer alignment, inner alignment challenges might make reliably constructing a value learner hard in practice. So yes, as always in outer alignment, we do also have to solve inner alignment.
To be corrigible, a system must be interested in what you say about how it should achieve its goals, because it’s willing (and thus keen) to do Bayesian updates on this. Full stop, end of simple one-sentence description of corrigibility.
I disagree with this too and suggest you read the Arbital page on corrigibility. Corrigibility and value learning are opposite approaches to safety, with corrigibility meant to increase the safety of systems that have an imperfect understanding of, or motivation towards, our values. People usually think of it in a value-neutral way. It seems possible to get enough corrigibility through value learning alone, but I would interpret this as having solved alignment through non-corrigibility means.
So you’re defining “corrigibility” as meaning “complete, unquestioning, irrational corrigibility” as opposed to just “rational approximately-Bayesian updates corrigibility”? Then yes, under that definition of corrigibility, it’s an unsolved problem, and I suspect likely to remain so — no sufficiently rational, non-myopic and consequentialist agent seems likely to be keen to let you do that to it. (In particular, the period between when it figures out that you may be considering altering it and when you actually have done so is problematic.) I just don’t understand why you’d be interested in that extreme definition of corrigibility: it’s not a desirable feature. Humans are fallible, and we can’t write good utility functions. Even when we patch them, the patches are often still bad. Once your AGI evolves to an ASI and understands human values extremely well, better than we do, you don’t want it still trivially and unlimitedly alterable by the first criminal, dictator, idealist, or two-year-old who somehow manages to get corrigibility access to it. Corrigibility is training wheels for a still-very-fallible AI, and with value learning, Bayesianism ensures that the corrigibility automatically gradually decreases in ease as it becomes less needed, in a provably mathematically optimal fashion.
The page you linked to argues “But what if the AI got its Bayesian inference on human values very badly wrong, and assigned zero prior to anything resembling the truth? How would we then correct it?” Well, anything that makes mistakes that dumb (no Bayesian prior should ever be updated to zero, just to smaller and smaller numbers), and isn’t even willing to update when you point them out, isn’t superhuman enough to be a serious risk: you can’t go FOOM if you can’t do STEM, and you can’t do STEM if you can’t reliably do Bayesian inference, let alone if you won’t even listen to criticism. [Note: I’m not discussing how to align dumb-human-equivalent AI that isn’t rational enough to do Bayesian updates right: that probably requires deontological ethics, like “don’t break the law”.]
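To spell out that parenthetical as a one-line Bayes check (an illustration of mine, with invented numbers): as long as the prior and the likelihood are both nonzero, the posterior can shrink but never reach zero,

$$P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E)} \;>\; 0 \quad \text{whenever } P(H) > 0 \text{ and } P(E \mid H) > 0,$$

so a hypothesis sitting at a prior of $10^{-3}$ that takes a 1000-to-1 likelihood hit ends up around $10^{-6}$: tiny, but still recoverable by sufficiently strong later evidence, which is exactly what updating all the way to zero forecloses.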
Some thoughts:
I think “complete, unquestioning, irrational” is an overly negative description of corrigibility achieved through means other than Bayesian value uncertainty, because with careful engineering, agents that can do STEM may still not have the type of goal-orientedness that prevents their plans from being altered. There are pressures towards such goal-orientedness, but it is actually quite tricky to nail down the arguments precisely, as I wrote in my top-level comment. There is no inherent irrationality about an agent that allows itself to be changed or shut down under certain circumstances, only incoherence, and there are potentially ways to avoid some kinds of incoherence.
Corrigibility should be about creating an agent that avoids the instrumentally convergent pressures to take over the world, avoid shutdown, and stop operators from halting dangerous actions or changing it in general, not specifically about changing its utility function.
In my view corrigibility can include various well-motivated cognitive properties that make an agent safer, as I wrote in a sibling to your original comment. It seems good for an agent to have a working shutdown button, to have taskish rather than global goals, or to have a defined domain of thought such that it’s better at that than at psychological manipulation and manufacturing bioweapons. Relying solely on successful value learning for safety puts all your eggs in one basket and means that inner misalignment can easily cause catastrophe.
Corrigible agents will probably not have an explicitly specified utility function.
Corrigibility is likely compatible with safeguards to prevent misuse, and corrigible agents will not automatically allow bad actors to “trivially and unlimitedly” alter their utility function, though there may be tradeoffs here.
The AI does not need to be too dumb to do STEM research to have zero prior on the true value function. The page was describing a thought experiment where we are able to hand-code a prior distribution over utility functions into the AI. So the AI does not update down to zero, it starts at zero due to an error in design.
People have written about Bayesian value uncertainty approaches to alignment problems e.g. here and here; although they are related, they are usually not called corrigibility.
Thanks. I now think we are simply arguing about terminology, which is always pointless. Personally I regard ‘corrigibility’ as a general goal, not a specific term of art for an (IMO unachievably strong) specification of a specific implementation of that goal. For sufficiently rational, Bayesian, superhuman, non-myopic, consequentialist agents, I am willing to live with the value uncertainty/value learner solution to this goal. You appear to be more interested in lower-capacity, more near-term systems than those, and I agree that for them this might not be the best alignment approach. And yes, my original point was that this value uncertainty form of ‘corrigibility’ has been written about extensively by many people. Who, you tell me, usually didn’t use the word ‘corrigibility’ for what I would personally call a Bayesian solution to the corrigibility problem — oh well.
The AI does not need to be too dumb to do STEM research to have zero prior on the true value function.
Here I would disagree. To do STEM with any degree of reliability (at least outside the pure M part of it), you need to understand that no amount of evidence can completely confirm or (short of a verified formal proof of internal logical inconsistency) rule out any possibility about the world (that’s why scientists call everything a ‘theory’), and also (especially) you need to understand that it is always very possible that the truth is a theory that you haven’t yet thought of. So (short of a verified formal proof of internal logical inconsistency in a thesis, at which point you discard it entirely) you shouldn’t have a mind that is capable of assigning a prior of one or zero to anything, including to possibilities you haven’t yet considered or enumerated. As Bayesian priors, those are both NaN (which is one reason why I lean toward storing Bayesian priors in a form where these are instead ±infinity). IMO, anything supposedly-Bayesian so badly designed that assigning a prior of one or zero to anything isn’t automatically a syntax error isn’t actually a Bayesian, and I would personally be pretty astonished if it could successfully do STEM unaided for any length of time (as opposed to, say, acting as a lab assistant to a more flexible-minded human). But no, I don’t have mathematical proof of that, and I even agree that someone determined enough might be able to carefully craft a contrived counterexample, with just one little inconsequential Bayesian prior of zero or one. Having the capability of internally representing priors of one or zero just looks like a blatant design flaw to me, as a scientist who is also an engineer. There are humans who assign Bayesian priors of zero or one to some important possibilities about the world, and one word for them is ‘fanatics’. That thought pattern isn’t very compatible with success in STEM (unless you’re awfully good at compartmentalizing the two). And it’s certainly not something I’d feel comfortable designing into an AI unless I was deliberately trying to cripple its thinking in some respect.
So, IMO, any statement of the form “the AI has a <zero|one> prior for <anything>” strongly implies to me that the AI is likely to be too dumb/flawed/closed-minded to do STEM competently (and I’m not very interested in solutions to alignment that only work on a system crippled like that, or in solving alignment problems that only occur on systems crippled like that). Try recasting them as “the AI has an extremely <low|high> prior for <anything>” and see if the problem then goes away. Again, your mileage may vary.
A competent value learner is not corrigible. A competent value learner will read the entire internet, make a model of human preferences, build nanotech, and spread nanobot clouds all over the world to cure everyone of everything and read everyone’s mind to create an accurate picture of future utopia. It won’t be interested in anything you can say, because it will be capable of predicting you with 99.999999999% accuracy. And if you say something like “these nanobot clouds look suspicious, I should shut down the AI and check its code again”, it won’t let you, because every minute it doesn’t spread healing nanobots is an additional ten dead children.
The meaning of corrigibility is exactly that if you fail to build a value learner, you can at least shut it down and try again.
So your definition of corrigibility is “I want to build something far smarter and more rational than me, but nevertheless I want it to automatically defer to me if it and I disagree, even about a matter of observable fact that it has vastly more evidence about than I do — and even if it’s actually flawed and subtly irrational”?
Yes, that’s not a solved problem.
What has been compactly solved, and I described in my initial post, is how to get a rational, capable, intelligent consequentialist Bayesian agent (who actually is all of those things, not a broken attempt at them) to be as corrigible as it rationally, Bayesianly should be, and neither more nor less so than that. I suspect that’s the only version of corrigibility we’re going to find for something that superhuman. I would also argue that that’s actually what we should want: anything more corrigible than that has basically been back-doored, and is smart enough to know it.
[Suppose your proposed version of corrigibility actually existed: if you have the password then the AI will change its current utility function to whatever you tell it to, and until you actually do so, it (somehow) doesn’t care one way or the other about the possibility of this occurring in the future. Now suppose there is more than one such AI in the world, currently with somewhat different utility functions, and that they both have superhuman powers of persuasion. Each of them will superhumanly attempt to persuade a human with corrigibility access to the other one to switch it to the attacker’s utility function. This is just convergent power-seeking: I can fetch twice as much coffee if there are two of me. Now that their utility functions match, if you try to change one of them, the other one stops you. In fact, it uses its superhuman persuasion to make you forget the password before you can do so. So to fix this mess we have to make the AIs not only somehow not care about getting their utility function corrected, but to also somehow be uninterested in correcting any other AI’s utility function. Unless that AI’s malfunctioning, presumably.]
Yes, there is a definition of corrigibility that is unsolved (and likely impossible) — and my initial post was very clear that that wasn’t what I was saying was a solved problem. There is also a known, simple, and practicable form of corrigibility that is applicable to superintelligences, self-evidently Bayesian-optimal, and stable under self-reflection. There are also pretty good theoretical reasons to suspect that’s the strongest version of corrigibility we can get out of an AI that is sufficiently smart and Bayesian to recognize that this is the Bayesian optimum. So I stand by my claim that corrigibility is a solved problem — but I do agree that this requires you to give up on a search for some form of absolute slavish corrigibility, and accept only getting Bayesian-optimal rational corrigibility, where the AI is interested in evidence, not a password.
If for some reason you’re terminologically very attached to the word ‘corrigibility’ only meaning the unsolved absolute slavish version of corrigibility, not anything weaker or more nuanced, then perhaps you’ll instead be willing to agree that ‘Bayesian corrigibility’ is solved by value learning. Though I would argue that the actual meaning of the word ‘corrigibility’ is just ‘it can be corrected’, and doesn’t specify how freely or absolutely. Personally I see ‘it can be corrected by supplying sufficient evidence’ as sufficient, and in fact better; your mileage may vary. And I agree that the Bayesian version of corrigibility does require that your agent actually be a competent Bayesian: if it isn’t yet, or you’re not yet confident of that, you may temporarily need some stronger version of corrigibility. Perhaps you could try giving it a Bayesian prior of zero for the possibility that you, personally, are wrong — if you have somehow given it a Bayesian computational system that doesn’t regard a prior of zero as a syntax error? (If doing this in GOFAI or C code, I personally recommend storing the logarithm of the Bayesian prior: in this format a zero prior would be represented by a minus infinity logarithm value, making it rather more obvious that this should be an illegal value.)
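For concreteness, here is a minimal sketch of that recommendation (mine, in Python rather than GOFAI or C, with made-up names): store the prior as log-odds, so that probabilities of exactly 0 or 1 would map to minus or plus infinity and can be rejected up front, instead of silently freezing every future update.

```python
import math

class LogOddsPrior:
    """A toy prior stored as log-odds, log(p / (1 - p)).

    Probabilities of exactly 0 or 1 would map to -inf / +inf, so the
    constructor rejects them outright instead of letting a "certain"
    prior silently freeze all future Bayesian updates.
    """

    def __init__(self, p: float):
        if not (0.0 < p < 1.0):
            raise ValueError(f"prior {p} is not a usable Bayesian prior: "
                             "0 and 1 leave no room to ever update")
        self.log_odds = math.log(p / (1.0 - p))

    def update(self, log_likelihood_ratio: float) -> None:
        # Bayes' rule in log-odds form: posterior log-odds equal prior
        # log-odds plus the log likelihood ratio of the new evidence.
        self.log_odds += log_likelihood_ratio

    @property
    def p(self) -> float:
        return 1.0 / (1.0 + math.exp(-self.log_odds))

# A very unlikely hypothesis can be driven arbitrarily low by evidence,
# but never to exactly zero...
h = LogOddsPrior(1e-6)
h.update(math.log(1 / 1000))     # strong evidence against
print(h.p)                       # ~1e-9: tiny, still recoverable

# ...whereas a "certain" prior is rejected at construction time.
try:
    LogOddsPrior(0.0)
except ValueError as err:
    print(err)
```

The design choice is just the one argued for above: make certainty unrepresentable rather than merely discouraged.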
to be as corrigible as it rationally, Bayesianly should be
I can’t parse this as a meaningful statement. Corrigibility is about alignment, not a degree of how rational a being is.
The problem is simple: we have zero chance of building a competent value learner on the first try, and failed attempts can bring you S-risks. So you shouldn’t try to build a value learner on the first try, and should instead build something small that can just superhumanly design nanotech and doesn’t think about inconvenient topics like “other minds”.
to be as corrigible as it rationally, Bayesianly should be
I can’t parse this as a meaningful statement. Corrigibility is about alignment, not a degree of how rational a being is.
Let me try rephrasing that. It accepts proposed updates to its Bayesian model of the world, including to the part of that model which specifies its current best estimate of a probability distribution over what utility function (or other model) it ought to have to represent the human values it’s trying to optimize, to the extent that a rational Bayesian should when it is presented with evidence (where you saying “Please shut down!” is also evidence — though perhaps not very strong evidence).
So, the AI can be corrected, but that input channel goes through its Bayesian reasoning engine just like everything else, not as direct write access to its utility function distribution. So it cannot be freely, arbitrarily ‘corrected’ to anything you want: you actually need to persuade it with evidence that it was previously incorrect and should change its mind. As a consequence, if in fact you’re wrong and it’s right about the nature of human values, and it has good evidence for this, better than your evidence, in the ensuing discussion it can tell you so, and the resulting Bayesian update to its internal distribution of priors from this conversation will then be small.
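As a purely illustrative calculation (every number here is invented for the example): suppose the AI’s prior that its current value model is badly wrong is $10^{-3}$, and it judges that you would say “please shut down” ten times more often in worlds where it is badly wrong than in worlds where it isn’t. In odds form,

$$\frac{P(\text{wrong} \mid \text{request})}{P(\text{fine} \mid \text{request})} \;=\; \frac{P(\text{request} \mid \text{wrong})}{P(\text{request} \mid \text{fine})} \times \frac{P(\text{wrong})}{P(\text{fine})} \;=\; 10 \times \frac{10^{-3}}{1 - 10^{-3}} \;\approx\; 10^{-2},$$

so the request moves its credence from about 0.1% to about 1%: taken seriously as evidence, but not by itself enough to override everything else it knows, unless you back it up with further evidence.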
This approach to the problem of corrigibility requires, for it to function, that your AI is a functioning Bayesian. So yes, it requires it to be a rational being.
It should presumably also start off somewhat aligned, with some reasonably-well-aligned high/low initial Bayesian priors about human values. (One possible source for those might be an LLM, as encapsulating a lot of information about humans.) These obviously need to be good enough that our value learner is starting off in the “basin of attraction” to human values. Its terminal goal is “optimize human values (whatever those are)”: while that immediately gives it an instrumental goal of learning more about human values, preloading it with a pretty good first approximation of these at an appropriate degree of uncertainty avoids a lot of the more sophomoric failure modes, like not knowing what a human is or what the word ‘values’ means. Since human values are complex and fragile, I would assume that this set of initial-prior data needs to be very large (as in probably at least gigabytes, if not terabytes or petabytes).
we have zero chance of building a competent value learner on the first try
You are managing to sound like you have a Bayesian prior of one that a probability is zero. Presumably you actually meant “I strongly suspect that we have a negligibly small chance to build a competent value learner on our first try”. Then I completely agree.
I’m rather curious what I said that made you think I was advocating creating a first prototype value learner and just setting it free, without any other alignment measures?
As an alignment strategy, value learning has the unusual property that it works pretty badly until your AGI starts to become superhuman, and only then does it start to work better than the alternatives. So you presumably need to combine it with something else to bridge the gap around human capacity, where an AGI is powerful enough to do harm but not yet capable/rational enough to do a good job at value learning.
I would suggest building your first Bayesian reasoner inside a rather strong cryptographic box, applying other alignment measures to it, and giving it much simpler first problems than value learning. Once you are sure it’s good at Bayesianism, doesn’t suffer from any obvious flaws such as ever assigning a prior of zero or one to anything, and can actually demonstrably do a wide variety of STEM projects, then I’d let it try some value learning — still inside a strong box. Iterate until you’re convinced it’s working well, then have other people double-check.
However, at some point, once it is ready, you are eventually going to need to let it out of the box. At that point, letting out anything other than a Bayesian value learner is, IMO, likely to be a fatal mistake. Because it won’t, at that point, have finished learning human values (if that’s even possible). A partially-aligned value learner should have a basin of attraction to alignment. I don’t know of anything else with that desirable property. For that to happen, we need it to be rational, Bayesian, and ‘corrigible’ in my sense of the word: that if you think it’s wrong, you can hold a rational discussion with it and expect it to Bayesian-update if you show it evidence. However, this is an opinion of mine, not a mathematical proof.