So you’re defining “corrigibility” as meaning “complete, unquestioning, irrational corrigibility”, as opposed to just “rational, approximately-Bayesian-updates corrigibility”? Then yes, under that definition, it’s an unsolved problem, and I suspect likely to remain so — no sufficiently rational, non-myopic, consequentialist agent seems likely to be keen to let you do that to it. (In particular, the period between when it figures out that you may be considering altering it and when you actually have done so is problematic.) I just don’t understand why you’d be interested in that extreme definition of corrigibility: it’s not a desirable feature. Humans are fallible, and we can’t write good utility functions. Even when we patch them, the patches are often still bad. Once your AGI evolves into an ASI and understands human values extremely well, better than we do, you don’t want it still trivially and unlimitedly alterable by the first criminal, dictator, idealist, or two-year-old who somehow manages to get corrigibility access to it. Corrigibility is training wheels for a still-very-fallible AI, and with value learning, Bayesianism ensures that the ease of correction automatically and gradually decreases as it becomes less needed, in a provably mathematically optimal fashion.
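To make that last claim concrete, here is a toy numerical sketch of the kind of mechanism I mean. Everything in it is invented purely for illustration (the candidate utility functions, the deference threshold, the Boltzmann-style approval model are all made up, not any published design): the agent proposes actions under a posterior over candidate utility functions, defers to human correction while the hypotheses still meaningfully disagree, and stops needing to defer as the posterior concentrates.

```python
import numpy as np

# Toy sketch only: three hypothetical candidate utility functions over three
# actions (rows = hypotheses about what the human values, columns = actions).
candidate_utilities = np.array([
    [ 1.0,  0.2, -1.0],   # hypothesis A
    [ 0.1,  1.0, -0.5],   # hypothesis B (the "true" one in this toy)
    [-0.8, -0.4,  1.0],   # hypothesis C
])
posterior = np.array([1/3, 1/3, 1/3])   # start maximally uncertain

def proposed_action(posterior):
    """Pick the action with the highest posterior-expected utility."""
    return int(np.argmax(posterior @ candidate_utilities))

def defers(posterior, action):
    """Accept human override while non-trivial posterior mass says the
    proposed action is actually bad (the 0.05 threshold is arbitrary)."""
    p_bad = posterior[candidate_utilities[:, action] < 0].sum()
    return p_bad > 0.05

def update(posterior, approved_action, temperature=1.0):
    """Bayesian update on seeing which action the human approved, using a
    Boltzmann-style approval likelihood under each hypothesis."""
    likelihood = np.exp(candidate_utilities[:, approved_action] / temperature)
    unnormalised = posterior * likelihood
    return unnormalised / unnormalised.sum()

for step in range(4):
    a = proposed_action(posterior)
    print(f"step {step}: proposes action {a}, defers={defers(posterior, a)}, "
          f"posterior={np.round(posterior, 3)}")
    posterior = update(posterior, approved_action=1)   # the human keeps approving action 1
```

The only point of the toy is that the willingness to be corrected is not bolted on: it falls out of the value uncertainty, and it shrinks only as that uncertainty shrinks.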
The page you linked to argues: “But what if the AI got its Bayesian inference on human values very badly wrong, and assigned zero prior to anything resembling the truth? How would we then correct it?” Well, anything that makes mistakes that dumb (no Bayesian prior should ever be updated to zero, just to smaller and smaller numbers), and isn’t even willing to update when you point them out, isn’t superhuman enough to be a serious risk: you can’t go FOOM if you can’t do STEM, and you can’t do STEM if you can’t reliably do Bayesian inference, let alone if you won’t even listen to criticism. [Note: I’m not discussing how to align dumb-human-equivalent AI that isn’t rational enough to do Bayesian updates right: that probably requires deontological ethics, like “don’t break the law”.]
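A minimal sketch of that point, using a two-hypothesis setup I made up for illustration: a sane Bayesian can drive a hypothesis’s probability down as far as the evidence warrants, but never to exactly zero, so later evidence can always pull it back up.

```python
def bayes_update(prior_h, likelihood_h, likelihood_not_h):
    """Posterior P(H | evidence) from prior P(H) and the likelihoods of the
    evidence under H and under not-H."""
    numerator = prior_h * likelihood_h
    return numerator / (numerator + (1.0 - prior_h) * likelihood_not_h)

p = 0.5
for _ in range(50):
    # Each observation is ten times likelier if H is false than if H is true.
    p = bayes_update(p, likelihood_h=0.01, likelihood_not_h=0.1)

print(p)        # around 1e-50: astronomically small
print(p > 0.0)  # True: the hypothesis is never ruled out entirely
```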
Some thoughts:

I think “complete, unquestioning, irrational” is an overly negative description of corrigibility achieved through means other than Bayesian value uncertainty, because with careful engineering, agents that can do STEM may still not have the type of goal-orientedness that prevents their plans from being altered. There are pressures towards such goal-orientedness, but it is actually quite tricky to nail down the arguments precisely, as I wrote in my top-level comment. There is no inherent irrationality in an agent that allows itself to be changed or shut down under certain circumstances, only incoherence, and there are potentially ways to avoid some kinds of incoherence.
Corrigibility should be about creating an agent that avoids the instrumentally convergent pressures to take over the world, to avoid shutdown, and to keep operators from preventing dangerous actions or from changing it in general; it is not specifically about changing its utility function.
In my view corrigibility can include various well-motivated cognitive properties that make an agent safer, as I wrote in a sibling to your original comment. It seems good for an agent to have a working shutdown button, to have taskish rather than global goals, or to have a defined domain of thought such that it’s better at that domain than at psychological manipulation or manufacturing bioweapons. Relying solely on successful value learning for safety puts all your eggs in one basket and means that inner misalignment can easily cause catastrophe.
Corrigible agents will probably not have an explicitly specified utility function.
Corrigibility is likely compatible with safeguards to prevent misuse, and corrigible agents will not automatically allow bad actors to “trivially and unlimitedly” alter their utility function, though there may be tradeoffs here.
The AI does not need to be too dumb to do STEM research to have zero prior on the true value function. The page was describing a thought experiment in which we are able to hand-code a prior distribution over utility functions into the AI. So the AI does not update down to zero; it starts at zero due to an error in design (see the short sketch just after these points).
People have written about Bayesian value uncertainty approaches to alignment problems e.g. here and here; although they are related, they are usually not called corrigibility.
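To illustrate the hand-coded-prior point above (a deliberately minimal sketch with made-up numbers): once a hypothesis has exactly zero prior mass, Bayes’ rule can never move it off zero, however strongly the evidence favours it.

```python
import numpy as np

prior = np.array([0.6, 0.4, 0.0])          # the true utility function (index 2) was given zero mass by design
likelihood = np.array([0.01, 0.02, 0.97])  # evidence overwhelmingly favours the true one

posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior)   # [0.4286 0.5714 0.    ] -- the truth stays at exactly zero, forever
```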
Thanks. I now think we are simply arguing about terminology, which is always pointless. Personally I regard ‘corrigibility’ as a general goal, not as a specific term of art for an (IMO unachievably strong) specification of a particular implementation of that goal. For sufficiently rational, Bayesian, superhuman, non-myopic, consequentialist agents, I am willing to live with the value-uncertainty/value-learner solution to this goal. You appear to be more interested in lower-capacity, more near-term systems than those, and I agree that for them this might not be the best alignment approach. And yes, my original point was that this value-uncertainty form of ‘corrigibility’ has been written about extensively by many people. Who, you tell me, usually didn’t use the word ‘corrigibility’ for what I personally would call a Bayesian solution to the corrigibility problem — oh well.
The AI does not need to be too dumb to do STEM research to have zero prior on the true value function.
Here I would disagree. To do STEM with any degree of reliability (at least outside the pure M part of it), you need to understand that no amount of evidence can completely confirm or (short of a verified formal proof of internal logical inconsistency) rule out any possibility about the world (that’s why scientists call everything a ‘theory’), and also, especially, you need to understand that it is always very possible that the truth is a theory you haven’t yet thought of. So (short of a verified formal proof of internal logical inconsistency in a thesis, at which point you discard it entirely) you shouldn’t have a mind that is capable of assigning a prior of one or zero to anything, including to possibilities you haven’t yet considered or enumerated. As Bayesian priors, those are both effectively NaN (which is one reason why I lean toward storing Bayesian priors in a form, such as log-odds, where they become ±infinity instead).

IMO, anything supposedly Bayesian that is so badly designed that assigning a prior of one or zero to anything isn’t automatically a syntax error isn’t actually a Bayesian, and I would personally be pretty astonished if it could successfully do STEM unaided for any length of time (as opposed to, say, acting as a lab assistant to a more flexible-minded human). But no, I don’t have a mathematical proof of that, and I even agree that someone determined enough might be able to carefully craft a contrived counterexample, with just one little inconsequential Bayesian prior of zero or one. Having the capability of internally representing priors of one or zero just looks like a blatant design flaw to me, as a scientist who is also an engineer.

There are humans who assign Bayesian priors of zero or one to some important possibilities about the world, and one word for them is ‘fanatics’. That thought pattern isn’t very compatible with success in STEM (unless you’re awfully good at compartmentalizing the two apart). And it’s certainly not something I’d feel comfortable designing into an AI unless I was deliberately trying to cripple its thinking in some respect.
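For what it’s worth, here is a minimal sketch of the representation I have in mind (log-odds is one such form; the function names are just illustrative). Updates become additions of log-likelihood ratios, and probabilities of exactly zero or one show up as non-finite values that the representation can simply refuse to construct.

```python
import math

def to_log_odds(p):
    if not (0.0 < p < 1.0):
        # Probabilities of exactly 0 or 1 correspond to -inf / +inf log-odds;
        # treat any attempt to assert them as a design error.
        raise ValueError(f"refusing dogmatic probability {p}")
    return math.log(p / (1.0 - p))

def to_probability(log_odds):
    return 1.0 / (1.0 + math.exp(-log_odds))

def update(log_odds, log_likelihood_ratio):
    """Bayes' rule in log-odds form: posterior log-odds = prior log-odds + log LR."""
    return log_odds + log_likelihood_ratio

belief = to_log_odds(0.5)
belief = update(belief, math.log(0.1))   # evidence at 10:1 odds against
print(to_probability(belief))            # ~0.0909, small but never exactly 0

try:
    to_log_odds(1.0)                     # a dogmatic prior is rejected outright
except ValueError as e:
    print("rejected:", e)
```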
So, IMO, any statement of the form “the AI has a <zero|one> prior for <anything>” strongly implies to me that the AI is likely to be too dumb, flawed, or closed-minded to do STEM competently (and I’m not very interested in solutions to alignment that only work on a system that’s crippled in this way, or in solving alignment problems that only occur on systems crippled in this way). Try recasting them as “the AI has an extremely <low|high> prior for <anything>” and see if the problem then goes away. Again, your mileage may vary.