OK, below I will provide links to a few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.
This list of links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019, trying to find all mathematical papers of interest, but have not done so since.
I wrote some of the papers below, and have read all the rest of them.
I am not linking to any papers I heard about but did not read (yet).
Math-based work on corrigibility solutions typically starts with formalizing corrigibility, or a sub-component of corrigibility, as a mathematical property we want an agent to have. It then constructs such an agent with enough detail to show that this property indeed holds, or at least holds during some part of the agent’s lifetime, or holds under some boundary assumptions.
Not all of the papers below have actual mathematical proofs in them; some of them show correctness by construction. Correctness by construction is superior to needing separate proofs: if you have correctness by construction, your notation will usually be much more revealing about what is really going on than if you need proofs.
Here is the list, with the bold headings describing different
approaches to corrigibility.
Indifference to being switched off, or to reward function updates
Motivated Value Selection for Artificial
Agents
introduces Armstrong’s indifference methods for creating
corrigibility. It has some proofs, but does not completely work out
the math of the solution to a this-is-how-to-implement-it level.
Corrigibility tried
to work out the how-to-implement-it details of the paper above but
famously failed to do so, and has proofs showing that it failed to do
so. This paper somehow launched the myth that corrigibility is
super-hard.
AGI Agent Safety by Iteratively Improving the Utility
Function does work out all the
how-to-implement-it details of Armstrong’s indifference methods, with
proofs. It also goes into the epistemology of the connection between
correctness proofs in models and safety claims for real-world
implementations.
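For orientation, here is a rough sketch of the flavor of the utility indifference construction, in my own simplified rendering; the exact formalisms in the papers above differ in their details. The agent optimizes a combined utility function over histories h:

\[
U(h) =
\begin{cases}
U_N(h) & \text{if the shutdown button is never pressed in } h,\\
U_S(h) + \mathbb{E}[U_N \mid \text{no press}] - \mathbb{E}[U_S \mid \text{press}] & \text{if the button is pressed in } h,
\end{cases}
\]

where U_N is the normal-operation utility function, U_S is the shutdown utility function, and the two expectations are evaluated at the moment a button press could happen. The correction term makes the agent’s expected utility the same whether or not the press happens, so the agent gains nothing from either causing or preventing the press.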
Counterfactual Planning in AGI Systems introduces a different and easier-to-interpret way of constructing a corrigible agent, an agent that happens to be equivalent to agents that can be constructed with Armstrong’s indifference methods. This paper has proof-by-construction type of math.
Corrigibility with Utility
Preservation has a bunch of proofs
about agents capable of more self-modification than those in
Counterfactual Planning. As the author, I do not recommend you read
this paper first, or maybe even at all. Read Counterfactual Planning
first.
Safely Interruptible Agents has yet another take on, or re-interpretation of, Armstrong’s indifference methods. Its title and presentation somewhat de-emphasize the fact that it is about corrigibility, by never even discussing the construction of the interruption mechanism. The paper is also less clearly about AGI-level corrigibility.
How RL Agents Behave When Their Actions Are Modified is another contribution in this space. Again, this is less clearly about AGI.
Agents that stop to ask a supervisor when unsure
A completely different approach to corrigibility, based on a somewhat different definition of what it means to be corrigible, is to construct an agent that automatically stops and asks a supervisor for instructions when it encounters a situation or decision it is unsure about. Such a design would be corrigible by construction, for certain values of corrigibility. The last two papers above can be interpreted as disclosing ML designs that are also applicable in the context of this stop-when-unsure idea.
Asymptotically unambitious artificial general intelligence is a paper that derives some probabilistic bounds on what can go wrong regardless, that is, bounds for the case where the stop-and-ask-the-supervisor mechanism does not trigger. This paper is more clearly about the AGI case, presenting a very general definition of ML.
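To make the stop-when-unsure idea above concrete, here is a minimal sketch in Python. The ensemble-disagreement trigger and all names in it are my own illustrative choices, not a design taken from any of the papers above; real designs differ in how they measure being unsure.

```python
import numpy as np

def choose_action(q_ensemble, state, threshold, ask_supervisor):
    """Act autonomously only when an ensemble of learned value
    estimators agrees closely on the available actions; otherwise
    stop and defer to the supervisor.

    q_ensemble: list of functions mapping a state to an array of
    per-action value estimates.
    """
    q_values = np.stack([q(state) for q in q_ensemble])  # (ensemble, actions)
    disagreement = q_values.std(axis=0).max()            # how unsure the agent is
    if disagreement > threshold:
        return ask_supervisor(state)                      # stop and ask when unsure
    return int(q_values.mean(axis=0).argmax())            # act only when confident
```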
Anything about model-based reinforcement learning
I have yet to write a paper that emphasizes this point, but most
model-based reinforcement learning algorithms produce a corrigible
agent, in the sense that they approximate the ITC counterfactual
planner from the counterfactual planning paper above.
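To make this concrete, here is a tiny, self-contained tabular example of the kind of model-based planning step the claim is about. The tabular setting and all names are my own illustrative choices, and this is not the ITC planner itself; the detail that matters is that the action choice comes from optimizing inside the learned model.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9

# Experience buffers filled while interacting with the environment:
# visit counts per (s, a, s') and summed observed rewards per (s, a).
counts = np.zeros((n_states, n_actions, n_states))
reward_sum = np.zeros((n_states, n_actions))

def plan_in_learned_model(counts, reward_sum, n_iters=200):
    """Build an empirical world model from experience, then run value
    iteration inside that model to obtain a greedy policy."""
    totals = counts.sum(axis=2, keepdims=True)                 # visits per (s, a)
    p_hat = np.where(totals > 0, counts / np.maximum(totals, 1),
                     1.0 / n_states)                           # estimated transitions
    r_hat = reward_sum / np.maximum(totals[..., 0], 1)         # estimated mean rewards
    v = np.zeros(n_states)
    for _ in range(n_iters):
        q = r_hat + gamma * (p_hat @ v)                        # backup inside the model
        v = q.max(axis=1)
    return q.argmax(axis=1)                                    # greedy policy in the model
```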
Now, consider a definition of corrigibility where incompetent agents
(or less inner-aligned agents, to use a term often used here) are less
corrigible because they may end up damaging themselves, their stop
buttons, or their operator by being incompetent. In this case, every
convergence-to-optimal-policy proof for a model-based RL algorithm can
be read as a proof that its agent will be increasingly corrigible
under learning.
CIRL
Cooperative Inverse Reinforcement
Learning and The Off-Switch
Game present yet another
corrigibility method with enough math to see how you might implement
it. This is the method that Stuart Russell reviews in Human
Compatible. CIRL has a drawback, in that the agent becomes less
corrigible as it learns more, so CIRL is not generally considered to
be a full AGI-level corrigibility solution, not even by the original
authors of the papers. The CIRL drawback can be fixed in various
ways, for example by not letting the agent learn too much. But
curiously, there is very little followup work from the authors of the
above papers, or from anybody else I know of, that explores this kind
of thing.
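To see the drawback numerically, here is a small Monte Carlo computation in my own toy rendering of the off-switch game setting (not the exact model from the papers). The robot believes the utility u of its proposed action is Gaussian; it can act directly, defer to a rational human who blocks the action whenever u < 0, or switch itself off. As the robot’s uncertainty shrinks, its incentive to defer to the human shrinks towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def option_values(mu, sigma, n_samples=200_000):
    u = rng.normal(mu, sigma, n_samples)  # robot's belief about the action's utility
    act = u.mean()                        # act without asking the human
    defer = np.maximum(u, 0).mean()       # rational human blocks the action when u < 0
    off = 0.0                             # switch itself off
    return act, defer, off

for sigma in [2.0, 1.0, 0.5, 0.1, 0.01]:
    act, defer, off = option_values(mu=0.3, sigma=sigma)
    print(f"sigma={sigma:5.2f}  act={act:+.3f}  defer={defer:+.3f}  "
          f"gain from deferring={defer - max(act, off):+.3f}")
```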
Commanding the agent to be corrigible
If you have an infinitely competent superintelligence that you can
give verbal commands to that it will absolutely obey, then giving it
the command to turn itself into a corrigible agent will trivially
produce a corrigible agent by construction.
Giving the same command to a not infinitely competent and obedient agent may of course give you a huge number of problems instead. This has sparked endless non-mathematical speculation, but I cannot think of a mathematical paper about this that I would recommend.
AIs that are corrigible because they are not agents
Plenty of work on this. One notable analysis of extending this idea
to AGI-level prediction, and considering how it might produce
non-corrigibility anyway, is the work on counterfactual oracles. If
you want to see a mathematically unambiguous presentation of this,
with some further references, look for the section on counterfactual
oracles in the Counterfactual Planning paper above.
Myopia
Myopia can also be considered to be a feature that creates or improves corrigibility. Many real-world non-AGI agents and predictive systems are myopic by construction: either myopic in time, in space, or in other ways. Again, if you want to see this type of myopia by construction in a mathematically well-defined way when applied to AGI-level ML, you can look at the Counterfactual Planning paper.
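As a tiny mechanical illustration of myopia in time (my own toy example, not code from any paper above): a time-myopic agent picks actions using estimated immediate rewards only, while a non-myopic agent also backs up estimated long-run value through a model.

```python
import numpy as np

def myopic_action(r_hat, state):
    """r_hat: (states, actions) array of estimated immediate rewards."""
    return int(r_hat[state].argmax())                  # horizon-1: no lookahead at all

def farsighted_action(r_hat, p_hat, v_hat, state, gamma=0.99):
    """p_hat: (states, actions, states) transition model; v_hat: (states,) values."""
    q = r_hat[state] + gamma * (p_hat[state] @ v_hat)  # immediate reward plus lookahead
    return int(q.argmax())
```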
Hi Koen, thank you very much for writing this list!
I must say I’m skeptical that the technical problem of corrigibility as I see it is really solved already. I see the problem of corrigibility as shaping consequentialist optimization in a corrigible way. (Yeah, that’s not at all a clear definition yet; I’m still deconfusing myself about that, and I’ll likely publish a post clarifying the problem as I see it within the next month.)
So e.g. corrigibility from non-agenthood is not a possible solution to what I see as the core problem.
I’d expect that the other solutions here may likewise only give you corrigible agents that cannot do new very impressive things (or if they can they might still kill us all).
But I may be wrong. I probably only have time to read one paper. So: What would you say is the strongest result we have here? If I looked at one paper/post and explained why this isn’t a solution to corrigibility as I see it, for what paper would it be most interesting for you to see what I write?
(I guess I’ll do it sometime this week if you write me back, but no promises.)
Also, from your perspective, how big is the alignment tax for implementing corrigibility? E.g. is it mostly just more effort implementing and supervising? Or does it also take more compute to get the same impressive result done? If so, how much? (Best take an example task that is preferably a bit too hard for humans to do. That makes it harder to reason about it, but I think this is where the difficulty is.)
Hi Simon! You are welcome! By the way, I very much want to encourage you to be skeptical and make up your own mind.
I am guessing that by mentioning consequentialist, you are referring to this part of Yudkowsky’s list of doom:
Corrigibility is anti-natural to consequentialist reasoning
I am not sure how exactly Yudkowsky is defining the terms corrigibility or consequentialist here, but I might actually be agreeing with him on the above statement, depending on definitions.
I suggest you read my paper Counterfactual Planning in AGI Systems, because it is the most accessible and general one, and because it presents AGI designs which can be interpreted as non-consequentialist.
I could see consequentialist AGI being stably corrigible if it is placed in a stable game-theoretical environment where deference to humans literally always pays as a strategy. However, many application areas for AI or potential future AGI do not offer such a stable game-theoretical environment, so I feel that this technique has very limited applicability.
If we use the 2015 MIRI paper definition of corrigibility, the alignment tax (the extra engineering and validation effort needed) for implementing corrigibility in current-generation AI systems is low to non-existent. The TL;DR here is: avoid using a bunch of RL methods that you do not want to use anyway when you want any robustness or verifiability. As for future AGI, the size of the engineering tax is open to speculation. My best guess is that future AGI will be built, if ever, by leveraging ML methods that still resemble world model creation by function approximation, as opposed to say brain uploading. Because of this, and some other reasons, I estimate a low safety engineering tax to achieve basic corrigibility.
Other parts of AGI alignment may be very expensive, e.g. the part of actually monitoring an AGI to make sure its creativity is benefiting humanity, instead of merely finding and exploiting loopholes in its reward function that will hurt somebody somewhere. To the extent that alignment cannot be cheap, more regulation will be needed to make sure that operating a massively unaligned AI will always be more expensive for a company than operating a mostly aligned AI. So we are looking at regulatory instruments like taxation, fines, laws that threaten jail time, and potentially measures inside the semiconductor supply chain, all depending on what type of AGI will become technically feasible, if ever.
Thank you! I’ll likely read your paper and get back to you. (Hopefully within a week.)
From reading your comment, my guess is that the main disagreement may be that I think powerful AGI will need to be consequentialist. Like, for e.g. achieving something that humans cannot do yet, you need to search for that target in some way, i.e. have some consequentialist cognition, i.e. do some optimization. (So what I mean by consequentialism is just having some goal to search for / update toward, in contrast to just executing fixed patterns. I think that’s how Yudkowsky means it, but I’m not sure if that’s what most people mean when they use the term.)
(Though this does not imply that you need so much consequentialism that we won’t be able to shut down the AGI. But as I see it, a theoretical solution to corrigibility needs to deal with consequentialism. I haven’t looked into your paper yet, so it’s well possible that my comment here might appear misguided.)
E.g. if we just built a gigantic transformer and trained it on all human knowledge (and say we have a higher sample efficiency or so), it is possible that it could do almost everything humans can do. But it won’t be able to just one-shot solve quantum gravity or so when we give it the prompt “solve quantum gravity”. There is no runtime updating/optimization going on, i.e. the transformer is non-consequentialist. All optimization happened through the training data or gradient descent. Either the human training data was already sufficient to encode a solution to quantum gravity in the patterns of the transformer, or it wasn’t. It is theoretically possible that the transformer learns somewhat deeper underlying patterns than humans have (though I do not expect that from sth like the transformer architecture), and is thus able to generalize a bit further than humans. But it seems extremely unlikely that its understanding would be deep enough to already have the solution to quantum gravity encoded, given that it was never explicitly trained on that and just read physics papers.
The transformer might be able to solve quantum gravity if it can recursively query itself to engineer better prompts, or if it can give itself feedback which is then somehow converted into gradient descent updates and then try multiple times. But in those cases there is consequentialist reasoning again.
The key point: Consequentialism becomes necessary when you go beyond human level.
Out of interest, how much do you agree with what I just wrote?
I think I agree with most of it: I agree that some form of optimization or policy search is needed to get many of the things you want to use AI for. But I guess you have to read the paper to find out the exact subtle way in which the AGIs inside can be called non-consequentialist. To quote Wikipedia:
In ethical philosophy, consequentialism is a class of normative, teleological ethical theories that holds that the consequences of one’s conduct are the ultimate basis for judgment about the rightness or wrongness of that conduct.
I do not talk about this in the paper, but in terms of ethical philosophy, the key bit about counterfactual planning is that it asks: judge one’s conduct by its consequences in what world exactly? Mind you, the problem considered is that we have to define the most appropriate ethical value system for a robot butler, not what is most appropriate for a human.