Why do you think the term “corrigibility” was coined by Robert Miles? My autobiographical memory tends to be worryingly fallible, but I remember coining this term myself after some brainstorming (possibly at a MIRI workshop). This is a kind of thing that I usually try to avoid enforcing because it would look bad if all of the concepts that I did in fact invent were being cited as traceable to me—the truth about how much of this field I invented does not look good for the field or for humanity’s prospects—but outright errors of this sort should still be avoided, if an error it is.
Agent designs that provably meet more of them have since been developed, for example here.
First I’ve seen this paper, haven’t had a chance to look at it yet, would be very surprised if it fulfilled the claims made in the abstract. Those are very large claims and you should not take them at face value without a lot of careful looking.
I’m 94% confident it came from a Facebook thread where you blegged for help naming the concept and Rob suggested it. I’ll have a look now to find it and report back.
Edit: having a hard time finding it, though note that Paul repeats the claim at the top of his post on corrigibility in 2017.
Here it is: https://www.facebook.com/yudkowsky/posts/10152443714699228?comment_id=10152445126604228
Rob Miles (May 2014):
Ok, I’ve given this some thought, and I’d call it:
“Corrigible Reasoning”
using the definition of corrigible as “capable of being corrected, rectified, or reformed”. (And of course AIs that don’t meet this criterion are “Incorrigible”)
Thank you very much! It seems worth distinguishing the concept invention from the name brainstorming, in a case like this one, but I now agree that Rob Miles invented the word itself.
The technical term corrigibility, coined by Robert Miles, was introduced to the AGI safety/alignment community in the 2015 MIRI/FHI paper titled Corrigibility.
Eg I’d suggest that to avoid confusion this kind of language should be something like “The technical term corrigibility, a name suggested by Robert Miles to denote concepts previously discussed at MIRI, was introduced...” &c.
You’re welcome. Yeah “invented the concept” and “named the concept” are different (and both important!).
Thanks a lot all! I just edited the post above to change the language as suggested.
FWIW, Paul’s post on corrigibility here was my primary source for the info that Robert Miles named the technical term. Nice to see the original suggestion as made on Facebook too.
Note that the way Paul phrases it in that post is much clearer and more accurate:
> “I believe this concept was introduced in the context of AI by Eliezer and named by Robert Miles”
Yeah I definitely wouldn’t say I ‘coined’ it, I just suggested the name
First I’ve seen this paper, haven’t had a chance to look at it yet, would be very surprised if it fulfilled the claims made in the abstract. Those are very large claims and you should not take them at face value without a lot of careful looking.
I wrote that paper and abstract back in 2019. Just re-read the abstract.
I am somewhat puzzled how you can read the abstract and feel that it makes ‘very large claims’ that would be ‘very surprising’ when fulfilled. I don’t feel that the claims are that large or hard to believe.
Feel free to tell me more when you have read the paper. My more recent papers make somewhat similar claims about corrigibility results, but they use more accessible math.
I very much agree with Eliezer about the abstract making big claims. I haven’t read the whole paper, so forgive any critiques which you address later on, but here are some of my objections.
I think you might be discussing corrigibility in the very narrow sense of “given a known environment and an agent with a known ontology, such that we can pick out a ‘shutdown button pressed’ event in the agent’s world model, the agent will be indifferent to whether this button is pressed or not.”
We don’t know how to robustly pick out things in the agent’s world model, and I don’t see that acknowledged in what I’ve read thus far. In 2019 I remember someone claiming, “if we knew how to build paperclip maximizers, which actually maximize real paperclips in the real world and don’t just wirehead, we’d probably have resolved most of our confusions about the alignment problem.” I’m sympathetic to this sentiment. We just really don’t know how to do this kind of thing.
Yes, toy models can be useful, and you don’t have to solve this problem to make an advance on off-switch-corrigibility, but in that case, your paper should flag that assumption and its limitations.
I don’t think this engages with the key parts of corrigibility that I think really matter. I think corrigible minds will not try to influence the pressing of a shutdown button, but I really don’t think you can get robustly corrigible behavior by focusing on explicitly specified indifference incentives. Necessary, but not sufficient.
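(For concreteness, here is a minimal toy sketch of what an “explicitly specified indifference incentive” can look like. This is my own illustration, not the construction from the 2015 MIRI/FHI paper or from the paper under discussion; all names are made up, and the compensation term is where the real difficulty lives.)

```python
# Toy sketch of an indifference-style incentive (illustrative only).
# The agent scores trajectories with U_normal while the shutdown button is
# unpressed and with U_shutdown afterwards, plus a compensation constant
# chosen so that, at press time, pressing is worth exactly as much as not
# pressing, with the aim of removing any incentive to influence the button.

def corrected_utility(trajectory, press_index, U_normal, U_shutdown, compensation):
    """Score a trajectory under a simple indifference-style correction.

    press_index is the step at which the button was pressed, or None.
    compensation is the constant that makes the press utility-neutral;
    computing it correctly is exactly where the hard problems live.
    """
    if press_index is None:
        return U_normal(trajectory)
    before, after = trajectory[:press_index], trajectory[press_index:]
    return U_normal(before) + U_shutdown(after) + compensation
```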
So I think your paper says ‘an agent is corrigible’ when you mean ‘an agent satisfies a formal property that might correspond to corrigible behavior in certain situations.’ Be careful not to motte-and-bailey with English words! This is why, in my work on power-seeking, I generally refer to my formal quantity as ‘POWER’ and to the intuitive notion as ‘power.’ I don’t want to say ‘optimal policies tend to seek power in this environment’ and fool the reader into thinking I’ve proved that. I instead say ‘optimal policies tend to seek POWER [in this formal setting, etc]’, and then also argue ‘and here’s why POWER might be a reasonable proxy for our intuitive notion of power.’
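(For readers who have not seen that power-seeking work: very roughly, and glossing over the exact normalization and immediate-reward handling in the published definition, POWER at a state is an average of optimal value over a distribution of reward functions, something of the shape

$$\text{POWER}_{\mathcal{D}}(s, \gamma) \;\approx\; \mathbb{E}_{R \sim \mathcal{D}}\!\left[ V^{*}_{R}(s, \gamma) \right],$$

which is then argued to track the intuitive notion of power.)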
That said, the rest of this comment addresses your paper as if it’s proving claims about intuitive-corrigibility.
On p2, you write:
The main contribution of this paper is that it shows, and proves correct, the construction of a corrigibility safety layer that can be applied to utility maximizing AGI agents.
If this were true, I could give you AIXI, a utility function, and an environmental specification, and your method will guarantee it won’t try to get in our way / prevent us from deactivating it, while also ensuring it does something non-trivial to optimize its goals? That is a big claim. If this were true it would be an absolute breakthrough (even though it wouldn’t necessarily result in practical AGI safety approaches). If this claim were true I’d wonder what the hell kind of insights we’d had; how deeply we had understood the nature of the alignment problem to leash arbitrarily smart and arbitrarily mis-aligned AGIs to our yoke.
To “prove” that a policy has this property, you’d have to define that property first. I don’t even know how to begin formalizing that property, and so a priori I’d be quite surprised if that were done successfully all in one paper. I think corrigibility is interesting because it points to a certain shape of cognition, where somehow that cognition just doesn’t have an instrumental incentive to control human behavior and to avoid deactivation, where somehow that cognition is aware that it might be unfixably motivationally flawed and therefore robustly defers to humans. CIRL can’t do this, for example, and CIRL did advance our understanding of corrigibility to some extent.
Another issue is that you describe a “superintelligent” AGI simulator, when it would really be more accurate to say that you demonstrate a method by computing optimal policies in toy environments, which you claim represent AGI-related scenarios of interest.
(I have more to say but I need to do something else right now)
OK, so we now have n=2 people who read this abstract and feel it makes objectionable ‘very large claims’ or ‘big claims’, where these people feel the need to express their objections even before reading the full paper itself. Something vaguely interesting is going on.
I guess I have to speculate further about the root cause of why you are reading the abstract in a ‘big claim’ way, whereas I do not see ‘big claim’ when I read the abstract.
Utopian priors?
Specifically, you are both not objecting to the actual contents of the paper, you are taking time to offer somewhat pre-emptive criticism based on a strong prior you have about what the contents of that paper will have to be.
Alex, you are even making rhetorical moves to maintain your strong prior in the face of potentially conflicting evidence:
That said, the rest of this comment addresses your paper as if it’s proving claims about intuitive-corrigibility.
Curious. So here is some speculation.
In MIRI’s writing and research agenda, and in some of the writing on this forum, there seems to be a utopian expectation that hugely big breakthroughs in mathematical modeling could be made, mixed up with a wish that they must be made. I am talking about breakthroughs that allow us to use mathematics to construct AGI agents that will provably be
perfectly aligned
with zero residual safety risk
under all possible circumstances.
Suppose you have these utopian expectations about what AGI safety mathematics can do (or desperately must do, or else we are all dead soon).
If you have these expectations of perfection, you can only be disappointed when you read actually existing mathematical papers with models and correctness proofs that depend on well-defined boundary conditions. I am seeing a lot of pre-emptive expression of disappointment here.
Alex: your somewhat extensive comments above seem to be developing and attacking the strawman expectation that you will be reading a paper that will
resolve all open problems in corrigibility perfectly,
not just corrigibility as the paper happens to define it, but corrigibility as you define it
while also resolving, or at least name-checking, all the open items on MIRI’s research agenda
You express doubts that the paper will do any of this. Your doubts are reasonable:
So I think your paper says ‘an agent is corrigible’ when you mean ‘an agent satisfies a formal property that might correspond to corrigible behavior in certain situations.’
What you think is broadly correct. The surprising thing that needs to be explained here is: why would you even expect to get anything different in a paper with this kind of abstract?
Structure of the paper: pretty conventional
My 2019 paper is a deeply mathematical work, but it proceeds in a fairly standard way for such mathematical work. Here is what happens:
I introduce the term corrigibility by referencing the notion of corrigibility developed in the 2015 MIRI/FHI paper.
I define 6 mathematical properties which I call corrigibility desiderata. 5 of them are taken straight from the 2015 MIRI/FHI paper that introduced the term.
I construct an agent and prove that it meets these 6 desiderata under certain well-defined boundary conditions. The abstract mentions an important boundary condition right from the start:
A detailed model for agents which can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes.
The paper devotes a lot of space (it is 35 pages long!) to exploring and illustrating the matter of boundary conditions. This is one of the main themes of the paper. In the end, the proven results are not as utopian as one might conceivably hope for.
What I also do in the paper is that I sometimes use the term ‘corrigible’ as a shorthand for ‘provably meets the 6 defined corrigibility properties’. For example I do that in the title of section 9.8.
You are right that the word ‘corrigible’ is used in the paper in both an informal (or intuitive) sense, and in a more formal sense where it is equated to these 6 properties only. This is a pretty standard thing to do in mathematical writing. It does rely on the assumption that the reader will not confuse the two different uses.
You propose a writing convention where ‘POWER’ always is the formal in-paper definition of power and ‘power’ is the ‘intuitive’ meaning of power, which puts less of a burden on the reader. Frankly I feel that is a bit too much of a departure from what is normal in mathematical writing. (Depends a bit I guess on your intended audience.)
If people want to complain that the formal mathematical properties you named X do not correspond to their own intuitive notion of what the word X really means, then they are going to complain. Does not matter whether you use uppercase or not.
Now, back in 2019 when I wrote the paper, I was working under the assumption that when people in the AGI safety community read the word ‘corrigibility’, they would naturally map this word to the list of mathematical desiderata in the 2015 MIRI/FHI paper titled ‘Corrigibility’. So I assumed that my use of the word corrigibility in the paper would not be that confusing or jarring to anybody.
I found out in late 2019 that the meaning of the ‘intuitive’ term corrigibility was much more contingent, and basically all over the place. See the ‘disentangling corrigibility’ post above, where I try to offer a map to this diverse landscape. As I mention in the post above:
Personally, I have stopped trying to reverse linguistic entropy. In my recent technical papers, I have tried to avoid using the word corrigibility as much as possible.
But I am not going to update my 2019 paper to convert some words to uppercase.
On the ‘bigness’ of the mathematical claims
You write:
On p2, you write:
The main contribution of this paper is that it shows, and proves correct, the construction of a corrigibility safety layer that can be applied to utility maximizing AGI agents.
If this were true, I could give you AIXI, a utility function, and an environmental specification, and your method will guarantee it won’t try to get in our way / prevent us from deactivating it, while also ensuring it does something non-trivial to optimize its goals? That is a big claim.
You seem to have trouble believing the ‘if this were true’. The open question here is how strong of a guarantee you are looking for, when you are saying ‘will guarantee’ above.
If you are looking for absolute, rock-solid utopian ‘provable safety’ guarantees, where this method will reduce AGI risk to zero under all circumstances, then I have no such guarantees on offer.
If you are looking for techniques that will deliver weaker guarantees, of the kind where there is a low but non-zero residual risk of corrigibility failure when you wrap these techniques around a well-tested AI or AGI-level ML system, then those are the kind of techniques that I have to offer.
If this were true it would be an absolute breakthrough
Again, you seem to be looking for the type of absolute breakthrough that delivers mathematically perfect safety always, even though we have fallible humans, potentially hostile universes that might contain unstoppable processes that will damage the agent, and agents that have to learn and act based on partial observation only. Sorry, I can’t deliver on that kind of utopian programme of provable safety. Nobody can.
Still, I feel that the mathematical results in the paper are pretty big. They clarify and resolve several issues identified in the 2015 MIRI/FHI paper. They resolve some of these by saying ‘you can never perfectly have this thing unless boundary condition X is met’, but that is significant progress too.
On the topic of what happens to the proven results when I replace the agent that I make the proofs for with AIXI, see section 5.4 under learning agents. AIXI can make certain prediction mistakes that the agent I am making the proofs for cannot make by definition. These mistakes can have the result of lowering the effectiveness of the safety layer. I explore the topic in some more detail in later papers.
Stability under recursive self-improvement
You say:
I think you might be discussing corrigibility in the very narrow sense of “given a known environment and an agent with a known ontology, such that we can pick out a ‘shutdown button pressed’ event in the agent’s world model, the agent will be indifferent to whether this button is pressed or not.”
We don’t know how to robustly pick out things in the agent’s world model, and I don’t see that acknowledged in what I’ve read thus far.
First off, your claim that ‘We don’t know how to robustly pick out things in the agent’s world model’ is deeply misleading.
We know very well ‘how to do this’ for many types of agent world models. Robustly picking out simple binary input signals like stop buttons is routinely achieved in many (non-AGI) world models as used by today’s actually existing AI agents, both hard-coded and learned world models, and there is no big mystery about how this is achieved.
Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.
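(A minimal sketch of what such a regime could look like in code, purely as my own illustration and not anything from the paper: compare each new observation against statistics gathered during on-distribution testing, and fall back to a safe action when the input drifts too far. The class name and the 4-sigma threshold are made up for the sketch.)

```python
import numpy as np

class DistributionGuard:
    """Flags observations that look unlike the data the agent was tested on."""

    def __init__(self, test_observations, threshold_sigmas=4.0):
        data = np.asarray(test_observations, dtype=float)
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0) + 1e-8   # avoid division by zero
        self.threshold = threshold_sigmas

    def is_on_distribution(self, observation):
        z = np.abs((np.asarray(observation, dtype=float) - self.mean) / self.std)
        return bool(np.max(z) <= self.threshold)

def guarded_step(policy, guard, observation, safe_fallback_action):
    # Off-distribution input: stop trusting the on-distribution test results
    # and take the safe fallback (e.g. halt and alert the overseers).
    if not guard.is_on_distribution(observation):
        return safe_fallback_action
    return policy(observation)
```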
You seem to be looking for ‘not very narrow sense’ corrigibility solutions where we can get symbol grounding robustness even in scenarios where the AGI does recursive self-improvement, where it rebuilds its entire reasoning system from the ground up, and where it then possibly undergoes an ontological crisis. The basic solution I have to offer for this scenario is very simple. Barring massive breakthroughs, don’t build a system like that if you want to be safe.
The problem of formalizing humility
In another set of remarks you make, you refer to the web page Hard problem of corrigibility, where Eliezer speculates that to solve the problem of corrigibility, what we really want to formalize is not indifference but something analogous to humility or philosophical uncertainty.
You say about this that
I don’t even know how to begin formalizing that property, and so a priori I’d be quite surprised if that were done successfully all in one paper.
I fully share your stance that I would not even know how to begin with ‘humility or philosophical uncertainty’ and end successfully.
In the paper I ignore this speculation about humility-based solution directions, and leverage and formalize the concept of ‘indifference’ instead. Sorry to disappoint if you were expecting major progress on the humility agenda advanced by Eliezer.
Superintelligence
Another issue is that you describe a “superintelligent” AGI simulator
Yeah, in the paper I explicitly defined the adjective superintelligent in a somewhat provocative way: I defined ‘superintelligent’ to mean ‘maximally adapted to solving the problem of utility maximization in its universe’.
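(One way to read that definition, as my own gloss rather than a formula taken from the paper: the agent always acts on a policy that is optimal for its utility function U in its universe,

$$\pi^{*} \in \arg\max_{\pi}\; \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, U(s_t, a_t) \;\Big|\; \pi\Big],$$

with no requirement that it be smarter than that in any other sense.)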
I know this is somewhat jarring to many people, but in this case it was fully intended to be jarring. It is supposed to make you stop and think...
(This grew into a very long response, and I do not feel I have necessarily addressed or resolved all of your concerns. If you want to move further conversation about the more technical details of my paper or of corrigibility to a video call, I’d be open to that.)
where these people feel the need to express their objections even before reading the full paper itself
I’d very much like to flag that my comment isn’t meant to judge the contributions of your full paper. My comment was primarily judging your abstract and why it made me feel weird/hesitant to read the paper. The abstract is short, but it is important to optimize so that your hard work gets the proper attention!
(I had about half an hour at the time; I read about 6 pages of your paper to make sure I wasn’t totally off-base, and then spent the rest of the time composing a reply.)
Specifically, you are both not objecting to the actual contents of the paper, you are taking time to offer somewhat pre-emptive criticism based on a strong prior you have about what the contents of that paper will have to be.
Alex, you are even making rhetorical moves to maintain your strong prior in the face of potentially conflicting evidence:
“That said, the rest of this comment addresses your paper as if it’s proving claims about intuitive-corrigibility.”
Curious. So here is some speculation.
Perhaps I could have flagged this so you would have realized it wasn’t meant as a “rhetorical move”: it’s returning to my initial reactions as I read the abstract, which is that this paper is about intuitive-corrigibility. From the abstract:
A corrigible agent will not resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started. This paper shows how to construct a safety layer that adds corrigibility to arbitrarily advanced utility maximizing agents, including possible future agents with Artificial General Intelligence (AGI).
You aren’t just saying “I’ll prove that this AI design leads to such-and-such formal property”, but (lightly rephrasing the above): “This paper shows how to construct a safety layer that [significantly increases the probability that] arbitrarily advanced utility maximizing agents [will not] resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started… [I] prove that the corrigibility layer works as intended in a large set of non-hostile universes.”
This does not parse like a normal-strength mathematical claim. This is a claim about the de facto post-deployment safety properties of “arbitrarily advanced utility maximizing agents.”
Again, I’m not saying that your paper doesn’t have any good contributions. I can’t judge that without further reading. But I am standing by my statement that this is a non-standard claim which I’m skeptical of and which makes me hesitate to read the rest of the paper.
We know very well ‘how to do this’ for many types of agent world models. Robustly picking out simple binary input signals like stop buttons is routinely achieved in many (non-AGI) world models as used by today’s actually existing AI agents, both hard-coded and learned world models, and there is no big mystery about how this is achieved.
Yes, we know how to do it for existing AI agents. But if the ‘off-switch’ is only a binary sensory modality (there’s a channel that says ‘0’ or ‘1’ at each time step), then how do you have AIXI pick out ‘the set of worlds in which humans are pressing the button’ versus ‘the set of worlds in which a rock fell on the button’?
And even that is an unrealistically structured scenario, since it seems like prosaic AGI is quite plausible. Prosaic AGI would be way messier than AIXI, since it wouldn’t be doing anything as clean as Bayes-updating the simplicity prior to optimize an explicit utility function.
Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.
This is not going to happen for AGI, since we might not survive testing on-distribution, and how would we ensure that the environment “stays on-distribution”? Is that like, pausing the world forever?
You seem to be looking for ‘not very narrow sense’ corrigibility solutions where we can get symbol grounding robustness even in scenarios where the AGI does recursive self-improvement, where it rebuilds its entire reasoning system from the ground up, and where it then possibly undergoes an ontological crisis. The basic solution I have to offer for this scenario is very simple. Barring massive breakthroughs, don’t build a system like that if you want to be safe.
I’m not just talking about that; the above shows how symbol grounding is tough even for seemingly well-defined events like “is the off-switch being pressed?”, without any fancy self-improvement.
My comment was primarily judging your abstract and why it made me feel weird/hesitant to read the paper. The abstract is short, but it is important to optimize so that your hard work gets the proper attention!
OK, that clarifies your stance. You feeling weird definitely created a weird vibe in the narrative structure of your comment, a vibe that I picked up on.
(I had about half an hour at the time; I read about 6 pages of your paper to make sure I wasn’t totally off-base, and then spent the rest of the time composing a reply.)
You writing it quickly in half an hour also explains a lot about how it reads.
it’s returning to my initial reactions as I read the abstract, which is that this paper is about intuitive-corrigibility.
I guess we have established by now that the paper is not about your version of intuitive-corrigibility.
For my analysis of intuitive-corrigibility, see the contents of the post above. My analysis is that intuitions on corrigibility are highly diverse, and have gotten even more diverse and divergent over time.
You interpret the abstract as follows:
You aren’t just saying “I’ll prove that this AI design leads to such-and-such formal property”, but (lightly rephrasing the above): “This paper shows how to construct a safety layer that [significantly increases the probability that] arbitrarily advanced utility maximizing agents [will not] resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started.
Yes that is what I am saying in the abstract. Your light rephrasing where you add [significantly increases the probability that] indeed expresses the message I intended to convey.
[I] prove that the corrigibility layer works as intended in a large set of non-hostile universes.”
The phrasing ‘works as intended’ in the abstract is supposed to indicate clearly that the layer is designed to produce specific such-and-such formal corrigibility properties only, not some broad idea of ‘intuitive corrigibility’.
So I am guessing you did not pick up on that when reading the abstract.
OK, moving away from a discussion about abstracts, initial impressions, feelings and their causes, moving towards a discussion of more technical stuff:
But if the ‘off-switch’ is only a binary sensory modality (there’s a channel that says ‘0’ or ‘1’ at each time step), then how do you have AIXI pick out ‘the set of worlds in which humans are pressing the button’ versus ‘the set of worlds in which a rock fell on the button’?
In the paper I don’t try to make the agent’s world model distinguish between ‘humans are pressing the button’ and ‘a rock fell on the button’. The ‘works as intended’ is that any button press, for whatever reason, is supposed to stop the agent.
So this is a design tradeoff I am making. Specifically, this is driven by a lack of trust that any ML-based ‘human versus rock’ discrimination system will work reliably enough in every possible future emergency. The design tradeoff is to accept the risk that a stray rock will stop the agent accidentally, potentially causing some level of inconvenience.
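(In code, that tradeoff amounts to something like the following minimal sketch of my own, not the actual construction in the paper: the stop signal is treated as authoritative whatever its cause.)

```python
def act(policy, observation, stop_signal: bool, halt_action):
    # Any press halts the agent; no 'human versus rock' classifier
    # sits between the button and the halt.
    if stop_signal:
        return halt_action
    return policy(observation)
```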
Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.
This is not going to happen for AGI, since we might not survive testing on-distribution,
You are saying we might not survive AGI. This kind of observation is almost a tautology, and therefore not very interesting.
What interests me is identifying techniques that can increase the chance of survival. Drive down the risk. Admit that zero risk is impossible and move on. There are plenty of risk-reduction techniques out there if you look.
and how would we ensure that the environment “stays on-distribution”? Is that like, pausing the world forever?
Here is a paperclip optimizer example of ‘ensuring that the agent environment never goes off-distribution’. Your design goals are to 1) keep your paperclip optimizing AGI agent inside your paperclip factory, and also 2) to make sure that crazy people and/or radically new technologies do not manage to get into the factory.
In other words, keep the environment inside the factory on-distribution, and keep the agent in the factory.
Yes, you will never perfectly achieve these two goals with 100% certainty. Your goal is to drive down the risk of failure on both of them.
For goal 1), there are many techniques that come to mind, and you probably want to use them all: 1a) physical barriers that the agent cannot breach with the resources it has at hand, 1b) active agent oversight, where oversight has a kill switch and is not afraid to use it, 1c) agent design measures (penalty terms in the reward function, use of counterfactual planning, certain training regimes, etc.) that robustly lower the chance of the agent ever wanting to get out of the factory, and 1d) automatic safety interlocks.
I spent some time in my professional life doing safety engineering for physical systems and cyber-physical systems, and this is how safety gets done there. Progress is often expensive, often annoying to certain stakeholders, and not always philosophically very satisfying. But you cannot have everything.