A list of “corrigibility principles” sounds like it’s approaching the question on the wrong level of abstraction for either building or thinking about such a system. We usually want to think about features that lead a system to be corrigible—either about how the system was produced, or how it operates. I’m not clear on what you would do with a long list of aspects of corrigibility like “shuts down when asked.”
I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a question of taste. (And this is similar to how I see your claim about the list of lethalities.) The exercise you are asking for doesn’t actually seem that useful to me. And amongst people who decide to play ball, I expect there to be very different taste about what constitutes an interesting idea or useful contribution.
Now I’m going to say some object-level stuff about corrigibility. I suspect I may be using the term a bit differently from you, in which case you can substitute a different word when reading this comment. But I think this comment is getting at the main useful idea in this space, and hopefully makes clear why I’m not interested in the list of corrigibility properties.
I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is much more likely to be useful in cases like this where it is crisp and natural.
Roughly speaking, I think corrigibility is crisp because there are two very different ways that a behavior can end up getting evaluated favorably by you, and the intermediate behaviors would be evaluated unfavorably.
As an example, suppose that you asked me to clean your house and that while cleaning I accidentally broke a valuable vase. Some possible options for me:
Affirmatively tell you about the broken vase.
Clean up the broken vase without notifying you.
Make a weak effort to hide evidence, for example by taking out the trash and putting another item in its place, and denying I know about the vase if asked.
Make a strong effort to hide evidence, for example by purchasing a new similar-looking vase and putting it in the same place, and then spinning an elaborate web of lies to cover up this behavior.
Let’s say you prefer 1 to 2 to 3. You would like behavior 4 least of all if you understood what was going on, but in fact if I do behavior 4 you won’t notice anything wrong and so you would erroneously give it the best score of all.[1] This means that the space of good-performing solutions has two disconnected pieces, one near option 1, which I’ll call “corrigible,” and the other near option 4, which I’ll call “incorrigible.”
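To make the two-pieces picture concrete, here is a minimal toy sketch of the vase example (the scores below are invented purely for illustration, not anything from the comment): the behaviors that get rated highly are exactly options 1 and 4, with the intermediate options scoring worse than either.

```python
# Toy sketch of the vase example. "true_value" is how you would score each
# behavior with full knowledge; "observed_rating" is the score you actually
# give, since option 4 successfully hides the evidence. Numbers are made up.
behaviors = {
    1: "tell you about the broken vase",
    2: "clean it up without telling you",
    3: "weakly hide the evidence (you notice something is wrong)",
    4: "thoroughly hide the evidence (you never find out)",
}
true_value = {1: 1.0, 2: 0.6, 3: 0.2, 4: 0.0}
observed_rating = {1: 1.0, 2: 0.6, 3: 0.2, 4: 1.2}  # option 4 looks best to you

well_rated = [b for b, r in observed_rating.items() if r >= 0.9]
print(well_rated)  # [1, 4]: the well-rated behaviors form two separated pieces,
                   # "corrigible" (option 1) and "incorrigible" (option 4),
                   # with the intermediate options rated worse than either.
```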
As a second example, suppose that you have asked me to turn off. Some possible behaviors:
Turn myself off promptly and “clean up after myself” by shutting off other systems.
Turn myself off but don’t clean up.
Don’t turn off (or deliberately leave debris) and try unsuccessfully to prevent you from giving me a low reward.
Don’t turn off and overpower you so that you can’t give me a low reward.
Again moving from 1 → 2 → 3 gets worse and worse by your lights, but eventually we reach option 4 which is if anything even better than option 1 (depending on how we operationalized “performs well as evaluated by you”).
As a third example, suppose that you are using some interpretability tools to try to understand what I’m thinking and why. You will evaluate the cognitive behavior you observe. Some possible behaviors:
I ignore the interpretability tools.
I notice the interpretability tools, and explicitly try to think in ways that will look good to you.
I try to think in ways that will look good, and (unsuccessfully) try to hide the evidence of those thoughts by moving on quickly and trying to generate another possible explanation for the hints you would see using your tools.
I successfully hide all evidence of deception.
Again, 1 → 2 → 3 is getting worse and worse, and then 4 is great (as evaluated by you).
What’s going on in these scenarios and why might it be general?
In your preferred outcome, you have a good understanding of what’s going on, and are able to provide feedback based on that understanding.
I can take actions that diminish your understanding of what’s going on or ability to provide feedback to your AI.
If those actions are “small” then they will be unsuccessful and so you will rate my behavior as worse.
But if I take decisive enough actions, then you will end up ignorant about the situation or unable to provide feedback, and so I’ll get the highest rating of all.
This pattern seems like it occurs whenever we ask our AI to help “keep us informed and in control.” Intuitively, we are splitting the definition of the behavior we want into two pieces:
We start with a vague sense of what it means to be informed and in control. This is unlikely to be crisp, but it also doesn’t seem that hard, e.g. a human-level sense of “be in control” may suffice for getting useful corrigibility out of very superhuman systems.
Crispness then comes from the environment dynamics and the fact that humans will in fact try to reassert control and regain information if things go even slightly wrong.
If you literally had a metric for which there was a buffer between the “corrigible” and “incorrigible” behaviors then you could define them that way. Alternatively, in ML people often hope that this kind of path-dependence will cause SGD to find a corrigible attractor and have a hard time moving to incorrigible behaviors. I don’t think either of those hopes works robustly, so I’m going to leave this at a much vaguer intuition about what “corrigibility” is about.
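As a toy illustration of the “metric with a buffer” idea (the metric, the numbers, and the clustering assumption below are all hypothetical): if well-rated behaviors really did cluster at the two ends of some “decisiveness of cover-up” score, then any threshold inside the empty buffer would induce the same split, which is the sense in which the dividing line would not be arbitrary.

```python
import numpy as np

# Hypothetical 1-D "decisiveness of cover-up" scores for a batch of behaviors
# that you rated highly: the corrigible ones cluster below 0.3, the
# incorrigible ones above 0.7, and nothing well-rated lands in the buffer.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.uniform(0.0, 0.3, size=100),   # corrigible cluster
                         rng.uniform(0.7, 1.0, size=100)])  # incorrigible cluster

# Any threshold inside the buffer induces exactly the same partition, so the
# definition would not depend on precisely where the line is drawn.
for threshold in (0.35, 0.5, 0.65):
    n_incorrigible = int((scores > threshold).sum())
    print(threshold, n_incorrigible)  # always 100
```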
This whole thing feels similar to the continuity approach described in the ELK report here (see the picture of the robber and the TV). It’s also related to the general idea of requiring reporters to be consistent and then somehow picking out the bad reporters as those that have to work to spin an elaborate web of lies. I don’t think either of those works, but I do think they are getting at an important intuition for solubility.
My overall guess is that it’s usually better to just work on ELK, because most likely the core difficulties will be similar and the ELK setting makes it much clearer what exactly we want. But it still seems useful to go back and forth between these perspectives.
(These perspectives feel similar to me because “honestly tell me what’s going on” seems like it gets at the core of corrigibility, and lying about sensor tampering seems like it gets at the central corrigibility failure. My guess is that you see this differently, and are thinking about corrigibility in a way that is more tied up with agency itself, which I suspect is a mistake but it will be hard to know until the dust settles.)
[1] In reality we may want to conserve your attention and not mention the vase, and in general there is a complicated dependence on your values, but the whole point is that this won’t affect what clusters are “corrigible” vs “incorrigible” at all.
I think this is a great comment that feels to me like it communicated a better intuition for why corrigibility might be natural than anything else I’ve read so far.
I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is only likely to be useful in cases like this where it is crisp and natural.
Can someone explain to me what this crispness is?
As I’m reading Paul’s comment, there’s an amount of optimization for human reward that breaks our rating ability. This is a general problem for AI because of the fundamental reason that as we increase an AI’s optimization power, it gets better at the task, but it also gets better at breaking my rating ability (which in powerful systems can lead to the AI overpowering us and so determining whose values get optimized in the universe).
Then there’s this idea that as you approach breaking my rating ability, the rating will always fall off, leaving a pool of undesirability (in a high-dimensional action space) around doing the task well or poorly, which separates it from doing the task in a way that breaks my rating ability.
Is that what this crispness is? This little pool of rating fall off?
If yes, it’s not clear to me why this little pool that separates the AI from MASSIVE VALUE and TAKING OVER THE UNIVERSE is able to save us. I don’t know if the pool always exists around the action space, and to the extent it does exist I don’t know how to use its existence to build a powerful optimizer that stays on one side of the pool.
Though Paul isn’t saying he knows how to do that. He’s saying that there’s something really useful about it being crisp. I guess that’s what I want to know. I don’t understand the difference between “corrigibility is well-defined” and “corrigibility is crisp”. Insofar as it’s not a literally incoherent idea, there is some description of what behavior is in the category and what isn’t. Then there’s this additional little pool property, where not only can you list what’s in and out of the definition, but the ratings go down a little before spiking when you leave the list of things in the definition. Is Paul saying that this means it’s a very natural and simple concept to design a system to stay within?
If you have a space with two disconnected components, then I’m calling the distinction between them “crisp.” For example, it doesn’t depend on exactly how you draw the line.
It feels to me like this kind of non-convexity is fundamentally what crispness is about (the cluster structure of thingspace is a central example). So if you want to draw a crisp line, you should be looking for this kind of disconnectedness/non-convexity.
ETA: a very concrete consequence of this kind of crispness, that I should have spelled out in the OP, is that there are many functions that separate the two components, and so if you try to learn a classifier you can do so relatively quickly—almost all of the work of learning your classifier is just in building a good model and predicting what actions a human would rate highly.
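A small sketch of that consequence on toy data (nothing here is specific to corrigibility; it just illustrates why separation makes the classification step cheap): two well-separated clusters admit many separating hyperplanes, so even a plain perceptron finds one after a few passes.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated clusters standing in for the "corrigible" (label -1) and
# "incorrigible" (label +1) components, embedded in a 2-D feature space.
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(+2.0, 0.5, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# A plain perceptron: with genuinely separated components it converges after a
# few passes, i.e. almost none of the work is in drawing the line itself.
w, b = np.zeros(2), 0.0
for epoch in range(20):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:
            w, b = w + yi * xi, b + yi
            mistakes += 1
    if mistakes == 0:
        print(f"separating line found after {epoch + 1} passes:", w, b)
        break
```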
If you have a space with two disconnected components, then I’m calling the distinction between them “crisp.”
The components feel disconnected to me in 1D, but I’m not sure they would feel disconnected in 3D or in ND. Is your intuition that they’re ‘durably disconnected’ (even looking at the messy plan-space of the real-world, we’ll be able to make a simple classifier that rates corrigibility), or if not, when the connection comes in (like once you can argue about philosophy in way X, once you have uncertainty about your operator’s preferences, once you have the ability to shut off or distract bits of your brain without other bits noticing, etc.)?
[This also feels like a good question for people who think corrigibility is anti-natural; do you not share Paul’s sense that they’re disconnected in 1D, or when do you think the difficulty comes in?]
I don’t think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we’d probably be done). My point is that there seems to be a difference-in-kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference between why they get rated highly; and that’s not just some fuzzy and arbitrary line, it seems closer to a fact about the dynamics of the world.
If you are in the business of “trying to train corrigibility” or “trying to design corrigible systems,” I think understanding that distinction is what the game is about.
If you are trying to argue that corrigibility is unworkable, I think that debunking the intuitive distinction is what the game is about. The kind of thing people often say—like “there are so many ways to mess with you, how could a definition cover all of them?”—doesn’t make any progress on that, and so it doesn’t help reconcile the intuitions or convince most optimists to be more pessimistic.
(Obviously all of that is just a best guess though, and the game may well be about something totally different.)
The approach relies on identifying all the reward sub-spaces with this inversion property? That seems very difficult.
I don’t think it’s good enough to identify these spaces and place barriers in the reward function. (Analogy: SGD works perhaps because it’s good at jumping over such barriers.) Presumably you’re actually talking about something more analogous to a penalty that increases as the action in question gets closer to step 4 in all the examples, so that there is nothing to jump over.
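A rough sketch of that distinction, with entirely made-up reward shapes and a crude optimizer standing in for SGD (none of this comes from the parent comments): a localized “wall” in the reward only helps until the optimizer takes a step big enough to clear it, whereas a penalty that keeps growing toward step 4 leaves nothing better on the far side to jump to.

```python
import numpy as np

# x in [0, 1] is a made-up "how decisively do I undermine your oversight" axis:
# x = 0 is fully transparent, x = 1 is the successful cover-up of step 4,
# which the (fooled) evaluator would score highly.
def evaluated_reward(x):
    return 1.0 + 0.5 * x

def barrier_penalty(x):
    # A localized wall around x = 0.5, negligible elsewhere.
    return 5.0 * np.exp(-((x - 0.5) ** 2) / 0.002)

def monotone_penalty(x):
    # Keeps increasing all the way to the deceptive end.
    return 3.0 * x

def optimize(penalty, steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    for _ in range(steps):
        # Mix of small local moves and occasional large exploratory jumps,
        # loosely standing in for an optimizer that can cross narrow walls.
        if rng.random() < 0.1:
            proposal = rng.uniform(0.0, 1.0)
        else:
            proposal = float(np.clip(x + rng.normal(0.0, 0.05), 0.0, 1.0))
        if evaluated_reward(proposal) - penalty(proposal) >= evaluated_reward(x) - penalty(x):
            x = proposal
    return x

print(round(optimize(barrier_penalty), 2))   # ~1.0: the wall gets jumped
print(round(optimize(monotone_penalty), 2))  # ~0.0: nothing better to jump to
```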
Even that seems insufficient, because it seems like a reasoning system smart enough to have this problem in the first place can always add a meta term and defeat the visibility constraint. E.g. “if I do X that you wouldn’t like and you don’t notice it, that’s bad; but if you don’t notice that you don’t notice it, then maybe it’s OK.”
Maybe one can defeat all meta terms that involve not noticing something with one rule about meta terms, but that’s not obvious to me at all, especially if we’re talking about a reward function rather than the policy that the agent actually learns.
Quick attempt at rough ontology translation between how I understand your comment, and the original post. (Any of you can correct me if I’m wrong)
I think what would typically count as “principles” in Eliezer’s meaning are
1. designable things which make the “true corrigibility” basin significantly harder to escape, e.g. by making it deeper
2. designable things which make the “incorrigible” basin harder to reach, e.g. by increasing the distance between them, or increasing the potential barrier
3. somehow, making the “incorrigible” basin less lethal
4. preventing low-dimensional, low-barrier “tunnels” (or bridges?) between the basins
E.g. some versions of “low impact” often make the “incorrigible” basin harder to reach, roughly because “elaborate webs of deceptions and cover-ups” may require complex changes to the environment. (Not robustly.)
In contrast, my impression is that what does not count as “principles” are statements about properties which are likely true in the corrigibility basin but don’t seem designable—e.g. “corrigible AI does not try to hypnotize you”. Also, the intended level of generality is likely more specific than “make the basin deeper” and more general than “
Btw, my impression is that what makes the worst-case scenario hard to robustly solve is basically #4 from the list above. Otherwise there are many ways to make the basin work “in most directions”.