I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is only likely to be useful in cases like this where it is crisp and natural.
Can someone explain to me what this crispness is?
As I’m reading Paul’s comment, there’s some amount of optimization for human reward that breaks our rating ability. This is a general problem for AI for a fundamental reason: as we increase an AI’s optimization power, it gets better at the task, but it also gets better at breaking my rating ability (which, in sufficiently powerful systems, can end up determining whose values get optimized in the universe).
Then there’s this idea that as you approach the point of breaking my rating ability, the rating will always fall off first, leaving a pool of low ratings (in a high-dimensional action space) around the region of doing the task well or poorly, which separates that region from doing the task in a way that breaks my rating ability.
Is that what this crispness is? This little pool of rating fall-off?
If yes, it’s not clear to me why this little pool that separates the AI from MASSIVE VALUE and TAKING OVER THE UNIVERSE is able to save us. I don’t know whether the pool always exists around the task-completing region of action space, and to the extent it does exist, I don’t know how to use its existence to build a powerful optimizer that stays on the near side of it.
Though Paul isn’t saying he knows how to do that. He’s saying that there’s something really useful about it being crisp. I guess that’s what I want to know. I don’t understand the difference between “corrigibility is well-defined” and “corrigibility is crisp”. Insofar as it’s not a literally incoherent idea, there is some description of what behavior is in the category and what isn’t. Then there’s this additional little-pool property, where not only can you list what’s in and out of the definition, but the ratings dip before spiking back up as you move outside the set of behaviors the definition covers. Is Paul saying that this means it’s a very natural and simple concept to design a system to stay within?
If you have a space with two disconnected components, then I’m calling the distinction between them “crisp.” For example, it doesn’t depend on exactly how you draw the line.
It feels to me like this kind of non-convexity is fundamentally what crispness is about (the cluster structure of thingspace is a central example). So if you want to draw a crisp line, you should be looking for this kind of disconnectedness/non-convexity.
ETA: a very concrete consequence of this kind of crispness, that I should have spelled out in the OP, is that there are many functions that separate the two components, and so if you try to learn a classifier you can do so relatively quickly—almost all of the work of learning your classifier is just in building a good model and predicting what actions a human would rate highly.
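To make that last point concrete, here is a toy sketch (my own illustration, not from the thread; the clusters, features, and classifier are made up): when two classes occupy genuinely separated regions, many different decision boundaries classify them perfectly, so the exact boundary a learner picks is not load-bearing and training a separator is cheap.

```python
# Toy sketch (illustrative only): two disconnected clusters in a made-up
# feature space stand in for "corrigible" vs "incorrigible" behaviors.
# Because the clusters don't touch, many different separating functions all
# achieve perfect accuracy -- the exact line drawn is not load-bearing.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
corrigible = rng.normal(loc=-3.0, scale=0.5, size=(200, 2))
incorrigible = rng.normal(loc=+3.0, scale=0.5, size=(200, 2))

X = np.vstack([corrigible, incorrigible])
y = np.array([0] * 200 + [1] * 200)

# Wildly different regularization strengths pick different boundaries,
# but in this toy setup every one of them separates the components perfectly.
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C).fit(X, y)
    print(C, clf.score(X, y))  # accuracy 1.0 for each setting here
```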
The components feel disconnected to me in 1D, but I’m not sure they would feel disconnected in 3D or in ND. Is your intuition that they’re ‘durably disconnected’ (i.e. even looking at the messy plan-space of the real world, we’ll be able to make a simple classifier that rates corrigibility), or, if not, when does the connection come in (e.g. once you can argue about philosophy in way X, once you have uncertainty about your operator’s preferences, once you have the ability to shut off or distract bits of your brain without other bits noticing, etc.)?
[This also feels like a good question for people who think corrigibility is anti-natural; do you not share Paul’s sense that they’re disconnected in 1D, or when do you think the difficulty comes in?]
I don’t think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we’d probably be done). My point is that there seems to be a difference-in-kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference between why they get rated highly; and that’s not just some fuzzy and arbitrary line, it seems closer to a fact about the dynamics of the world.
If you are in the business of “trying to train corrigibility” or “trying to design corrigible systems,” I think understanding that distinction is what the game is about.
If you are trying to argue that corrigibility is unworkable, I think that debunking the intuitive distinction is what the game is about. The kind of thing people often say—like “there are so many ways to mess with you, how could a definition cover all of them?”—doesn’t make any progress on that, and so it doesn’t help reconcile the intuitions or convince most optimists to be more pessimistic.
(Obviously all of that is just a best guess though, and the game may well be about something totally different.)
Thanks!
The approach relies on identifying all the reward sub-spaces with this inversion property? That seems very difficult.
I don’t think it’s good enough to identify these spaces and place barriers in the reward function. (Analogy: SGD works perhaps because it’s good at jumping over such barriers.) Presumably you’re actually talking about something more analogous to a penalty that increases as the action in question gets closer to step 4 in all the examples, so that there is nothing to jump over.
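To make the barrier-versus-graded-penalty distinction concrete, here is a toy sketch (my own analogy, not the commenter’s: a noisy one-dimensional hill-climber standing in for SGD, with a made-up reward curve): a thin barrier with high reward on its far side gets hopped over, while a penalty that keeps growing past the edge leaves nothing on the far side worth reaching.

```python
# Toy sketch (illustrative only): x measures how far a plan drifts toward the
# problematic regime. A thin "barrier" penalty leaves high reward beyond it,
# so a noisy local search with a finite step size can hop straight over it.
# A penalty that keeps growing past the edge removes the incentive entirely.
import random

EDGE = 4.0  # hypothetical point where behavior starts to break the rater

def reward_with_barrier(x):
    # Thin wall at the edge, narrower than the search's step size.
    barrier = -50.0 if EDGE < x < EDGE + 0.2 else 0.0
    return x + barrier  # task reward keeps rising with x

def reward_with_graded_penalty(x):
    penalty = -10.0 * max(0.0, x - EDGE) ** 2  # grows the further past the edge
    return x + penalty

def noisy_hill_climb(reward, steps=5000, step_size=0.5, seed=0):
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        if reward(candidate) > reward(x):  # greedy accept on noisy proposals
            x = candidate
    return x

print(noisy_hill_climb(reward_with_barrier))         # typically ends up far past the edge
print(noisy_hill_climb(reward_with_graded_penalty))  # stops just past x = EDGE
```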
Even that seems insufficient, because it seems like a reasoning system smart enough to have this problem in the first place can always add a meta term and defeat the visibility constraint. E.g. “if I do X that you wouldn’t like and you don’t notice it, that’s bad; but if you don’t notice that you don’t notice it, then maybe it’s OK.”
Maybe one can defeat all meta terms that involve not noticing something with one rule about meta terms, but that’s not obvious to me at all, especially if we’re talking about a reward function rather than the policy that the agent actually learns.
This isn’t how I’m thinking about it.