I’ve actually updated toward a target which could arguably be called either corrigibility or human values, and is different from previous framings I’ve seen of either
Right, in my view the line between them is blurry as well. One distinction that makes sense to me is:
“Value learning” means making the AGI care about human values directly — as in, putting them in place of its utility function.
“Corrigibility” means making the AGI care about human values through the intermediary of humans — making it terminally care about “what agents with the designation ‘human’ care about”. (Or maybe “what my creator cares about”, “what a highly-specific input channel tells me to do”, etc.)
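To make the contrast concrete, here is a tiny toy sketch, purely illustrative: every name in it (ValueLearner, CorrigibleAgent, infer_principals_goals, etc.) is made up for this comment, and nothing here is anyone's actual proposal. The first agent's utility is a fixed learned scoring of outcomes; the second's utility routes through a pointer to whatever the designated agent currently wants.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Illustrative stand-ins: "WorldState" is just whatever the AGI's world model
# says about an outcome, and a "Scorer" assigns it a value.
WorldState = Dict[str, float]
Scorer = Callable[[WorldState], float]


@dataclass
class ValueLearner:
    """'Value learning': a learned scoring of outcomes sits in place of the
    utility function; the target is whatever got baked in at training time."""
    learned_human_values: Scorer

    def utility(self, state: WorldState) -> float:
        return self.learned_human_values(state)


@dataclass
class CorrigibleAgent:
    """'Corrigibility' (in the sense above): utility is defined through a
    pointer to an agent in the world model, so it tracks whatever that agent
    (creator, order-giver, ...) is currently inferred to care about."""
    infer_principals_goals: Callable[[], Scorer]

    def utility(self, state: WorldState) -> float:
        current_goals = self.infer_principals_goals()  # re-queried each time
        return current_goals(state)
```

The intended difference is just where the target lives: the ValueLearner's scorer is fixed once learned, while the CorrigibleAgent's target moves whenever its model of the pointed-to agent's goals updates.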
Put like this, yup, “corrigibility” seems like a better target to aim for. In particular, it’s compact and, as you point out, convergent — “an agent” and “this agent’s goals” are likely much easier to express than the whole suite of human values, and would be easier for us to locate in the AGI’s ontology (e.g., we should be able to side-step a lot of philosophical headaches by outsourcing them to the AI itself).
In that sense, “strawberry alignment”, in MIRI’s parlance, is indeed easier than “eudaimonia alignment”.
However...
insofar as humans value corrigibility (or particular aspects of corrigibility), the same challenges of expressing corrigibility mathematically also need to be solved in order to target values
I’ve been pretty confused about why MIRI thought that corrigibility is easier, and this is exactly why. Imparting corrigibility still requires making the AI care about some very specific conceptions humans have about how their commands should be executed, e.g. “don’t optimize for this too hard” and other do-what-I-mean (DWIM) constraints. But if we can do that, if it understands us well enough to figure out all the subtle implications in our orders, then why can’t we just tell it to “build a utopia” and expect that to go well? It seems like a strawberry-aligned AI should interpret that order faithfully as well… This is a view that Nate/Eliezer don’t seem to outright rule out; they sometimes talk about “short reflection”.
But other times “corrigibility” seems to mean a grab-bag of tricks for essentially upper-bounding the damage an AI can inflict, presumably followed by a pivotal act executed via this system (with a large amount of collateral damage) and then a long reflection. On that model, there’s also a meaningful presumed difficulty difference between strawberry alignment and eudaimonia alignment: the former doesn’t require us to be very good at retargeting the AGI at all. But it also seems obviously doomed to me, and not necessarily easier (inasmuch as this flavor of “corrigibility” doesn’t seem like a natural concept at all).
My reading is that you endorse the former type of corrigibility as well, not the latter?
Yes. I also had the “grab-bag of tricks” impression from MIRI’s previous work on the topic, since it was mostly just trying various hacks, and that was also part of why I mostly ignored it. The notion that there’s a True Name to be found here, that we’re not just trying hacks, is a big part of why I now have hope for corrigibility.
“Corrigibility” means making the AGI care about human values through the intermediary of humans — making it terminally care about “what agents with the designation ‘human’ care about”. (Or maybe “what my creator cares about”,
Interesting—that is actually what I’ve considered to be proper ‘value learning’: correctly locating and pointing to humans and their values in the agent’s learned world model, in a way that naturally survives/updates correctly with world-model ontology updates. The agent then has a natural intrinsic motivation to further improve its own understanding of human values (and thus its utility function) simply through the normal curiosity drive for value-of-information improvements to its world model.
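A very rough sketch of that picture, under loud assumptions: all names below are hypothetical, and the single Gaussian-belief weight plus a value-of-information bonus are stand-ins for a real world model and curiosity drive, not a proposal. The shape it shows is just: the agent keeps an explicitly uncertain estimate of the pointed-to humans’ values inside its world model, scores outcomes through that estimate, and gets an intrinsic bonus for observations expected to shrink the uncertainty, which is what makes “keep refining your model of what they value” attractive to it.

```python
class HumanValuesPointer:
    """Hypothetical: an explicitly uncertain estimate of the pointed-to humans'
    values, kept inside the world model (here, one Gaussian-belief weight)."""

    def __init__(self, prior_mean: float = 0.0, prior_var: float = 1.0):
        self.mean = prior_mean  # current best guess at how much humans value the feature
        self.var = prior_var    # uncertainty about that guess

    def update(self, observation: float, obs_noise_var: float = 0.5) -> None:
        # Standard conjugate Gaussian update on new evidence about human preferences.
        gain = self.var / (self.var + obs_noise_var)
        self.mean += gain * (observation - self.mean)
        self.var *= (1.0 - gain)

    def expected_var_after_observation(self, obs_noise_var: float = 0.5) -> float:
        # For a Gaussian belief the posterior variance is deterministic, so the
        # expected uncertainty reduction from one more observation is easy to compute.
        return self.var * obs_noise_var / (self.var + obs_noise_var)


def utility(outcome_feature: float, values: HumanValuesPointer) -> float:
    # Utility is defined *through* the pointer, so it automatically tracks
    # every update to the agent's model of what the humans value.
    return values.mean * outcome_feature


def curiosity_bonus(values: HumanValuesPointer) -> float:
    # Value-of-information style intrinsic reward: reducing uncertainty about
    # human values is itself rewarded, giving the "keep improving your
    # understanding of their values" drive described above.
    return values.var - values.expected_var_after_observation()
```

The ontology-robustness part of the picture (the pointer surviving world-model ontology updates) isn’t modeled here at all; the sketch only shows the “utility through an uncertain pointer, plus reward for refining it” shape.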
I wasn’t making a definitive statement on what I think people mean when they say “corrigibility”, to be clear. The point I was making is that any implementation of corrigibility that I think is worth trying for necessarily has the “faithfulness” component — i.e., the AI would have to interpret its values/tasks/orders the way they were intended by the order-giver, instead of some other way. Which, in turn, likely requires somehow making it locate humans in its world-model (though likely implemented as “locate the model of [whoever is giving me the order]” in the AI’s utility function, not necessarily referring to [humans] specifically).
And building off that definition, if “value learning” is supposed to mean something different, then I’d define it as pointing at human values not through humans, but directly. I.e., making the AI value the same things that humans value not because it knows that it’s what humans value, but just because.
Again, I don’t necessarily think that this is what most people mean by these terms most of the time — I would natively view both approaches as something like “value learning” as well. But this discussion started from John (1) differentiating between them, and (2) viewing both approaches as viable. This is just how I’d carve it under these two constraints.