“Corrigibility” means making the AGI care about human values through the intermediary of humans — making it terminally care about “what agents with the designation ‘human’ care about”. (Or maybe “what my creator cares about”,
Interesting—that is actually what I’ve considered to be proper ‘value learning’: correctly locating and pointing to humans and their values in the agent’s learned world model, in a way that naturally survives/updates correctly with world model ontology updates. The agent then has a natural intrinsic motivation to further improve its own understanding of human values (and thus its utility function) simply through the normal curiosity drive for value of information improvement to its world model.
To be clear, I wasn’t making a definitive statement on what I think people mean when they say “corrigibility”. The point I was making is that any implementation of corrigibility that I think is worth trying for necessarily has the “faithfulness” component, i.e., the AI would have to interpret its values/tasks/orders the way the order-giver intended, instead of some other way. That, in turn, likely requires somehow making it locate humans in its world-model (though this would likely be implemented as “locate the model of [whoever is giving me the order]” in the AI’s utility function, not necessarily referring to [humans] specifically).
And building off that definition, if “value learning” is supposed to mean something different, then I’d define it as pointing at human values not through humans, but directly. I.e., making the AI value the same things that humans value not because it knows that it’s what humans value, but just because.
Again, I don’t necessarily think that this is what most people mean by these terms most of the time; I would natively view both approaches as something like “value learning” as well. But this discussion started from John (1) differentiating between them, and (2) viewing both approaches as viable. This is just how I’d carve it under these two constraints.