This is really similar to some stuff I’ve been thinking about, so I’ll be writing up a longer comment with more compare/contrast later.
But one thing really stood out to me: I think one can go further in grappling with and taking advantage of “where U_H lives.” U_H doesn’t live inside the human; it lives in the AI’s model of the human. Humans aren’t idealized agents, they’re clusters of atoms, which means they don’t have preferences except after the sort of coarse-graining procedure you describe, and that coarse-graining procedure lives within a particular model of the human; it’s not inherent in the atoms.
This means that once you’ve specified a value learning procedure and a human model, there are no residual “actual preferences” the AI can check itself against. The challenge was never to access our “actual preferences”; it was always to make a best effort to model humans as they want to be modeled. This is deeply counterintuitive (“What do you mean, the AI isn’t going to learn what humans’ actual preferences are?!”), but also liberating and motivating.
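To make “it lives in the model” concrete, here is a toy sketch (made-up numbers and a simple softmax choice model; an illustration, not anything from the post): the same observed choices are assigned exactly the same probability by a “noisily rational human who prefers option 0” and a “noisily anti-rational human who prefers option 1,” so which utilities get attributed to the human depends entirely on which model of the human was assumed.

```python
import numpy as np

# Observed picks between option 0 and option 1 (mostly 0, with a couple of "mistakes").
choices = np.array([0, 0, 1, 0, 0, 1, 0])

def likelihood(choices, utils, beta):
    """Probability of the observed picks under a softmax ("Boltzmann") chooser.
    beta > 0: roughly rational; beta < 0: anti-rational (picks what it disprefers)."""
    p0 = 1.0 / (1.0 + np.exp(-beta * (utils[0] - utils[1])))  # P(pick option 0)
    return np.where(choices == 0, p0, 1.0 - p0).prod()

u = np.array([1.0, 0.0])

# Model A: noisily rational human who prefers option 0.
# Model B: noisily anti-rational human who prefers option 1.
# Both assign exactly the same probability to the data.
print(likelihood(choices, u, beta=2.0))
print(likelihood(choices, -u, beta=-2.0))
```

Nothing in the choice data arbitrates between the two; only the modelling assumption does.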
One of the reasons I refer to synthesising (or constructing) the U_H, not learning it.
Now that I think about it, it’s a pretty big PR problem if I have to start every explanation of my value learning scheme with “humans don’t have actual preferences, so the AI is just going to try to learn something adequate.” Maybe I should figure out a system of jargon such that I can say, in jargon, that the AI is learning people’s actual preferences, and it will correspond to what laypeople actually want from value learning.
I’m not sure whether such jargon would make actual technical thinking harder, though.
Try something like: humans don’t have actual consistent preferences, so the AI is going to try to find a good approximation that covers all the contradictions and uncertainties in human preferences.
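Concretely, something in the spirit of a Bradley-Terry fit might do (a minimal sketch with invented counts, not a specific proposal from this thread): given pairwise preference data that contains a cycle, fit the single consistent utility vector that best explains it. The output is the “good approximation,” not a readout of anyone’s underlying preferences.

```python
import numpy as np

# wins[i, j] = how often option i was preferred to option j.
# Note the cycle: A beats B, B beats C, C beats A, so no consistent ranking fits exactly.
wins = np.array([
    [0, 9, 3],   # A
    [1, 0, 7],   # B
    [7, 3, 0],   # C
], dtype=float)

utils = np.zeros(3)
lr = 0.01
for _ in range(3000):
    grad = np.zeros(3)
    for i in range(3):
        for j in range(3):
            if i == j:
                continue
            p = 1.0 / (1.0 + np.exp(-(utils[i] - utils[j])))  # modelled P(i preferred to j)
            grad[i] += wins[i, j] * (1.0 - p)  # gradient of the Bradley-Terry log-likelihood
            grad[j] -= wins[i, j] * (1.0 - p)
    utils += lr * grad        # gradient ascent toward the best consistent explanation
    utils -= utils.mean()     # utilities are only defined up to an additive constant

print(np.round(utils, 3))  # one consistent utility vector, even though the data contained none
```

Whatever vector comes out is a property of the fitting procedure as much as of the human; swap the loss or the noise model and you get a somewhat different U_H.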