I have a different example in mind than the one John provided. @johnswentworth, do mention if I’m misunderstanding what you’re getting at there.
Suppose you train your AI to show respect to your ancestors. Your understanding of what this involves contains things like “preserve accurate history” and “teach the next generations about the ancestors’ deeds” and “pray to the ancestors daily” and “ritually consult the ancestors before making big decisions”.
In the standard reward-misspecification setup, the AI doesn’t actually internalize the intended goal of “respect the ancestors”. Instead, it grows a bunch of values about the upstream correlates of that, like “preserving accurate history” and “doing elaborate ritual dances” (or, more realistically, some completely alien variants of this). It starts to care about the correlates terminally. Then it tiles the universe with dancing books or something, with no “ancestors” mentioned anywhere in them.
In the “unexpected generalization” setup, the AI does end up caring about the ancestors directly. But as it learns more about the world than you know, its ontology updates, and it discovers that, why, spirits aren’t actually real, and “praying to” and “consulting” the ancestors are just arbitrary behaviors that have nothing in particular to do with keeping the ancestors happy and respected. So the AI keeps telling and teaching accurate histories, but entirely drops the ritualistic elements of your culture.
But what if what you actually cared about was preserving your culture? Rituals included, even if you learn they don’t do anything, because you still want them for the aesthetic/cultural connection?
Well, then you’re out of luck. You thought you knew what you wanted, but your lack of knowledge about the structure of the domain you were operating in foiled you. And the AI doesn’t care; it was taught to respect the ancestors, not to be corrigible to your shifting opinions.
It’s similar to the original post’s example of using “zero correlation” as a proxy for “zero mutual information” to minimize information leaks. You think you know what your target is, but you don’t actually know its True Name, so even optimizing for your actual not-Goodharted best understanding of it still leads to unintended outcomes.
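To make that proxy failure concrete (my own toy illustration, not one from the post): take a variable symmetric around zero and square it. The correlation between the two is zero, yet the first variable completely determines the second, so the mutual information between them is as large as it can be. A minimal sketch in Python:

```python
from collections import Counter

import numpy as np

# Toy illustration (mine, not from the post): X symmetric around zero, Y = X^2.
# The correlation is ~0, yet X fully determines Y, so the mutual information is
# far from zero -- the "zero correlation" proxy passes while the actual target
# ("zero mutual information", i.e. no information leak) badly fails.
rng = np.random.default_rng(0)
x = rng.choice([-1.0, 0.0, 1.0], size=100_000)
y = x ** 2

print("correlation:", np.corrcoef(x, y)[0, 1])  # approximately 0

# Plug-in estimate of mutual information (in bits) from the empirical joint.
n = len(x)
joint = Counter(zip(x, y))
px, py = Counter(x), Counter(y)
mi = sum(
    (c / n) * np.log2((c / n) / ((px[a] / n) * (py[b] / n)))
    for (a, b), c in joint.items()
)
print("mutual information:", mi)  # ~0.918 bits (= H(Y), since Y is a function of X)
```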
“The AI starts to care about making humans rate its actions as good” is a particularly extreme example of this: the concept the humans care about is so confused that there’s nothing in reality outside their minds for it to correspond to, so there’s nothing for the AI to latch onto except the raters themselves.
That is a different phenomenon than the thing I was getting at with the particular block Daniel quoted at top-of-thread. It is, however, an excellent example (better than the example I used) of the sort of thing the “metaphilosophy” section of the post was getting at.
Yeah, I guess that block was about more concrete issues with the “humans rate things” setup? And what I’ve outlined is more of a… mirror of it?
Here’s a different example. Imagine feeding the AI a dataset of ethical dilemmas, and giving it a thumbs-up every time it does something “good” according to you. Your goal is to grow something which cares about human flourishing, maybe a consequentialist utilitarian, and you think that’s the way to go. But your deontology is actually very flawed, so what you grow is a bullet-biting evil deontologist. I think that’s analogous to the human-raters setup, right?
And then the equal-and-opposite failure mode is if you’re feeding the AI some ethics dataset in an attempt to teach it deontological injunctions, but it instead distills them into “consequentialist utilitarianism”, in a surprising and upsetting-to-you manner.