I think this is the wrong way of looking at it. In this analogy, the PA is “genuinely trying their hardest to optimize for your values”; it’s just poor at understanding these values. That problem is basically ignorance, and so by making the PA smarter or more aware, we can solve the problem.
But an AGI that fully understood your values would still not optimise for them if it had a bad goal. The AGI is not well-intentioned-but-weird-in-implementation; its intentions themselves are alien/weird to us.
I wish I had been clearer in my title that I’m not trying to reframe all misaligned AGIs, just a particular class of them. I agree that an AGI that fully understood your values would not optimize for them (and would not be “well-intentioned”) if it had a bad goal.
“That problem is basically ignorance, and so by making the PA smarter or more aware, we can solve the problem.”
If we’ve correctly specified the values in an AGI, then I agree that a smart enough AGI will correctly optimize for our values. But that’s not necessarily robust to scaling down, and I think a less capable AGI is likely to hit a weird place where it’s trying and failing to optimize for our values. This post is about my intuitions for what that might look like.
Ok; within that subset of problems, I agree.