I wish I had been clearer in my title that I’m not trying to reframe all misaligned AGIs, just a particular class of them. I agree that an AGI that fully understood your values would not optimize for them (and would not be “well-intentioned”) if it had a bad goal.
That problem is basically ignorance, and so by making the PA smarter or more aware, we can solve the problem.
If we’ve correctly specified the values in an AGI, then I agree that once the AGI is smart enough it’ll correctly optimize for our values. But that isn’t necessarily robust to scaling down, and I think it’s likely to hit a weird place where it’s trying and failing to optimize for our values. This post is about my intuitions for what that might look like.
Ok; within that subset of problems, I agree.