The primary thing I am arguing, as Max has already said, is that the AI alignment paradigm obscures the most fundamental problem of AI safety: that of prediction. It does this by conflating various interpretations of what values or utility functions are.
One of the most fundamental insights entailed by a move from an alignment paradigm to a predictive one is that it becomes far from clear whether the relevant problems are solvable.
Nothing in this shows that AGI is “guaranteed to be destructive to human preferences”, of course. Rather, it shows that the various paradigms one may choose for making AI safe, like RLHF, should not actually make us more confident in our AGI systems at all, because they address the wrong questions. We can never know in advance whether they will work, and this is true for all paradigms that try to sidestep the prediction problem by appealing to values (bracketing hard-wired values, of course).
I’d just say that we’ve never known anything in advance for certain. The idea that we’d be able to prove analytically that an ASI was safe was always hopeless. And that’s been accepted by most of the alignment community for some time. I don’t see that this perspective changes the predictive power of the theories we’re applying.
I think the interesting discussion is not about how certain exactly our predictions of doom are or can be.
Let me put the central point another way: however pessimistic you are about the success of alignment, you should become more pessimistic once you realize that alignment requires the prediction of an AI's actions. Any notion that we could circumvent this by engineering values into the system is illusory.