Are you arguing that AGI can’t be guaranteed to be safe? Or that it’s guaranteed to be destructive to human preferences? If the first, I take that to be old news.
I’ve long ago given up on guarantees of safety, and I read most of the community as being in a similar boat.
The primary thing I am arguing, as Max has already said, is that the AI alignment paradigm obscures the most fundamental problem of AI safety: that of prediction. It does this by conflating various interpretations of what values or utility functions are.
One of the most fundamental insights entailed by a move from an alignment paradigm to a predictive one is that it becomes far from clear whether the relevant problems are solvable.
Nothing in this shows that AGI is “guaranteed to be destructive to human preferences”, of course. Rather, it shows that the various paradigms one might choose in trying to make AI safe, like RLHF, should not actually make us more confident in our AGI systems at all, because they address the wrong questions: we can never know in advance whether they will work, and this is true for all paradigms that try to sidestep the prediction problem by appealing to values (bracketing hard-wired values, of course).
I’d just say that we’ve never known anything in advance for certain. The idea that we’d be able to prove analytically that an ASI was safe was always hopeless. And that’s been accepted by most of the alignment community for some time. I don’t see that this perspective changes the predictive power of the theories we’re applying.
I think the interesting discussion is not about how certain exactly our predictions of doom are or can be.
Let me put the central point another way: however pessimistic you are about the success of alignment, you should become more pessimistic once you realize that alignment requires the prediction of an AI’s actions. Any notion that we could circumvent this by engineering values into the system is illusory.
Neither quite captures it, imo. I think it’s mostly an attempt at deconfusion:
We can’t hope to solve alignment by sufficiently nudging the relevant AI’s utility function, since knowing something about that utility function (as argued here) requires either predicting it (not just tweaking it & crossing your fingers really hard) or predicting the AI’s behavior. This is a substantially harder problem than the term “alignment” suggests on the surface, and it’s one we apparently cannot avoid. Interpretability (as far as I’m aware) is nowhere near this. The prediction section makes an epistemic argument, though: it suggests that solving the alignment problem is much harder than one might think, not how likely doom by default is. For the latter, we would need to know the sampling space we’ll be in when developing AGI, and, as shminux points out, a definition that delineates the safe part of that sampling space.
The second thesis is more strictly about the impossibility of safety guarantees, but I think it’s most interesting in conjunction with the first one: if we need prediction for alignment and prediction is impossible for computationally universal systems, then the alignment problem may also be impossible to solve for such systems. Thus the fire thesis (which is a possibility we should consider, not something that’s shown here; there might be all kinds of informal clues for safety, a bit more on that in the discussion section here).
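To make the “prediction is impossible for computationally universal systems” step concrete, here is a minimal sketch (not from the original post) of the diagonalization that blocks a perfectly general behavioral predictor: for any candidate predictor, one can construct an agent whose behavior contradicts the predictor’s verdict about it. The names `make_adversary` and `naive_predictor` are purely illustrative assumptions.

```python
# Sketch: no total, always-correct predictor of "will this agent act safely?"
# can exist for agents that can consult the predictor about themselves.

def make_adversary(predictor):
    """Return an agent that does the opposite of what `predictor` says about it."""
    def adversary():
        # Ask the predictor whether this very agent will behave safely...
        if predictor(adversary):
            return "UNSAFE ACTION"   # ...and falsify a "safe" verdict
        return "stays idle"          # ...or falsify an "unsafe" verdict
    return adversary

def naive_predictor(agent):
    # Stand-in predictor; any total predictor suffers the same fate.
    return True  # claims every agent is safe

agent = make_adversary(naive_predictor)
print("predictor says safe:", naive_predictor(agent))  # True
print("agent actually does:", agent())                 # "UNSAFE ACTION"
```

This only rules out a universally correct predictor for such systems; it says nothing against fallible, informal prediction, which is consistent with the “informal clues for safety” caveat above.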