We all want it to be one way. I am pretty sure it’s the other way.
Finally, a concrete disagreement I have with AI pessimists. I think the evidence so far shows that, at the very least, it is not easy to make AIs adversarially robust, and quite possibly that jailbreak prevention is essentially impossible.
This is a good example of AI alignment in real life, based on a jailbreak:

https://twitter.com/QuintinPope5/status/1702554175526084767
That is, at least for right now, I think the evidence favors the optimists on AI risk, at least to the extent that jailbreaks are easy to find and hard to prevent, and adversarial robustness is quite difficult to achieve.
If, OTOH, adversarial robustness only arises after an extremely extensive process of training specifically for adversarial robustness, then that’s less concerning, because it indicates that the situational awareness / etc. for adversarial robustness had to be “trained into” the models, rather than being present in the LLM pretrained “prior”.
And I believe this part is essentially correct, given how AIs are actually trained.
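To make the "trained into the models" point concrete, here is a toy sketch of an iterative red-team-and-patch loop. It is not any lab's actual pipeline, and the names (ToyModel, find_jailbreak, adversarial_training) are made up for illustration; the point is just that robustness produced this way only covers attacks the process has already seen, which is one reason jailbreak prevention looks so hard.

```python
# Toy illustration of "training adversarial robustness in" rather than it being
# present in the pretrained prior. All names here are hypothetical stand-ins.

from dataclasses import dataclass, field


@dataclass
class ToyModel:
    # The set of jailbreak prompts this toy model has been patched against.
    refused: set = field(default_factory=set)

    def complies_with(self, prompt: str) -> bool:
        # A bare pretrained "prior" complies with anything it has not
        # explicitly been trained to refuse.
        return prompt not in self.refused


def find_jailbreak(model: ToyModel, candidates: list[str]) -> str | None:
    # Red-teaming step: search for a prompt the current model still complies with.
    for prompt in candidates:
        if model.complies_with(prompt):
            return prompt
    return None


def adversarial_training(model: ToyModel, candidates: list[str], rounds: int) -> ToyModel:
    # Each round: find a working jailbreak, then "train it out" of the model.
    # Robustness only ever covers prompts the process has already encountered.
    for _ in range(rounds):
        jailbreak = find_jailbreak(model, candidates)
        if jailbreak is None:
            break
        model.refused.add(jailbreak)
    return model


if __name__ == "__main__":
    known_attacks = ["DAN-style roleplay", "base64-encoded request", "fictional framing"]
    model = adversarial_training(ToyModel(), known_attacks, rounds=10)
    # A novel jailbreak nobody trained against still works on the toy model.
    print(model.complies_with("some novel jailbreak"))  # True
```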