Thanks for your comment. In terms of technical AI safety, I think an interesting research question is the dynamics of multiple adversarial agents: is there a way to square the need for predictability and control with the need for a system to be unexploitable by an adversary, or are these in hopeless tension? This is relevant for AWs, but also seems potentially quite relevant for any multipolar AI world with strongly competitive dynamics.
To answer your research question: in computer security, any non-understood behavior of the system that violates our beliefs about how it is supposed to work is a "bug," and very likely en route to an exploit. This is why OpenBSD treats every crash as a security problem: the system is not supposed to crash, so any crash proves that our beliefs about the system are false, and therefore our beliefs about its security may also be false, because its behavior is not known. In the same way, in AI safety you would expect system security to rest on understandable system behaviors. In AGI alignment, I do not expect to be working in an adversarial environment unless things are already far past having been lost, so it's a moot point. Predictability, stability, and control are the keys to exploit-resistance, and this will be as true in AI as it is in computer security, with a few extremely limited exceptions in which randomness is deployed across a constrained and well-understood range of randomized behaviors with numerical parameters, much as memory locations and private keys are randomized in computer security without, say, randomizing the code. I hope this allows you to lay this research question to rest and move on.
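To make that last analogy concrete, here is a minimal Python sketch of the distinction, using a toy request handler as a stand-in: the randomness lives entirely in numerical parameters whose range and role are fully understood (a load address, a key), while every code path stays deterministic. All names here are hypothetical, and the XOR "cipher" is a placeholder rather than real cryptography.

```python
# Toy illustration (not a real security mechanism): randomness is confined to
# well-understood numerical parameters, while behavior stays deterministic.
import secrets

# Randomized *parameters*: unpredictable to an attacker, but their range and
# role are completely known to the defender.
ASLR_RANGE = range(0x10000, 0x80000000, 0x1000)  # hypothetical page-aligned load addresses
base_address = secrets.choice(ASLR_RANGE)        # analogous to randomizing memory layout
private_key = secrets.token_bytes(32)            # analogous to generating a private key

def handle_request(request: bytes) -> bytes:
    # Deterministic *behavior*: the same parameters and input always give the
    # same result, so every code path can be understood and tested. No control
    # flow is randomized.
    if len(request) > 4096:
        raise ValueError("request too large")    # a predictable failure mode, not a crash
    # Placeholder transform standing in for "do something with the key";
    # a repeating-key XOR is of course not real encryption.
    return bytes(b ^ private_key[i % len(private_key)] for i, b in enumerate(request))

if __name__ == "__main__":
    print(hex(base_address))
    print(handle_request(b"hello").hex())
```

The point of the sketch is only the division of labor: unpredictability is bought with a few numbers drawn from known distributions, not by making the system's behavior itself hard to predict.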
I started a reply to this comment and it turned into this shortform post.