What about quickly distributing frontier AI once it is shown to be safe? That is risky, of course, if it isn't actually safe; however, if the deployed AI is as powerful and as widely distributed as possible, then a bad AI would need to be comparatively more powerful to take over.
So:
AI(x-1) is everywhere and protecting as much as possible, AI(x) is sandboxed,
vs.
AI(x-2) is protecting everything, AI(x-1) is in a few places, AI(x) is sandboxed.
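A minimal toy sketch of this comparison, assuming (purely for illustration) arbitrary capability numbers per generation and an assumed rule that the bar for takeover scales with the deployed defender's capability and coverage; none of the numbers, names, or the scaling rule come from the discussion itself.

```python
# Toy comparison of the two deployment scenarios above.
# All capability numbers and the "bar to take over" rule are made-up
# assumptions for illustration, not claims from the discussion.

# Assumed relative capabilities per generation (arbitrary units).
capability = {"AI(x-2)": 1.0, "AI(x-1)": 2.0, "AI(x)": 4.0}

def takeover_bar(defender, coverage):
    """Capability an attacker would need to exceed, under the assumed rule
    that the bar scales with the deployed defender's capability and with
    how much of the world it is actually protecting (0..1)."""
    return capability[defender] * (1 + coverage)

# Scenario A: AI(x-1) deployed everywhere (high coverage), AI(x) sandboxed.
bar_a = takeover_bar("AI(x-1)", coverage=0.9)

# Scenario B: AI(x-2) protects everything, AI(x-1) only in a few places.
bar_b = max(takeover_bar("AI(x-2)", coverage=0.9),
            takeover_bar("AI(x-1)", coverage=0.1))

print(f"Bar for AI(x) in scenario A: {bar_a:.2f}")   # higher bar to clear
print(f"Bar for AI(x) in scenario B: {bar_b:.2f}")   # lower bar to clear
print(f"Assumed AI(x) capability:    {capability['AI(x)']:.2f}")
```

Under these made-up numbers the attacker needs a noticeably larger edge in scenario A, which is the intuition behind distributing the newer defender widely.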
Or the bad AI is able to hack every copy of the widely distributed AI in the same way, making the question moot.
But it would surely be more likely to hack x-2 than x-1?
Right, and it would be easier to hack, since every copy has the same adversarial examples, right?
Oh, wait, I see what you’re saying. No, I think hacking x-1 and x-2 will both be trivial. AIs have basically zero security right now.
I think the relative difficulty of hacking AI(x-1) and AI(x-2) will be sensitive to how much emphasis you put on the “distribute AI(x-1) quickly” part. I.e., if you rush it, you might make it worse, even if AI(x-1) has the potential to be more secure. (Also, there is the “single point of failure” effect, though it is unclear how large it is.)
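A minimal sketch of that tradeoff, assuming (purely for illustration) made-up numbers for intrinsic security, a rushed-rollout penalty, and a correlated “one exploit hits every copy” risk; nothing below is an estimate from the thread, and the function and parameter names are hypothetical.

```python
# Toy model of "rushing the newer defender might make things worse".
# All numbers below are illustrative assumptions, not estimates.

def effective_security(intrinsic, rush_penalty, monoculture_risk):
    """Deployed security on a 0..1 scale: intrinsic robustness, discounted
    by how rushed the rollout was and by the chance that a single exploit
    (the same adversarial example everywhere) compromises every copy."""
    return intrinsic * (1 - rush_penalty) * (1 - monoculture_risk)

# AI(x-2), deployed carefully over a long period.
careful_x2 = effective_security(intrinsic=0.3, rush_penalty=0.0, monoculture_risk=0.2)

# AI(x-1), potentially more secure, but pushed out everywhere in a hurry.
rushed_x1 = effective_security(intrinsic=0.5, rush_penalty=0.5, monoculture_risk=0.3)

print(f"Carefully deployed AI(x-2): {careful_x2:.3f}")
print(f"Rushed AI(x-1):             {rushed_x1:.3f}")
# With these made-up numbers, the rushed newer model ends up less secure in
# practice, which is the failure mode the comment points at.
```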