I think some things we can do to better our chances include:
enforcing sandboxed testing of frontier models before they are deployed, using independent audits by governments or outside companies. This could potentially prevent a model which has undergone a sharp left turn from escaping.
developing better ways of testing for potential harms from AI systems, and expanding the set of available evals for various sorts of risk
putting more collective resources into AI safety: alignment research, containment preparations, worldwide monitoring, international treaties
ensuring that a militarily dominant coalition of nations agrees that, should a rogue AGI arise in the world, their best chance of survival is a rapid, forceful response that stamps it out before it gains too much power. Have sufficiently precise definitions and agreed-upon procedures in place so that action can follow automatically from detection, without the need for lengthy discussion.
What about quickly distributing frontier AI once it is shown to be safe? That is risky, of course, if it isn’t actually safe; however, if the deployed AI is as powerful as possible and distributed as widely as possible, then a bad AI would need to be comparatively more powerful to take over.
So:
AI(x-1) is everywhere and protecting as much as possible, AI(x) is sandboxed,
vs.
AI(x-2) is protecting everything, AI(x-1) is in a few places, AI(x) is sandboxed.
Or the bad AI is able to hack every copy of the widely distributed AI in the same way, making the question moot.
But it would surely be more likely to hack x-2 than x-1?
Right, and it would be easier to hack, since it has the same adversarial examples, right?
Oh, wait, I see what you’re saying. No, I think hacking x-1 and x-2 will both be trivial. AIs have basically zero security right now.
I think the relative difficulty of hacking AI(x-1) and AI(x-2) will be sensitive to how much emphasis you put on the “distribute AI(x-1) quickly” part. That is, if you rush it, you might make things worse, even if AI(x-1) has the potential to be more secure. (There is also the “single point of failure” effect, though it is unclear how large that is.)
To clarify: The question about improving Steps 1-2 was meant specifically for [improving things that resemble Steps 1-2], rather than [improving alignment stuff in general]. And the things you mention seem only tangentially related to that, to me.
But that complaint aside: sure, all else being equal, all of the points you mention seem better to have than not to have.