There are various things that could happen, and which one you expect depends on your estimate of the likelihoods. If we have today’s level of safety measures, and a new model comes along with world-ending capabilities that we can’t detect, then, yeah, we don’t detect them and the model goes out and ends the world.
It’s possible, though, that before that happens, they’ll create a model that has dangerous (possibly world-ending, possibly just highly destructive) capabilities (like knowing how to cross-breed smallpox with COVID, or how to hack into most internet-attached computers and parlay that into all sorts of mayhem) but that isn’t good at concealing them. They’ll detect this and announce it to the world, which would hopefully say a collective “Oh fuck” and use the result to justify imposing much stronger security and safety mandates for further AI development.
That would put us in a better position. Then maybe further iteration on that would convince the relevant people: “You can’t make it safe; abandon this stuff, use only the previous generation for the economic value it provides, and go and forcibly prevent everyone in the world from making a model this capable.”
With LLMs as they are, I do suspect that dangerous cybersecurity capabilities are fairly near-term, and I don’t think near-term models are likely to hide those capabilities (though it may take some effort to elicit them). So some portions of the above seem likely to me; others are much more of a gamble. I’d say some version of the above has a … 10-30% chance of saving us?