The “Deferred and Temporary Stopping” Paradigm
Quickly written. Probably missed where people are already saying the same thing.
I actually feel like there’s a lot of policy and research effort aimed at slowing down the development of powerful AI: basically all the evals and responsible scaling policy stuff.
A story for why this is the AI safety paradigm we’ve ended up in is that it’s palatable. It’s palatable because it doesn’t actually require that you stop; certainly not right now. To the extent companies (or governments) are on board, it’s because those companies are at best promising “I’ll stop later when it’s justified”. They’re probably betting that they’ll be able to keep arguing it’s not yet justified. At the least, it doesn’t require a change of course now, so they’ll go along with it to placate you.
Even if people anticipate that they will trigger evals and maybe have to delay or stop releases, I would bet they’re not imagining they’ll have to delay or stop for all that long (if they’re even thinking it through that much). Just long enough to patch or fix the issue, then get back to training the next iteration. I’m curious how many people imagine that once certain evaluations are triggered, the correct update is that deep learning and transformers are too shaky a foundation, and that we might then need to stop large AI training runs until we have much more advanced alignment science, and maybe a new paradigm.
I’d wager that if certain evaluations are triggered, there will be people reaching for the smallest possible argument that gets them back to business as usual. Arguments about not letting others get ahead will abound. Claims that it’s better for us to proceed (even though it’s risky) than to leave the field to the Other, who is truly reckless. Better us with our values than them with their threatening values.
People genuinely concerned about AI are pursuing these approaches because they seem feasible compared to an outright moratorium. You can get companies and governments to make agreements of the form “we’ll stop later” and “you only have to stop while some hypothetical condition is met”. If the bid were “stop now”, it’d be a non-starter.
And so the bet is that people will actually be willing to stop later to a much greater extent than they’re willing to stop now. As I write this, I’m unsure of what probabilities to place on this. If various evals are getting triggered in labs:
What is the probability that the lab listens to this, versus ignoring the warning sign so that it doesn’t even make it out of the lab?
If it gets reported to the government, how strongly does the government insist on stopping? How quickly is it appeased before training is allowed to resume?
If a released model causes harm, how many people skeptical of AI doom concerns does it convince to change their minds and say “oh, actually this shouldn’t be allowed”? How many people, how much harm?
How much do people update toward AI in general being unsafe, versus just that particular AI from that particular company being unsafe, such that it alone should be blocked?
How much do people argue that even though there are signs of risk here, it’d be more dangerous to let others pull ahead?
And if you get people to pause for a while and focus on safety, how long will they agree to pause before the shock of the damage or the triggered eval gets normalized and explained away, and adequate justifications are assembled to keep going?
There are going to be people who fight tooth and nail, weight and bias, to keep development going. If we assume they are roughly as motivated and agentic as we are, who wins? Ultimately we have the harder challenge, in that we want to stop others from doing something, and I think the default is that people get to do things.
I think there’s a chance that various evals and regulations do meaningfully slow things down, but I write this to express the fear that they’re false reassurance: there’s traction only because people who want to build AI are betting this won’t actually require them to stop.
Related:
Would catching your AIs trying to escape convince AI developers to slow down or undeploy?
comment by Gwern
another comment by Gwern