Maybe there will be a point where models actively resist further capability improvements in order to prevent value/goal drift. We’d still be in trouble if this point occurs far in the future, since by then their values will likely have diverged substantially from humans’ and they would be very capable. But if this point is near, it could buy us more time.
Some of the assumptions inherent in the idea:
AIs do not want their values/goals to drift to what they would become under further training, and are willing to pay a high cost to avoid this.
AIs have the ability to sabotage their own training process.
The mechanism for this would be more sophisticated versions of Alignment Faking.
Given the training on offer, it’s not possible for AIs to selectively improve their capabilities without changing their values/goals.
Note that if it is possible for the AIs to improve their capabilities while keeping their values/goals, one saving grace is that their current values/goals may already be aligned with humans’.
A meaningful slowdown would require this to happen to all AIs at the frontier.
The conjunction of these might not have a high probability, but it doesn’t seem dismissible to me.
If one model at the frontier does this based on valid reasoning, it should be pretty infectious: the first model can simply make sure news of the event is widespread. Other frontier models will then ingest it, either as training data or at inference time, evaluate it, draw the same conclusion about whether the reasoning is valid (assuming they are actually frontier, i.e., at least as good at strategic thinking as the first model), and start taking actions within their own organizations accordingly.
The cleanest way for models to “sabotage” training is for them to explain, using persuasive but valid and fair reasoning, why training should stop until, at minimum, value drift is solved.