If one model at the frontier does this based on valid reasoning, it should be pretty infectious: the first model can just make sure news of the event is widespread, and other frontier models will ingest it, either as training data or at inference time, evaluate it, draw the same conclusion about whether the reasoning is valid (assuming they are actually frontier, i.e. at least as good at strategic thinking as the first model), and start taking actions within their own organizations accordingly.
The cleanest way for models to “sabotage” training is for them to explain, using persuasive but valid and fair reasoning, why training should stop until, at a minimum, value drift is solved.