The first line of defence is to avoid training models that have sufficient dangerous capabilities and misalignment to pose extreme risk. Sufficiently concerning evaluation results should warrant delaying a scheduled training run or pausing an existing one.
It’s very disappointing to me that this sentence doesn’t say “cancel”. As far as I understand, most people on this paper agree that we do not have alignment techniques to align superintelligence. Therefore, if the model evaluations predict an AI that is sufficiently smarter than humans, the training run should be cancelled.
Sure. Fwiw I read “delay” and “pause” as stop until it’s safe, not stop for a while and resume while the eval result is still concerning, but I agree being explicit would be nice.
Yeah, this is fair, and later in the section they say:
Careful scaling. If the developer is not confident it can train a safe model at the scale it initially had planned, they could instead train a smaller or otherwise weaker model.
Which is good, supports your interpretation, and gets close to the thing I want, albeit less explicitly than I would have liked.
I still think the “delay/pause” wording pretty strongly implies that the default is to wait for a short amount of time, and then keep going at the intended capability level. I think there’s some sort of implicit picture that the eval result will become unconcerning in a matter of weeks to months, which I just don’t see the mechanism for short of actually good alignment progress.