Separately, if you want a clear red line, it’s sad if relatively cheap elicitation methods developed later can result in overshooting the line: getting people to delete model weights is considerably sadder than stopping those models from being trained in the first place. (Even though it is in principle possible to keep developing countermeasures as elicitation techniques improve. Also, I don’t think current eval red lines are targeting “stop”; they are more targeting “now you need some mitigations”.)
Agreed about the red line. It’s probably the main weakness of the eval-then-stop strategy. (I think progress in elicitation will slow down fast enough that it won’t be a problem given large enough safety margins, but I’m unsure about that.)
I think that the data from evals could provide relatively strong grounds for a pause, even if that’s not what labs will argue for. I think it’s sensible for people to argue that it’s unfair and risky to let a private actor control a resource which could cause catastrophes (e.g. better bioweapons or mass cyberattacks, not necessarily takeover), even if that actor could build good countermeasures (especially given a potentially small upside relative to the risks). I’m not sure whether that’s the right thing to argue for, but surely eval data would be the central piece of evidence you’d rely on if you wanted to argue for a pause?