I think you’re wrong about baseline elicitation sufficing.
A key difficulty is that we might need to estimate what elicitation quality will look like in several years, because the model might be stolen well in advance. I agree about self-elicitation and misuse elicitation being relatively easy to compete with. And I agree that the best models probably won’t be intentionally open sourced.
The concern is something like:
1. We run evals (with our best elicitation) on a model from an AI lab and it doesn’t seem that scary.
2. North Korea steals the model immediately because lol, why not. The model would have been secured much better had our evals indicated that it would be scary.
3. Two years later, elicitation technology is much better, to the point where North Korea possessing the model is a substantial risk.
The main hope for resolving this is that we might be able to project future elicitation quality a few years out, but this seems at least somewhat tricky (if we don’t want to be wildly conservative).
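To make the projection idea concrete, here is a toy sketch: extrapolate the historical trend in best-elicited eval scores forward a few years and add a conservative safety margin. The function name, the data, and the linear-trend assumption are all hypothetical and purely illustrative; real elicitation gains need not be linear, which is part of why the projection is tricky.

```python
# Toy sketch of projecting future elicitation quality (all numbers illustrative).

def project_elicitation(history, years_ahead, margin=0.1):
    """Extrapolate an eval score (0-1) assuming past per-year elicitation
    gains continue, then add a conservative safety margin.
    `history` is a list of (year, best_elicited_score) pairs."""
    (y0, s0), (y1, s1) = history[0], history[-1]
    per_year = (s1 - s0) / (y1 - y0)  # average gain per year observed so far
    projected = s1 + per_year * years_ahead
    return min(1.0, projected + margin)  # cap at the max possible score

# Illustrative: scores rose ~0.05/year; project 2 years out with a 0.1 margin.
history = [(2022, 0.30), (2023, 0.35), (2024, 0.40)]
print(project_elicitation(history, years_ahead=2))  # 0.6
```

The tension in the text shows up directly in the `margin` parameter: a large margin avoids underestimating stolen-model risk but makes the eval wildly conservative, while a small margin risks exactly the North Korea scenario above.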
Separately, if you want a clear red line, it’s sad if relatively cheap elicitation methods developed later can push models over that line after the fact: getting people to delete model weights is considerably sadder than stopping those models from being trained in the first place. (Even though it is in principle possible to keep developing countermeasures as elicitation techniques improve. Also, I don’t think current eval red lines are targeting “stop”; they are more targeting “now you need some mitigations”.)
Agreed about the red line. It’s probably the main weakness of the eval-then-stop strategy. (I think progress in elicitation will slow down fast enough that it won’t be a problem given large enough safety margins, but I’m unsure about that.)
I think that eval data could provide relatively strong grounds for a pause, even if that’s not what labs will argue for. It seems sensible for people to argue that it’s unfair and risky to let a private actor control a resource that could cause catastrophes (e.g. better bioweapons or mass cyberattacks, not necessarily takeover), even if that actor could build good countermeasures (especially given a potentially small upside relative to the risks). I’m not sure if that’s the right thing to do and argue for, but surely eval results are the central piece of evidence you would rely on if you wanted to argue for a pause?