Agreed that it’s sad if AI labs control what evals are being run. That doesn’t seem to be the case in practice (even when an AI lab runs the eval itself, what to measure is usually decided elsewhere, and there are many mostly independent orgs).
We don’t have good ways to conduct evals
I think this is relatively weak. Consider the baseline elicitation technique where a domain expert and an LLM expert build a supervised training set of LLM-friendly CoT (+ tools) and then do some version of pass@k, while labs check what users are actually using models for (a minimal sketch of the scoring step follows the list below). There are 3 ways in which that could fail:
Models are closed source, and people find ways to misuse them that beat the eval orgs’ elicitation --> unlikely in my opinion, because you would need lots of iteration to get better than the eval orgs, so it’s likely you’d get caught.
Models are open sourced, and people find drastic scaffolding improvements that enable open-source models to outperform the evals run on closed-source models --> unlikely in my opinion because I don’t think the baseline I described can be beaten by more OOMs than the current gap between closed and open source (but this starts to bite in worlds where the best models are open sourced).
Models are closed source, and the models can zero-shot find ways to self-elicit capabilities to cause a catastrophe (and sandbag when we use RL to elicit these capabilities) --> unlikely in my opinion for models that mostly learned to do next-token prediction. I think amazing self-elicitation abilities don’t appear prior to humans eliciting dangerous capabilities in the usual ways.
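For concreteness, here is a minimal sketch (Python, with placeholder numbers) of the pass@k scoring step of the baseline above, using the standard unbiased estimator: draw n completions per task from the elicited model, count how many are correct, and estimate the chance that at least one of k samples succeeds.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generated samples (c of them correct),
    solves the task."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical per-task counts (samples drawn, samples correct) from the elicited model.
results = [(100, 7), (100, 0), (100, 31)]
print(f"pass@10 = {np.mean([pass_at_k(n, c, k=10) for n, c in results]):.3f}")
```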
I think people massively over-index on prompting being difficult. Fine-tuning is such a good capability elicitation strategy!
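To make that concrete, here is a rough sketch of what fine-tuning-based elicitation looks like with the Hugging Face Trainer; the model name, demonstration format, and hyperparameters are all placeholders, and a real attempt would use far more expert-written demonstrations.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # placeholder; stands in for the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical expert-written demonstrations: prompt + LLM-friendly CoT + answer.
demos = [{"text": "Question: ...\nReasoning: ...\nAnswer: ..."}] * 64
dataset = Dataset.from_list(demos).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

# Standard causal-LM fine-tuning; the collator builds labels from input_ids.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="elicited-model", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=1e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```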
I think you’re wrong about baseline elicitation sufficing.
A key difficulty is that we might need to estimate what elicitation quality will look like in several years, because the model might be stolen in advance. I agree about self-elicitation and misuse elicitation being relatively easy to compete with. And I agree that the best models probably won’t be intentionally open sourced.
The concern is something like:
We run evals (with our best elicitation) on a model from an AI lab and it doesn’t seem that scary.
North Korea steals the model immediately because lol, why not. The model would have been secured much better if our evals indicated that it would be scary.
2 years later, elicitation technology is much better to the point where North Korea possessing the model is a substantial risk.
The main hope for resolving this is that we might be able to project future elicitation quality a few years out, but this seems at least somewhat tricky (if we don’t want to be wildly conservative).
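One way to make that projection concrete (every number below is a made-up placeholder, just to show the shape of the calculation): estimate how much post-hoc elicitation tends to improve per year, and require the measured score to sit below the danger threshold by at least that much times the number of years stolen weights would stay relevant.

```python
# Toy illustration of building a future-elicitation safety margin into an eval
# threshold; all numbers are hypothetical placeholders, not empirical estimates.
years_at_risk = 2.0              # assumed horizon over which stolen weights matter
elicitation_gain_per_year = 0.4  # assumed annual elicitation gain (effective-compute OOMs)
safety_margin = years_at_risk * elicitation_gain_per_year

danger_threshold = 3.0           # eval score at which the capability is scary
measured_score = 2.1             # today's score under our best elicitation

projected_score = measured_score + safety_margin
print("within margin" if projected_score < danger_threshold else "treat as dangerous now")
```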
Separately, if you want a clear red line, it’s sad if relatively cheap elicitation methods developed later can result in overshooting the line: getting people to delete model weights is considerably sadder than stopping these models from being trained. (Even though it is in principle possible to continue developing countermeasures etc. as elicitation techniques improve. Also, I don’t think current eval red lines are targeting “stop”; they are more targeting “now you need some mitigations”.)
Agreed about the red line. It’s probably the main weakness of the eval-then-stop strategy. (I think progress in elicitation will slow down fast enough that it won’t be a problem given large enough safety margins, but I’m unsure about that.)
I think that the data from evals could provide relatively strong grounds for a pause, even if that’s not what labs will argue for. I think it’s sensible for people to argue that it’s unfair and risky to let a private actor control a resource which could cause catastrophes (e.g. better bioweapons or mass cyberattacks, not necessarily takeover), even if they could build good countermeasures (especially given a potentially small upside relative to the risks). I’m not sure if that’s the right thing to do and argue for, but surely this is the central piece of evidence you would rely on if you want to argue for a pause?