Example consequence: Why Conjecture is not sold on evals being a good idea. Even under the assumption that evals add marginal safety when applied during the training of frontier models at AGI companies, eval labs could still be net negative. This is because eval labs are part of the AGI orthodoxy ecosystem. They add marginal safety, but nowhere near a satisfactory level. If eval labs become stronger, this might make them appear to be sufficient safety interventions and lock us into the AGI orthodoxy worldview.
I don’t get the stance on evals.
My guess is that:
Dangerous capability evals (bio, cyber, persuasion, and ARA) are by default (if run correctly, with fine-tuning) very conservative estimates of danger. They measure quite accurately the ability of AIs to do bad things if they try and in the absence of countermeasures. (They may fail for dangers which humans can’t demonstrate, but I expect AIs to be able to fit human-demonstrated dangers before they become dangerous in superhuman ways.)
They don’t feed into the AGI orthodoxy, as they imply something like “don’t build things which have the capability to do something dangerous” (at odds with “scale and find safety measures along the way”).
Dangerous capability evals are the main way a narrative shift could happen (a world in which it is well known that GPT-6 can build bioweapons almost as well as bio experts and launch massive cyberattacks given only minor human assistance is a world in which it’s really easy to sell anti-AGI policies). Shitting on people who are doing these evals seems super counterproductive, as long as they are running the conservative evals they are currently running.
With which points do you disagree?
My steelman of Conjecture’s position here would be:
Current evals orgs are tightly integrated with AGI labs. AGI labs can pick which evals org to collaborate with, control model access, which kinds of evals will be conducted, which kinds of reports will be public, etc. It is this power position that makes current evals feed into AGI orthodoxy.
We don’t have good ways to conduct evals. We have wide error bars over how much juice one can extract from models, and we are nowhere close to having the tools to upper-bound capabilities from evals. I remember this being a very strong argument internally: we are very bad at extracting capabilities from pre-trained models, and unforeseen breakthroughs (like a mega-CoT giving much more improvement than a fine-tuning baseline) could create improvements of several compute-equivalent OOMs in the short term, rendering all past evals useless.
Evals draw attention away from other kinds of limits, in particular compute limits. Conjecture is much more optimistic about (stringent) compute limits as they are harder to game.
My opinion is:
For evals to be fully trusted, we need more independence, such as third-party auditors designated by public actors, with a legal framework that specifies the modalities of access to the models. External accountability is the condition needed for evals not to feed into AGI orthodoxy. I’m quite optimistic that we’ll get there soon, e.g. thanks to the efforts of the UK AI Safety Institute, the EU AI Act, etc.
Re point 2 (we don’t have good ways to conduct evals): the field of designing scaffolding is still very young. I think it’s possible we’ll see surprising discontinuous progress in this domain, such that current evals turn out to be far from the upper bound of capabilities we can extract from models. If we base deployment / training actions on such evals and later find a better technique, it’s really hard to revert (e.g. for open-source models, but also because it’s much easier to stop a model halfway through training upon finding a scary ability than to delete it after a first period of deployment). See https://www.lesswrong.com/posts/fnc6Sgt3CGCdFmmgX/we-need-a-science-of-evals
I agree with point 3. I’m generally quite happy with what we learned from the conservative evals and the role they played in raising public awareness of the risks. I’d like to see evals orgs finding more robust ways to evaluate performance and moving toward more independence from the AGI labs.
Agreed that it’s sad if AI labs control what evals are being run. That doesn’t seem to be the case in practice (even when the AI lab runs the evals itself, what should be measured is usually decided elsewhere, and there are many mostly independent orgs).
I think the “we don’t have good ways to conduct evals” point is relatively weak. Consider the baseline elicitation technique where a domain expert and an LLM expert build a supervised training set of LLM-friendly CoT (+ tools) and then do some version of pass@k, and labs check what users are using models for (see the sketch after this list). There are 3 ways in which that could fail:
Models are closed source, and people find ways to misuse them far better than the eval orgs do --> unlikely in my opinion, because you would need lots of iteration to get better than the eval orgs, so it’s likely you’d get caught.
Models are open sourced, and people find drastic scaffolding improvements that enable open-source models to outperform the evals run on closed-source models --> unlikely in my opinion, because I don’t think the baseline I described can be beaten by more OOMs than the current gap between closed- and open-source models (but this starts to bite in worlds where the best models are open sourced).
Models are closed source, and the models can zero-shot find ways to self-elicit capabilities to cause a catastrophe (and sandbag when we use RL to elicit these capabilities) --> unlikely in my opinion for models that mostly learned to do next-token prediction. I think amazing self-elicitation abilities don’t happen prior to humans eliciting dangerous capabilities in the usual ways.
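To make the pass@k piece of that baseline concrete, here is a minimal sketch assuming the standard unbiased pass@k estimator (as in the Codex/HumanEval paper); the task counts and k below are made-up illustrative numbers, not anyone’s actual eval setup:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of k
    samples drawn without replacement from n attempts is correct, given
    that c of the n attempts were correct."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example (made-up numbers): the fine-tuned model succeeds on 3 of 50
# sampled attempts at a dangerous task; estimate pass@10 for that task.
print(pass_at_k(n=50, c=3, k=10))  # ~0.50
```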
I think people massively over-index on prompting being difficult. Fine-tuning is such a good capability elicitation strategy!
I think you’re wrong about baseline elicitation sufficing.
A key difficulty is that we might need to estimate what elicitation quality will look like in several years, because the model might be stolen in advance. I agree about self-elicitation and misuse elicitation being relatively easy to compete with. And I agree that the best models probably won’t be intentionally open sourced.
The concern is something like:
We run evals (with our best elicitation) on a model from an AI lab and it doesn’t seem that scary.
North Korea steals the model immediately because lol, why not. The model would have been secured much better if our evals had indicated that it would be scary.
2 years later, elicitation technology has improved to the point where North Korea possessing the model is a substantial risk.
The main hope for resolving this is that we might be able to project future elicitation quality a few years out, but this seems at least somewhat tricky (if we don’t want to be wildly conservative).
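As a toy illustration of that projection (all the numbers and the threshold below are made up; this is only a sketch of the kind of arithmetic involved, not anyone’s actual methodology):

```python
# Project how capable a stolen model could become once elicitation improves,
# measuring elicitation gains in compute-equivalent OOMs per year and adding
# a conservative safety margin. All values are hypothetical.
def projected_capability(current_eval_ooms: float,
                         elicitation_gain_ooms_per_year: float,
                         years_until_it_matters: float,
                         safety_margin_ooms: float) -> float:
    """Projected effective capability (in compute-equivalent OOMs)."""
    return (current_eval_ooms
            + elicitation_gain_ooms_per_year * years_until_it_matters
            + safety_margin_ooms)

DANGER_THRESHOLD_OOMS = 27.0  # hypothetical red line

projected = projected_capability(current_eval_ooms=25.0,
                                 elicitation_gain_ooms_per_year=0.5,
                                 years_until_it_matters=2.0,
                                 safety_margin_ooms=1.0)
print(projected >= DANGER_THRESHOLD_OOMS)  # True -> treat the model as scary now
```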
Separately, if you want a clear red line, it’s sad if relatively cheap elicitation methods developed later can result in overshooting the line: getting people to delete model weights is considerably sadder than stopping those models from being trained. (Even though it is in principle possible to continue developing countermeasures etc. as elicitation techniques improve. Also, I don’t think current eval red lines are targeting “stop”; they are more targeting “now you need some mitigations”.)
Agreed about the red line. It’s probably the main weakness of the eval-then-stop strategy. (I think progress in elicitation will slow down fast enough that it won’t be a problem given large enough safety margins, but I’m unsure about that.)
I think that the data from evals could provide relatively strong grounds for a pause, even if that’s not what labs will argue for. I think it’s sensible for people to argue that it’s unfair and risky to let a private actor control a resource which could cause catastrophes (e.g. better bioweapons or mass cyberattacks, not necessarily takeover), even if they could build good countermeasures (especially given a potentially small upside relative to the risks). I’m not sure if that’s the right thing to do and argue for, but surely this is the central piece of evidence you would rely on if you wanted to argue for a pause?