… and yet...
Why are you sure that effective “evals” can exist even in principle?
I think I’m seeing a “we really want this, therefore it must be possible” shift here.
Relatedly, the point that is least clear to me is what exactly it would mean to solve the “proper elicitation problem”, and what exactly the “requirements” laid out by the blue line on the graph are. I think I’d need to get clear on this problem scope before I could begin to assess whether the elicitation gap can even in principle be crossed via the methods being proposed (i.e. better design and coverage of black-box evaluations).
As a non-example: possessing the kind of foundational scientific understanding that would allow someone to confidently say “We have run this evaluation suite and we now know once and for all that this system is definitely not capable of x, regardless of whatever elicitation techniques are developed in the future” seems to me to be Science-of-AI-complete, and is thus a non-starter as a north star for an agenda aimed at developing stronger inability arguments.
When I fast forward the development of black box evals aimed at supporting inability arguments, I see us arriving at a place where we have:
More SWE-Benchesque evaluations across critical domains which are perhaps “more Verified” by having higher quality expert judgement passed upon them.
Some kind of library which brings together a family of different SOTA prompting and finetuning recipes to apply to any evaluation scenario.
More data points and stronger forecasts for post training enhancements (PTEs).
This would allow us to make the claim “Given these trends in PTEs, and this coverage in evaluations, experts have vibed out that the probability of this model being capable of producing catastrophe x is under an acceptable threshold” for a wider range of domains. To be clear, that’s a better place than we are now and something worth striving for, but not something I would qualify as “having solved the elicitation problem”. There are fundamental limitations to the kinds of claims that black-box evaluations can reasonably support, and if we are to posit that the “elicitation gap” is solvable, the claim needs to have the right sorts of qualifications, amendments and hedging so that it lands on the right side of this fundamental divide.
Note: I don’t work on evals and expect that others have better models than this. My guess is that @Marius Hobbhahn has strong hopes for the field developing more formal statistical guarantees and other meta-evaluative practices, as outlined in the references in the science-of-evals post, and would thus predict a stronger safety-case sketch than the one laid out in the previous paragraph; but what the type signature of that sketch would be, and consequently how reasonable it is given the fundamental limitations of black-box evaluations, is currently unclear to me.
These are all good points. I think there are two types of forecasts we could make with evals:
1. Strict guarantees: almost like mathematical predictions, where we can prove that the model is not going to behave in a specific way even with future elicitation techniques.
2. Probabilistic predictions: we predict a distribution of capabilities, or a range, and agree on a threshold that should not be crossed. For example, if the 95% upper bound of that distribution crosses our specified capability level, we treat the model differently (a minimal sketch of this check follows below).
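To make the second type concrete, here is a minimal sketch of that threshold check, assuming we already have samples from a forecast distribution of the model’s future post-elicitation capability; the log-normal form, the score units, and the threshold value are all illustrative assumptions, not anything from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical forecast: samples from our predicted distribution of the
# model's capability score after future elicitation effort (assumed form).
forecast_samples = rng.lognormal(mean=3.0, sigma=0.4, size=100_000)

# Agreed-upon capability level that should not be crossed (assumed value).
dangerous_capability_level = 40.0

# Decision rule from (2): compare the 95% upper bound to the threshold.
upper_95 = np.percentile(forecast_samples, 95)
if upper_95 >= dangerous_capability_level:
    print(f"95% upper bound {upper_95:.1f} crosses the threshold: "
          "treat the model differently.")
else:
    print(f"95% upper bound {upper_95:.1f} stays below the threshold.")
```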
I think the second is achievable (and this is what the post is about), while the first is not. I expect we will have some sort of detailed scaling laws for LM agent capabilities and we will have a decent sense of the algorithmic progress of elicitation techniques. This would allow us to make a probabilistic prediction about what capabilities any given model is likely to have, e.g. if a well-motivated actor is willing to spend $1M on PTE in 4 years.
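As a hedged illustration of what such a forecast could look like mechanically, the sketch below fits capability against log elicitation spend on made-up historical points and extrapolates to a $1M budget; the linear-in-log-spend form, the data, and the error band are all assumptions:

```python
import numpy as np

# Made-up past data: (elicitation spend in $, best elicited capability score).
spend = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
score = np.array([12.0, 14.1, 16.2, 17.9, 20.3])

# Fit score as a linear function of log10(spend).
x = np.log10(spend)
(slope, intercept), residuals, *_ = np.polyfit(x, score, deg=1, full=True)
sigma = np.sqrt(residuals[0] / (len(x) - 2))  # rough residual std

# Extrapolate to a $1M elicitation budget.
prediction = slope * np.log10(1e6) + intercept
print(f"Predicted score at $1M: {prediction:.1f} +/- {2 * sigma:.1f} "
      "(rough 95% band)")
```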
Additionally, I expect that we would get quite a long way with what Lucas calls “meta-evaluative practices”, e.g. getting a better sense of how wrong our past predictions were and accounting for that. I think this could have the form of “We invested $1M, 10 FTE-years and X FLOP to elicit the best capabilities; let’s predict what 10x, 100x, 1000x, etc. of that could achieve, accounting for algorithmic progress.”
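A sketch of that meta-evaluative correction, under the assumption that we have a small history of (forecast, realized) pairs from past models; the data and the simple bias-plus-spread rule are illustrative:

```python
import numpy as np

# Made-up history: eval-based forecast vs. score realized after further PTE.
past_forecast = np.array([15.0, 22.0, 30.0, 41.0])
past_realized = np.array([17.5, 21.0, 34.0, 47.0])

errors = past_realized - past_forecast
bias = errors.mean()          # were past predictions systematically low?
spread = errors.std(ddof=1)   # how noisy were they?

# Apply the historical error to a new point forecast (assumed value).
new_forecast = 55.0
print(f"Bias-corrected: {new_forecast + bias:.1f} +/- {2 * spread:.1f}")
```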
Finally, I really think evals are just one part of a bigger defense-in-depth strategy. We still need control, scalable oversight, interpretability, governance, etc. The post is merely trying to express that, for the evals part of that strategy, we should internalize what kind of scientific rigor we will likely need for the decisions we have tied to evals results, and make sure that we can achieve it.
For context, I just trialed at METR and talked to various people there, but this take is my own.
I think further development of evals is likely to yield either effective evals (an informal upper bound on the future probability of catastrophe) or exciting negative results (“models do not follow reliable scaling laws, so AI development should be accordingly more cautious”).
The way to do this is just to examine models and fit scaling laws for catastrophe propensity, or various precursors thereof. Scaling laws would be fit as a function of elicitation quality as well as things like pretraining compute, RL compute, and thinking time.
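A minimal sketch of such a fit, assuming we could score a catastrophe precursor across many training runs; the covariates, the additive-in-logs functional form, and the synthetic data are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Made-up covariates per run: log pretraining FLOP, log RL FLOP,
# log thinking-time tokens, and a normalized elicitation-quality score.
log_pretrain = rng.uniform(20, 26, n)
log_rl = rng.uniform(18, 23, n)
log_think = rng.uniform(0, 8, n)
elicitation = rng.uniform(0, 1, n)

# Synthetic precursor score with an assumed "true" additive relationship.
score = (1.5 * log_pretrain + 0.7 * log_rl + 0.4 * log_think
         + 5.0 * elicitation + rng.normal(0, 1.0, n))

# Ordinary least squares fit of the scaling law.
X = np.column_stack([log_pretrain, log_rl, log_think, elicitation,
                     np.ones(n)])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
print("Fitted coefficients (pretrain, RL, thinking, elicitation, intercept):")
print(np.round(beta, 2))
```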
In a world where elicitation quality has very reliable scaling laws, we would observe diminishing returns to better scaffolds. Elicitation quality would be predictable: ideally an additive term on top of model quality, but more likely requiring some further information about the model. It would be rare to ever discover a new scaffold that can 2x the performance of an already well-tested model.
In a world where elicitation quality is not reliably modelable, we would observe that different methods of elicitation routinely get wildly different bottom-line performance, and sometimes a new elicitation method makes models 10x smarter than before, making error bars on the best undiscovered elicitation method very wide. Different models may benefit from different elicitation methods, and some get 10x benefits while others are unaffected.
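One way to tell these worlds apart from data, sketched under the assumption that we can score every model under every elicitation method: decompose a models-by-methods score matrix into additive model and method effects, and check how much variance the leftover interactions carry. All data here are made up; a large, heavy-tailed interaction term would look like the second world:

```python
import numpy as np

rng = np.random.default_rng(2)
scores = rng.normal(50, 5, size=(6, 8))  # 6 models x 8 elicitation methods
scores[3, 5] += 25.0                     # plant one "new method, huge jump" outlier

# Additive decomposition: grand mean + model effect + method effect.
grand = scores.mean()
model_eff = scores.mean(axis=1, keepdims=True) - grand
method_eff = scores.mean(axis=0, keepdims=True) - grand
interaction = scores - (grand + model_eff + method_eff)

explained = 1 - interaction.var() / scores.var()
print(f"Variance explained by additive effects: {explained:.0%}")
```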
It is NOT KNOWN which world we are in (worst-case assumptions would put us in World 2, though I’m optimistic we’re closer to World 1 in practice), and determining this is just a matter of data collection. If our evals are still not good enough but we don’t seem to be in World 2 either, there are endless tricks to add that make evals more thorough, some of which are already being used: evaluating models with limited human assistance, or dividing tasks into subtasks and sampling a huge number of tries for each.
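For the subtask trick specifically, a hedged sketch: estimate each subtask’s per-try success rate from many samples, then combine the estimates under an (assumed) independence model to get an optimistic estimate of full-task success when the agent is granted many tries per subtask; the rates, sample counts, and independence assumption are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples = 10_000  # tries sampled per subtask
k = 100             # tries granted per subtask in the estimate

# Unknown-in-practice "true" per-try success rates for three subtasks.
true_rates = [0.02, 0.30, 0.15]
estimated = [rng.binomial(n_samples, p) / n_samples for p in true_rates]

# P(at least one success in k tries) per subtask, multiplied across
# subtasks under the independence assumption.
per_subtask = [1 - (1 - p) ** k for p in estimated]
print(f"Optimistic full-task success estimate: {np.prod(per_subtask):.3f}")
```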
I agree. In which case, I think the concrete proposal of “We need to invest more resources in this” is even more important. That way, we can find out if it’s impossible soon enough to use it as justification to make people stop pretending they’ve got it under control.
Yeah, it’s not a watertight argument and somewhat based on my current interpretation of past progress and projects in the making.
1. Intuitively, I would say that for the problems we’re facing in evals, a ton of progress is bottlenecked by running fairly simple experiments and iterating fast. A reasonable part of it feels very parallelizable, and the skill required is within reach for many people.
2. For most evals questions, it feels like we have a decent number of “obvious things” to try, and since we have very tight feedback loops, making progress feels quite doable.
Intuitively, the “hardness level” of getting to a robust science of evals with good coverage may be similar to going from the first transformer to GPT-3.5: you need to make a lot of design choices along the way, do lots of research, and spend some money, but ultimately it’s just “do much more of the process you’re currently doing” (though we should probably spend more resources and intensify our efforts, because I don’t feel like we’re on pace).
In contrast, there are other questions like “how do we fully map the human brain” that just seem like they come with a lot more fundamental questions along the way.