These are all good points. I think there are two types of forecasts we could make with evals:
1. Strict guarantees: almost like mathematical predictions, where we can prove that the model is not going to behave in a specific way even with future elicitation techniques.
2. Probabilistic predictions: we predict a distribution or range of capabilities and agree on a threshold that should not be crossed. For example, if the 95% upper bound of that distribution crosses our specified capability level, we treat the model differently.
I think the second is achievable (and this is what the post is about), while the first is not. I expect we will have reasonably detailed scaling laws for LM agent capabilities and a decent sense of the algorithmic progress of elicitation techniques. This would allow us to make a probabilistic prediction about what capabilities any given model is likely to have, e.g. if a well-motivated actor is willing to spend $1M on PTE in 4 years.
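To make the second type more concrete, here is a toy sketch of what that decision rule could look like. Everything in it (the linear-in-log-spend model, the numbers, the one-sided 95% bound, the threshold) is an illustrative assumption on my part, not anything from the post:

```python
# Toy sketch: forecast a capability score from elicitation spend, then apply
# the "95% upper bound vs. threshold" decision rule. All data and constants
# below are made up for illustration.
import numpy as np

# Assumed historical data: elicitation spend (USD) vs. observed eval score.
spend = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
score = np.array([0.22, 0.30, 0.41, 0.47, 0.55])

# Fit a simple log-spend scaling trend (an assumption, not a known law).
X = np.column_stack([np.ones_like(spend), np.log10(spend)])
coef, residuals, *_ = np.linalg.lstsq(X, score, rcond=None)
sigma = np.sqrt(residuals[0] / (len(score) - 2))  # residual std. dev.

def predict(spend_usd: float) -> tuple[float, float]:
    """Mean prediction and one-sided 95% upper bound at a given spend."""
    mean = coef[0] + coef[1] * np.log10(spend_usd)
    upper = mean + 1.645 * sigma  # 95% upper bound under normal errors
    return mean, upper

THRESHOLD = 0.70  # capability level that triggers stricter treatment

mean, upper = predict(1e6)  # e.g. a $1M elicitation effort
print(f"mean={mean:.2f}, 95% upper bound={upper:.2f}")
if upper > THRESHOLD:
    print("Upper bound crosses the threshold -> treat the model differently.")
else:
    print("Below threshold under this (toy) model.")
```

In practice the fitted trend would presumably be much richer (compute, FTE-years, task family, etc.), but the decision structure stays the same: compare the upper bound of the predicted capability distribution against the agreed threshold.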
Additionally, I expect that we would get quite a long way with what Lucas calls “meta-evaluative practices”, e.g. getting a better sense of how wrong our past predictions were and accounting for that. I think this could take the form of “We invested $1M, 10 FTE-years, and X FLOP to elicit the best capabilities; let’s predict what 10x, 100x, 1000x, etc. of that could achieve, accounting for algorithmic progress.”
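A similarly hedged sketch of that extrapolation step; every constant here (the gain per decade of budget, the algorithmic-progress multiplier, the bias correction from past forecast misses) is a made-up placeholder:

```python
# Toy sketch of the "meta-evaluative" extrapolation: take the resources we
# actually invested in elicitation, project 10x/100x/1000x of that, and adjust
# for (a) an assumed rate of algorithmic progress and (b) how wrong our past
# predictions were. All numbers are illustrative assumptions.
import math

BASE_BUDGET_USD = 1_000_000       # what we actually spent on elicitation
GAIN_PER_DECADE = 0.08            # assumed score gain per 10x budget
ALGO_PROGRESS_PER_YEAR = 1.5      # assumed effective-budget multiplier per year
PAST_FORECAST_BIAS = 1.3          # assume we historically underestimated gains by ~30%

def projected_gain(budget_multiplier: float, years_ahead: float) -> float:
    """Projected capability gain over today's best elicited score."""
    # Algorithmic progress makes a future dollar buy more elicitation.
    effective_multiplier = budget_multiplier * ALGO_PROGRESS_PER_YEAR ** years_ahead
    raw_gain = GAIN_PER_DECADE * math.log10(effective_multiplier)
    return raw_gain * PAST_FORECAST_BIAS  # meta-evaluative correction

for mult in (10, 100, 1000):
    print(f"{mult:>5}x budget, 4 years out: +{projected_gain(mult, 4):.2f} score")
```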
Finally, I really think evals are just one part of a bigger defense-in-depth strategy. We still need control, scalable oversight, interpretability, governance, etc. The post is merely trying to express that, for the evals part of that strategy, we should internalize what kind of scientific rigor we will likely need for the decisions tied to eval results and make sure we can achieve it.