Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. Open-source is still fine below the capabilities benchmarks, and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.
Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels as closed-source ones? That is how I initially read it, but I’m not sure that’s the intended meaning.
More thoughts below, which are only semi-relevant if I misunderstood.
If I’m understanding the assumption correctly, the idea that the capabilities benchmark thresholds would be the same for open-source and closed-source LLMs surprised me[1], given (a) the irreversibility of open-source proliferation and (b) the lack of effective guardrails against misuse of open-source LLMs.
Perhaps the implicit argument is that, when doing risk evaluations, labs should assume their models will be leaked unless they have insanely good infosec, and so should effectively treat their models as open-source. Anthropic does say in their RSP:
To account for the possibility of model theft and subsequent fine-tuning, ASL-3 is intended to characterize the model’s underlying knowledge and abilities
This makes some sense to me, but consider the definition of ASL-3 under the assumption that the model is effectively open-sourced:
We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things
I understand that limiting the evaluation to 1% of the total training cost and to existing post-training techniques makes the risk more tractable to measure, but it strikes me as far from a conservative bound if we are assuming the model might be stolen and/or leaked. It might make sense to forecast how much the model would improve with more effort put into post-training, and/or as post-training techniques improve over subsequent years.
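To make concrete why the 1% bound seems far from conservative under a theft/leak assumption, here’s a rough back-of-the-envelope sketch; all of the numbers (training cost, adversary spend, efficiency trend) are hypothetical assumptions of mine, not figures from the RSP:

```python
# Back-of-the-envelope sketch with made-up numbers (not from the RSP) of how a
# leaked model's effective post-training budget could outgrow the 1% bound used
# at evaluation time.

TRAINING_COST = 100e6              # assume a $100M training run
EVAL_BOUND = 0.01 * TRAINING_COST  # the "less than 1% of total training cost" bound

ADVERSARY_ANNUAL_SPEND = 10e6      # assume a motivated actor spends $10M/year on post-training
EFFICIENCY_GAIN_PER_YEAR = 2.0     # assume post-training techniques get ~2x more cost-effective each year

for year in range(1, 4):
    # Dollars spent, scaled by how much cheaper equivalent enhancements have
    # become since the original evaluation.
    effective_budget = ADVERSARY_ANNUAL_SPEND * EFFICIENCY_GAIN_PER_YEAR ** year
    print(f"Year {year}: effective post-training budget is roughly "
          f"{effective_budget / EVAL_BOUND:.0f}x the evaluated 1% bound")
```

Under these (purely illustrative) assumptions the gap grows quickly, which is the intuition behind wanting a forecast rather than a one-time snapshot.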
Perhaps there should be a difference between accounting for model theft by a particular actor and completely open-sourcing, but then we’re back to the question of why the open-source capability benchmarks should be the same as the closed-source ones.
This is not to take a stance on the effects of open-sourcing LLMs at current capability levels, but rather to express surprise that the capability threshold at which open-sourcing becomes too dangerous would be the same as the closed-source one.
Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels as closed-source ones? That is how I initially read it, but I’m not sure that’s the intended meaning.
No, you’d want the benchmarks to be different for open-source vs. closed-source, since there are some risks (e.g. bio-related misuse) that are much scarier for open-source models. I tried to mention this here:
Early on, as models are not too powerful, almost all of the work is being done by capabilities evaluations that determine that the model isn’t capable of e.g. takeover. The safety evaluations are mostly around security and misuse risks.
I’ll edit the post to be more clear on this point.
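Concretely, the kind of separation I have in mind could look something like the following sketch; the benchmark names and threshold values are placeholders purely for illustration, not anything from an actual policy:

```python
# Minimal sketch of gating release type on capability evaluations.
# Benchmark names and threshold values are placeholders, purely illustrative.

THRESHOLDS = {
    # Open-weight release is gated more strictly, since misuse can't be
    # patched or monitored once the weights are out.
    "open_weights": {"bio_misuse_uplift": 0.1, "autonomous_replication": 0.1},
    # API-only release can tolerate somewhat higher capability, since access
    # can be monitored, filtered, and revoked.
    "closed_api":   {"bio_misuse_uplift": 0.4, "autonomous_replication": 0.4},
}

def release_allowed(eval_scores: dict[str, float], release_type: str) -> bool:
    """Allow a release only if every evaluated capability is below the
    threshold set for that release type."""
    return all(eval_scores[name] < limit
               for name, limit in THRESHOLDS[release_type].items())

# A model can be too capable for open-weight release while still being
# acceptable behind a monitored API.
scores = {"bio_misuse_uplift": 0.3, "autonomous_replication": 0.15}
print(release_allowed(scores, "open_weights"))  # False
print(release_allowed(scores, "closed_api"))    # True
```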