> instead this seems to be penalizing organizations if they open source
I initially thought this was wrong, but on further inspection, I agree and this seems to be a bug.
The deployment criteria start with:
> the lab should deploy its most powerful models privately or release via API or similar, or at least have some specific risk-assessment-result that would make it stop releasing model weights
This criterion seems to allow the lab to meet it by having good risk assessment criteria, but the rest of the criteria contain specific countermeasures that:
Are impossible to consistently impose if you make weights open (e.g. Enforcement and KYC).
Don’t pass cost-benefit for current models, which pose low risk. (And it seems the criterion is “do you have them implemented right now?”)
If the lab had an excellent risk assessment policy and released weights when the cost/benefit seemed good, that should be fine according to the “deployment” criteria IMO.
Generally, the deployment criteria should be gated behind “has a plan to do this when models are actually powerful and their implementation of the plan is credible”.
I get the sense that this criteria doesn’t quite handle the edge cases necessary to accommodate reasonable choices orgs might make.
(This is partially my fault as I didn’t notice this when providing feedback on this project.)
(IMO making weights accessible is probably good on current margins, e.g. llama-3-70b would be good to release so long as it is part of an overall good policy, is not setting a bad precedent, and doesn’t leak architecture secrets.)
(A general problem with this project is somewhat arbitrarily requiring specific countermeasures. I think this is probably intrinsic to the approach I’m afraid.)
Thanks. I agree you’re pointing at something flawed in the current version and generally thorny. Strong-upvoted and strong-agreevoted.
> Generally, the deployment criteria should be gated behind “has a plan to do this when models are actually powerful and their implementation of the plan is credible”.
I didn’t put much effort into clarifying this kind of thing because it’s currently moot—I don’t think it would change any lab’s score—but I agree.[1] I think e.g. a criterion “use KYC” should technically be replaced with “use KYC OR say/demonstrate that you’re prepared to implement KYC and have some capability/risk threshold to implement it and [that threshold isn’t too high].”
> Don’t pass cost-benefit for current models, which pose low risk. (And it seems the criterion is “do you have them implemented right now?”) . . . .
> (A general problem with this project is somewhat arbitrarily requiring specific countermeasures. I think this is probably intrinsic to the approach I’m afraid.)
Yeah. The criteria can be like “implement them or demonstrate that you could implement them and have a good plan to do so,” but it would sometimes be reasonable for the lab to not have done this yet. (Especially for non-frontier labs; the deployment criteria mostly don’t work well for evaluating non-frontier labs. Also if demonstrating that you could implement something is difficult, even if you could implement it.)
> I get the sense that this criteria doesn’t quite handle the edge cases necessary to accommodate reasonable choices orgs might make.
I’m interested in suggestions :shrug:
And I think my site says some things that contradict this principle, like ‘these criteria require keeping weights private.’ Oops.
[1] Related: maybe a lab should get full points for a risky release if the lab says it’s releasing because the benefits of [informing / scaring / waking-up] people outweigh the direct risk of existential catastrophe and other downsides. It’s conceivable that a perfectly responsible lab would do such a thing. Capturing all nuances can trade off against simplicity and legibility. (But my criteria are not yet on the efficient frontier or whatever.)
Hmm, yeah it does seem thorny if you can get the points by just saying you’ll do something.
Like I absolutely think this shouldn’t count for security. I think you should have to demonstrate actual security of model weights and I can’t think of any demonstration of “we have the capacity to do security” which I would find fully convincing. (Though setting up some inference server at some point which is secure to highly resourced pen testers would be reasonably compelling for demonstrating part of the security portfolio.)