They were changing around their deployment system, infra, etc. We wanted uptime and throughput. Big pain in the ass to keep the model up (with proper access control) while they were overhauling stuff. Furthermore, anthropic and METR kept changing points of contact (rapidly growing teams).
This sounds like an unusual source of difficulty. Some Anthropic statements have suggested that sharing is hard in general. I hope it is just stuff like this. [Edit: possibly those statements were especially referring to deep model access or high-touch support. But then they don’t explain the labs’ lack of more basic sharing.]
Some Anthropic statements have suggested that sharing is hard in general.
If they said that then they are speaking nonsense IMO. Once you have your stuff set up it’s a button you click. You have to trust that the evaluator won’t leak info or soil your reputation without good cause though.
Possibly he didn’t just mean technically difficult. And possibly Politico took this out of context. But I agree this quote seems bad and clarification would be nice.
Thanks.
This sounds like an unusual source of difficulty. Some Anthropic statements have suggested that sharing is hard in general. I hope it is just stuff like this. [Edit: possibly those statements were especially referring to deep model access or high-touch support. But then they don’t explain the labs’ lack of more basic sharing.]
If they said that then they are speaking nonsense IMO. Once you have your stuff set up it’s a button you click. You have to trust that the evaluator won’t leak info or soil your reputation without good cause though.
Jack Clark: “Pre-deployment testing is a nice idea but very difficult to implement,” from https://www.politico.eu/article/rishi-sunak-ai-testing-tech-ai-safety-institute/
Possibly he didn’t just mean technically difficult. And possibly Politico took this out of context. But I agree this quote seems bad and clarification would be nice.