As I asked someone who challenged this point on Twitter: if you think you have a test that is lighter-touch or more accurate than the compute threshold for determining where we need to monitor for potential dangers, then what is the proposal? So far, the only reasonable alternative I have heard is no alternative at all. Everyone seems to understand that ‘use benchmark scores’ would be worse.
As someone who thinks it’s a bad idea to try to write legislation focused on compute thresholds, because I believe that compute thresholds will suddenly become outdated in the not-so-distant future… I would far rather that legislators say something along the lines of: “We do not currently have a good way to measure how risky a given AI system is. As a first step, we are going to commission the creation of a battery of tests, some public and some classified, to thoroughly evaluate a given system. We will require all companies wishing to do business in our country to submit their models for our examination.”
I’ve been working for the past 8 months on trying to create good evaluations of AI biorisk. My team’s initial attempts were met with the accusation that our evaluations were insufficiently precise and objective. That’s not wrong. They were the best we could do at short notice, but far from adequate. We’ve been working hard since then to develop better evals, thorough and objective enough to convince skeptics. But this isn’t easy. It’s a labor-intensive process, and we can’t afford much labor. The US Federal Government CAN afford to hire a bunch of scientists to design, author, and review thousands of in-depth questions.
Criticisms of the biorisk evals so far have pointed out:
‘Yes, the models show a lot of book knowledge about virology and genetic engineering, but that’s because reciting facts from papers and textbooks plays to their strengths. Their high scores on such tests don’t imply the same level of understanding or skill or utility as would similarly high scores from a human expert. This fails to evaluate the most important bottlenecks such as the detailed tacit knowledge of hands-on wetlab skills.’
Sure, we need to check for both. But without adequate funding, how can we be expected to hire people to set up fake lab experiments, and photograph and videotape them going wrong, to create tests of whether models like GPT-4o can help troubleshoot well enough to be a significant uplift for inexpert lab workers? That’s inherently a time-intensive and material-intensive sort of test to create! And until we do, and then show that the AI models get low scores on those exams, we are operating under uncertainty about the models’ skills. Our critics assume the models are currently incapable of this and will remain so, but they offer no proof of that. Nor are they scrambling to create the tests which could prove the models’ incapability. Given the novel territory rapidly being broken by new models, we should start considering new models ‘dangerous until proven safe’, not ‘innocent until proven guilty’.
My vision of model regulation
To be clear, my goal is not to stifle model development and release, or to harm the open-source community. I expect the process of evaluating models to be something we can do cheaply, automatically, and quickly. You submit your model weights and code through a web form, and get back a thumbs-up within minutes. It’s free and easy. You never see anyone fail. The first failure will very likely occur when one of the largest labs submits their latest experimental model’s checkpoint, long before they’d even considered releasing it publicly, just to satisfy their curiosity. And when that day comes, we will all be immensely grateful that we had the safety checks in place.
The expense of designing, creating, and operating this will be substantial. But it is in the service of preventing a national security catastrophe, so it seems to me like a very worthwhile expenditure of taxpayer funds.