Could someone who thinks capabilities benchmarks are safety work explain the basic idea to me?
It’s not all that valuable for my personal work to know how good models are at ML tasks. Is it supposed to be valuable to legislators writing regulation? To SWAT teams calculating when to bust down the datacenter door and turn the power off? I’m not clear.
But it sure seems valuable to someone building an AI to do ML research, to have a benchmark that will tell you where you can improve.
But clearly other people think differently than me.
Not representative of the motivations of all people for all types of evals, but https://www.openphilanthropy.org/rfp-llm-benchmarks/, https://www.lesswrong.com/posts/7qGxm2mgafEbtYHBf/survey-on-the-acceleration-risks-of-our-new-rfps-to-study, https://docs.google.com/document/d/1UwiHYIxgDFnl_ydeuUq0gYOqvzdbNiDpjZ39FEgUAuQ/edit, and some posts in https://www.lesswrong.com/tag/ai-evaluations seem relevant.
I’m unable to open the google docs file in the third link.
Sorry, fixed
I think the core argument is “if you want to slow down, or somehow impose restrictions on, AI research and deployment, you need some way of defining thresholds. Also, most policymakers’ crux appears to be that AI will not be a big deal, but if they thought it was going to be a big deal they would totally want to regulate it much more. Therefore, having policy proposals that can use future eval results as a triggering mechanism is politically more feasible, and also epistemically helpful, since it allows people who do think it will be a big deal to establish a track record”.
I find these arguments reasonably compelling, FWIW.
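To make the “eval results as a triggering mechanism” idea concrete, here is a minimal sketch; every benchmark name, threshold, and obligation in it is hypothetical, invented only to show the if-then shape such proposals tend to have.

```python
# Minimal sketch (hypothetical names throughout) of an eval result acting as a
# policy trigger: an obligation fires only once a reported benchmark score
# crosses a threshold that was committed to in advance.

from dataclasses import dataclass

@dataclass
class TriggerRule:
    benchmark: str      # e.g. a hypothetical "autonomous_ml_rd" eval
    threshold: float    # score agreed on before any results exist
    obligation: str     # what the developer must do once the threshold is crossed

RULES = [
    TriggerRule("autonomous_ml_rd", 0.50, "third-party audit before further scaling"),
    TriggerRule("bio_uplift", 0.20, "restrict model weights and notify regulator"),
]

def triggered_obligations(scores: dict[str, float]) -> list[str]:
    """Return the obligations whose benchmark thresholds the reported scores cross."""
    return [r.obligation for r in RULES
            if scores.get(r.benchmark, 0.0) >= r.threshold]

# Example with a hypothetical eval report:
print(triggered_obligations({"autonomous_ml_rd": 0.62, "bio_uplift": 0.05}))
# -> ['third-party audit before further scaling']
```

The point is just that the thresholds are fixed in advance, so the disagreement about whether AI will be a big deal gets deferred to a measurable future observation rather than argued about now.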
I think it would be good for more people to explicitly ask political staffers and politicians the question: “What hypothetical eval result would change your mind if you saw it?”
I think a lot of the evals are more targeted towards convincing tech workers than convincing politicians.
My sense is political staffers and politicians aren’t that great at predicting their future epistemic states this way, and so you won’t get great answers for this question. I do think it’s a really important one to model!
I believe the actual answer is “when it starts automating everything in the real world.”
Perhaps the reasoning is that the AGI labs already have all kinds of internal benchmarks of their own, no external help needed, but progress on those benchmarks isn’t public knowledge. Creating and open-sourcing such benchmarks, then, only lets society orient better to the capabilities progress taking place, and so make better-informed decisions, without significantly advantaging the AGI labs.
At the very least, evals for automated ML R&D should be a very decent proxy for when it might be feasible to automate very large chunks of prosaic AI safety R&D.
I think I saw someone arguing that their particular capability benchmark was good for evaluating the capability, but of limited use for training the capability because their task only covered a small fraction of that domain.
In case you didn’t read Paul’s reasoning.
What will you do if nobody makes a successful case?
Be sad.
You probably don’t mean dangerous capabilities evals, right? I mean, I do feel hesitant even about those. I would really not want someone using my work on WMDP to increase their model’s ability to make bioweapons.
In Connor Leahy’s recent interview on Trajectory he argues that scientists making evals are being “used” as tools by the AI corporations, in a similar way to how cancer researchers were used by cigarette companies to sow confusion about whether cigarettes cause cancer.
With bioweapons evals, at least, the profit motive of AI companies is aligned with the common interest; a big benefit of your work comes when companies use it to improve their product. I’m not at all confused about why people would think this is useful safety work, even if I haven’t personally hashed out the cost/benefit to any degree of confidence.
I’m mostly confused about ML / SWE / research benchmarks.
I’m not sure but I have a guess. A lot of “normies” I talk to in the tech industry are anchored hard on the idea that AI is mostly a useless fad and will never get good enough to be useful.
They laugh off any suggestion that the trends point towards rapid improvements that could end in superhuman abilities, and they completely dismiss arguments that AI might be used to build better AI. ‘Feed the bots their own slop and they’ll become even dumber than they already are!’
So, people who do believe that the trends are meaningful, and that we are near a dangerous threshold, want some kind of proof to show the doubters. They want people to start taking this seriously before it’s too late.
I do agree that the targeting of benchmarks by capabilities developers is totally a thing. The doubting Thomases of the world are also standing in the way of the capabilities folks getting the cred and funding they desire. A benchmark designed specifically to convince doubters is a perfect tool for… convincing doubters who might then fund you and respect you.
I’m really getting annoyed by AI safety people drawing analogies to things that had far more evidence behind them than the AI risk field has ever had; the same thing happens with comparisons to climate change.
Capabilities benchmarks can be highly useful in safety applications. You raised a great example with ML benchmarks. Strong ML R&D capabilities lie upstream of many potential risks:
Labs may begin automating research, which could shorten timelines.
These capabilities may increase proliferation risks of techniques used to develop frontier models.
In the extremes, these capabilities may increase the risk of uncontrolled recursive self-improvement.
Labs, governments, and everyone else involved should have an accurate understanding of where the capabilities frontier lies to enable good decision making. The only quantitatively rigorous way of doing that is with good benchmarks.
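As a toy illustration of what that quantitative tracking buys you, here is a sketch with entirely made-up numbers: it fits a trend to scores on a hypothetical ML R&D eval and extrapolates when the trend would cross a capability threshold. Real progress is rarely this linear, but without a benchmark series there is nothing to extrapolate from at all.

```python
# Toy sketch: fit a linear trend to (hypothetical, made-up) scores on an
# ML R&D eval and extrapolate when it would cross a capability threshold.
# (statistics.linear_regression requires Python 3.10+)

from statistics import linear_regression

# Hypothetical data: (years since first measurement, benchmark score in [0, 1])
years  = [0.0, 0.5, 1.0, 1.5, 2.0]
scores = [0.12, 0.18, 0.27, 0.35, 0.46]

slope, intercept = linear_regression(years, scores)

THRESHOLD = 0.80  # hypothetical "can automate large chunks of ML R&D" level
years_to_threshold = (THRESHOLD - intercept) / slope

print(f"Trend: {slope:.2f} score/year; threshold reached in ~{years_to_threshold:.1f} years")
```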
Capabilities progress is not bottlenecked on benchmarks telling model developers where they could improve, and adding more benchmarks is extremely unlikely to make any significant difference to that progress.
Therefore, I think having more capabilities benchmarks is a good thing: it can greatly increase our understanding of model capabilities without making much of a difference to timelines. However, if you are interested in doing safety work, building capabilities benchmarks is probably not the most effective thing you could be doing.