I have a few potential criticisms of this paper. I think my criticisms are probably wrong and the paper’s conclusion is right, but I’ll just put them out there:
1. Nearly half the tasks in the benchmark take 1 to 30 seconds (the ones from the SWAA set). According to the fitted task time <> P(success) curve, most tested LLMs should be able to complete those with high probability, so they don’t provide much independent signal.
   However, I expect the task time <> P(success) curve would look largely the same if you excluded the SWAA tasks.
2. SWAA tasks take humans 1 to 30 seconds and HCAST tasks take 1 minute to 30 hours; the two sets don’t overlap. If HCAST tasks are harder than SWAA tasks for LLMs, then a regression will indicate that LLMs are getting better at longer tasks when really they’re just getting better at HCAST tasks.
   I think this criticism is wrong—if it were true, the across-dataset correlation between time and LLM-difficulty should be higher than the within-dataset correlation, but from eyeballing Figure 4 (page 10), it looks like it’s not higher (or at least not much). (A toy simulation of this confound is sketched at the end of this comment.)
3. The benchmark tasks could have a bias where longer tasks are more difficult in general (not just because they’re longer). I haven’t looked through all the HCAST tasks (in fact I couldn’t find where they were listed), but Figure 16 on page 29 shows that humans had lower success rates on longer tasks. As example tasks, the paper gives, among others, “Research simple factual information from Wikipedia” = 1 minute and “Write a Python script to transform JSON data” = 56 minutes (page 6). I think a more comparable 56-minute task would be something like “find some factual information that’s buried in a long article”, which I believe even a GPT-3-era LLM would perform well on.
   I don’t know enough about the tasks to know whether this criticism is correct. My uneducated guess is that there’s a true positive relationship between task length and (non-length-related) task difficulty, but that if you adjusted for this, you’d still see an exponential trend in task time <> P(success), and the curve would just be dampened a bit.
   The authors also suspect that longer tasks might be more difficult, and “[i]f this is the case, we may be underestimating the pace of model improvement.” I think it would mean we’re underestimating the pace of improvement on hard tasks, while simultaneously overestimating the pace of improvement on long tasks.
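To make criticism 2 concrete, here’s a toy simulation (my own construction, nothing from the paper; all numbers are made up) in which success depends only on which dataset a task comes from, yet a pooled regression still finds a strong time <> P(success) relationship because the two datasets occupy non-overlapping time ranges:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two datasets with non-overlapping time ranges, like SWAA (1-30 s) and HCAST (1 min - 30 h).
swaa_times = rng.uniform(1, 30, n)             # seconds
hcast_times = rng.uniform(60, 30 * 3600, n)    # seconds

# Success depends only on dataset membership, not on time within a dataset.
swaa_success = (rng.random(n) < 0.9).astype(float)
hcast_success = (rng.random(n) < 0.3).astype(float)

def r2(x, y):
    """R^2 of a simple linear regression of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1 - resid.var() / y.var()

pooled_t = np.log(np.concatenate([swaa_times, hcast_times]))
pooled_s = np.concatenate([swaa_success, hcast_success])

print(f"pooled R^2:     {r2(pooled_t, pooled_s):.2f}")                   # clearly positive
print(f"SWAA-only R^2:  {r2(np.log(swaa_times), swaa_success):.2f}")     # ~0
print(f"HCAST-only R^2: {r2(np.log(hcast_times), hcast_success):.2f}")   # ~0
```

If the real data looked like this, the pooled fit would look fine while the within-dataset fits were flat, which is why comparing the two seems like the right check.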
It [the across-dataset correlation] is much higher. I’m not sure how/if I can post images of the graph here, but the R^2 for SWAA only is 0.27, HCAST only is 0.48, and RE-bench only is 0.01.
Also, HCAST R^2 goes down to 0.41 if you exclude the 21/97 data points where the human time source is an estimate. I’m not really sure why these are included in the paper—it seems bizarre to me to extend these any credence.
I think “human time to complete” is a poor proxy for what they’re actually measuring here, and a lot of it is explained by which types of tasks are included at each time length. For example, doubling or quadrupling the amount of time a human would need to write a script that transforms JSON data (by adding a lot more fields without making the fields much more complex) doesn’t seem to affect success rates nearly as much as this paper would predict.
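To put rough numbers on “as much as this paper would predict”: the paper fits (as I understand it) a logistic curve in log task time, so here is an illustrative version of that curve. The 1-hour 50% horizon and the slope below are made-up numbers for the sake of the example, not METR’s fitted values:

```python
import numpy as np

def p_success(minutes, horizon_min=60.0, slope=1.0):
    """Logistic-in-log2(time) success curve of the same general shape the paper
    fits. horizon_min (time at 50% success) and slope are illustrative only."""
    return 1.0 / (1.0 + np.exp(-slope * (np.log2(horizon_min) - np.log2(minutes))))

# The 56-minute JSON-transformation task, then doubled and quadrupled.
for m in (56, 112, 224):
    print(f"{m:>3} min -> predicted P(success) = {p_success(m):.2f}")
```

Under those assumed parameters the predicted success rate drops from roughly 0.5 to 0.3 to 0.1 across the doublings. If, in practice, adding more JSON fields barely changes the success rate while multiplying the human time, that’s the sense in which human time looks like a poor proxy here.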
Note that the REBench correlation definitionally has to be 0 because all tasks have the same length. SWAA similarly has range restriction, though not as severe.
Well, the REBench tasks don’t all have the same length, at least in the data METR is using. It’s all tightly clustered around 8 hours though, so I take your point that it’s not a very meaningful correlation.
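For anyone who wants to see the range-restriction point in isolation, here’s a quick toy example (all numbers invented, nothing from the paper): the same underlying log-time trend gives a much lower R^2 when you only look at a narrow slice of the time range.

```python
import numpy as np

rng = np.random.default_rng(1)

# One underlying relationship: y depends linearly on log task time, plus noise.
log_t = rng.uniform(np.log(1), np.log(30 * 3600), 5000)   # 1 second to 30 hours
y = 0.5 * log_t + rng.normal(0.0, 2.0, log_t.size)

def r2(x, y):
    """Squared Pearson correlation = R^2 of a simple linear regression."""
    return np.corrcoef(x, y)[0, 1] ** 2

narrow = log_t <= np.log(30)   # only the 1-30 second slice (SWAA-like range)
print(f"full-range R^2:   {r2(log_t, y):.2f}")                   # ~0.4
print(f"narrow-slice R^2: {r2(log_t[narrow], y[narrow]):.2f}")   # much lower
# With x essentially constant (RE-Bench's tight cluster around 8 hours),
# the correlation is undefined or close to meaningless.
```

The slope is the same in both cases; the drop comes purely from the smaller spread in task times.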
Thanks, that’s useful info!
I thought you could post images by dragging and dropping files into the comment box, I seem to recall doing that in the past, but now it doesn’t seem to work for me. Maybe that only works for top-level posts?
Maybe you switched to the Markdown editor at some point. It still works in the (default) WYSIWYG editor.
Regarding 1 and 2, I basically agree that SWAA doesn’t provide much independent signal. The reason we made SWAA was that models before GPT-4 got ~0% on HCAST, so we needed shorter tasks to measure their time horizon. 3 is definitely a concern and we’re currently collecting data on open-source PRs to get a more representative sample of long tasks.
Re: HCAST tasks, most are being kept private since it’s a benchmark. If you want to learn more, here’s METR’s paper on HCAST.