FWIW, that’s not a crux for me. I can totally see METR’s agency-horizon trend continuing, such that 21 months later, the SOTA model beats METR’s 8-hour tests. What I expect is that this won’t transfer to real-world performance: you wouldn’t be able to plop that model into a software engineer’s chair, prompt it with the information in the engineer’s workstation, and get one workday’s worth of output from it.
At least, not reliably and not in the general-coding setting. It’s possible this sort of performance would be achieved in some narrow domains, and that it would happen once in a while on any given task. (Indeed, I think that’s already the case?) And I do expect nonzero extension of general-purpose real-world agency horizons. But what I expect is slower growth, with real-world performance increasingly lagging behind performance on the agency-horizon benchmark.
Indeed, and maintaining this release schedule is a bit impressive. Though note that “a model called o4 is released” and “the pace of progress from o1 to o3 is maintained” are slightly different claims. Hopefully the release is combined with a proper report on o4 (not just o4-mini), so we get actual data on how well RL-on-CoTs scales.