The system card also contains some juicy regressions hidden within the worst-graph-of-all-time in the SWE-Lancer section:
If you can cobble together the will to work through the inane color scheme, it is very interesting to note that while the expected RL-able IC tasks show improvements, the Manager improvements are far less uniform; in particular, o1 (and 4o!) remains the stronger performer vs. o3 when weighted by the (ahem, controversial) $$$-based metric. And this is all within the technical domain that the SWE-Lancer Manager benchmark represents: (essentially) system design, with verifiable answers.
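For the unfamiliar: SWE-Lancer weights by real freelance payouts, dollars earned rather than tasks passed, so a single expensive miss drags the score far harder than it would drag a plain pass rate. A minimal sketch of that weighting, with hypothetical payouts and results rather than the published data:

```python
# Minimal sketch of a payout-weighted benchmark score, in the spirit of
# SWE-Lancer's $$$ weighting. Task payouts and pass/fail results here are
# hypothetical placeholders, not the actual harness or its data.
tasks = [
    {"payout_usd": 250,  "passed": True},
    {"payout_usd": 1000, "passed": False},
    {"payout_usd": 500,  "passed": True},
]

earned = sum(t["payout_usd"] for t in tasks if t["passed"])
total = sum(t["payout_usd"] for t in tasks)

# Note how one high-payout failure dominates the weighted score even
# while the unweighted pass rate looks respectable.
print(f"pass rate:  {sum(t['passed'] for t in tasks) / len(tasks):.0%}")
print(f"$-weighted: {earned / total:.0%}  (${earned:,} of ${total:,})")
```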
So the finite set of activated weights is likely being cannibalized further away from pre-training generality, towards the increasingly evident fragility of RL-ed tasks. However, I feel it is also decent evidence on the perennial question of activated-weight size re: o1 vs. o3, and that o3 is not yet the model designed to consume the extensive (yet expensive) shared world-size capacity of OAI's shiny new NVL72 racks.
As a separate aside, it was amusing setting o3 off to work on re-graphing this data into a sane format, and observing the RL-ed tool-use fragility: it took 50+ attempts over 15 minutes of repeatedly failed panning, cropping, and zooming operations before it diligently worked out an accurate data extraction, but work it out it did. Inference scaling in action!
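For anyone curious, the target format itself is nothing exotic once the numbers are out; a minimal matplotlib sketch of the grouped-bar layout I wanted, with placeholder values rather than the extracted data:

```python
import matplotlib.pyplot as plt
import numpy as np

# The "sane format": one group per model, one bar per task type.
# Values are illustrative placeholders, not the numbers o3 extracted.
models = ["4o", "o1", "o3"]
ic_scores = [20, 35, 45]       # hypothetical IC SWE pass rates (%)
manager_scores = [40, 47, 44]  # hypothetical Manager pass rates (%)

x = np.arange(len(models))
width = 0.35

fig, ax = plt.subplots()
ax.bar(x - width / 2, ic_scores, width, label="IC SWE")
ax.bar(x + width / 2, manager_scores, width, label="Manager")
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.set_ylabel("Pass rate (%)")
ax.set_title("SWE-Lancer (illustrative values only)")
ax.legend()
plt.show()
```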