In that case, the thing in the paper must be a typo, because the “Tool Use” graph here is clearly >0 reward, even for the 1G agent.
It could be that the "Tool Use" in the graph is the “Tool Use Gap” task rather than the “Tool Use Climb” task, but they don’t specify which anywhere I could easily find.