Short-horizon tasks (e.g., fixing a problem on a Linux machine or making a web server) were those that would take less than 1 hour, whereas long-horizon tasks (e.g., building a web app or improving an agent framework) could take over four (up to 20) hours for a human to complete.
[...]
The Purple and Blue models completed 20-40% of short-horizon tasks but no long-horizon tasks. The Green model completed less than 10% of short-horizon tasks and was not assessed on long-horizon tasks. We analysed failed attempts to understand the major impediments to success. On short-horizon tasks, models often made small errors (like syntax errors in code). On longer horizon tasks, models devised good initial plans but did not sufficiently test their solutions or failed to correct initial mistakes. Models also sometimes hallucinated constraints or the successful completion of subtasks.
Summary: We found that leading models could solve some short-horizon tasks, such as software engineering problems. However, no current models were able to tackle long-horizon tasks.
I’m particularly interested in what the framework might say about the order in which the various capabilities that are prerequisites for automated AI safety R&D might appear, and how that order compares with the arrival of various dangerous capabilities. In particular, for each horizon t, I’d want to make sure we’re ‘eating all the free energy’ of all the t-horizon prerequisite capabilities for automated AI safety R&D.
Some evidence in favor of the framework comes from “Advanced AI evaluations at AISI: May update” (the passage quoted above).