For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be a much shorter-horizon task than e.g. coming up with a new conceptual alignment agenda or a new ML architecture (and should have much faster feedback loops than e.g. training a SOTA LM using a new architecture).
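To make concrete what a short-horizon step of that kind might look like, here's a minimal, purely illustrative sketch (no particular SAE library assumed): given a sparse autoencoder's decoder matrix and a vector for the semantic direction of interest, rank features by cosine similarity. All arrays and sizes below are hypothetical stand-ins.

```python
# Hypothetical sketch (not from any cited work): ranking sparse-autoencoder
# features by how well their decoder directions align with a target semantic
# direction. The decoder matrix and direction are random stand-ins here.
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 4096, 768                        # toy, GPT-2-small-ish sizes
decoder = rng.standard_normal((n_features, d_model))   # stand-in SAE decoder rows
semantic_direction = rng.standard_normal(d_model)      # stand-in target direction


def top_matching_features(decoder, direction, k=10):
    """Indices and scores of the k decoder rows most cosine-similar to `direction`."""
    dec_unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    dir_unit = direction / np.linalg.norm(direction)
    sims = dec_unit @ dir_unit
    top = np.argsort(-sims)[:k]
    return top, sims[top]


indices, scores = top_matching_features(decoder, semantic_direction)
for i, s in zip(indices, scores):
    print(f"feature {i}: cosine similarity {s:.3f}")
```

In practice the direction might come from a probe or from contrasting activations, and one would then check the top features against max-activating examples; the point is just that each such step is small and quick to verify.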
Related, from Advanced AI evaluations at AISI: May update:

Short-horizon tasks (e.g., fixing a problem on a Linux machine or making a web server) were those that would take less than 1 hour, whereas long-horizon tasks (e.g., building a web app or improving an agent framework) could take over four (up to 20) hours for a human to complete.
[...]
The Purple and Blue models completed 20-40% of short-horizon tasks but no long-horizon tasks. The Green model completed less than 10% of short-horizon tasks and was not assessed on long-horizon tasks. We analysed failed attempts to understand the major impediments to success. On short-horizon tasks, models often made small errors (like syntax errors in code). On longer horizon tasks, models devised good initial plans but did not sufficiently test their solutions or failed to correct initial mistakes. Models also sometimes hallucinated constraints or the successful completion of subtasks.
Summary: We found that leading models could solve some short-horizon tasks, such as software engineering problems. However, no current models were able to tackle long-horizon tasks.
For example, to the degree that typical probing / activation steering work often involves short, roughly 1-hour horizons, it might be differentially automatable soon; e.g. from Steering GPT-2-XL by adding an activation vector:
For example, we couldn’t find a “talk in French” steering vector within an hour of manual effort.
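For readers unfamiliar with the technique, here is a minimal sketch of the kind of activation-addition loop being iterated on manually. It is not the original post's code; the model size, layer, contrast prompts, and coefficient below are illustrative stand-ins (the original work steered GPT-2-XL).

```python
# Simplified activation-addition-style steering, in the spirit of
# Steering GPT-2-XL by adding an activation vector (not the original code):
# take the mean residual-stream difference between two contrast prompts at
# one layer and add a scaled copy at every position while generating.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"          # the original post steered GPT-2-XL
layer, coeff = 6, 4.0        # illustrative, untuned choices

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def mean_resid_after_block(prompt: str) -> torch.Tensor:
    """Mean hidden state after transformer block `layer` for a prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[layer + 1].mean(dim=1)   # hidden_states[i+1] is block i's output


# Contrast pair defining a (hypothetical) "talk in French" direction.
steering_vec = mean_resid_after_block("Je parle français.") - mean_resid_after_block(
    "I speak English."
)


def add_steering(module, inputs, output):
    # GPT2Block normally returns a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * steering_vec
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden


handle = model.transformer.h[layer].register_forward_hook(add_steering)
try:
    ids = tok("Let me tell you about my day:", return_tensors="pt").input_ids
    out = model.generate(
        ids, max_new_tokens=40, do_sample=True, top_p=0.9,
        pad_token_id=tok.eos_token_id,
    )
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

The main simplification relative to the original method is adding a single averaged difference vector at every position, rather than aligning per-token differences from a padded prompt pair, and without any sweep over layers or coefficients.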