I think your comment illustrates my point. You’re describing current systems and their properties, then implying that these properties will stay the same as we push goal-directedness up to human level. But you haven’t said anything about why the goal-directedness doesn’t affect all the nice tool-like properties.
> don’t see any obvious reason to expect much more cajoling to be necessary
It’s the difference in levels of goal-directedness. That’s the reason.
> For example, I’m pretty optimistic about 1.8 million years of MATS-graduate-level work building on top of other MATS-graduate-level work
I’m not completely sure what happens when you try this, but there seem to be two main options. Either you’ve got a small civilization of goal-directed human-level agents, who have their own goals and need to be convinced to solve someone else’s problems. And then, to solve those problems, they need to be given freedom and time to learn and experiment, gaining sixty thousand lifetimes’ worth of skills along the way.
Or you’ve got a large collection of not-quite-agents that aren’t really capable of directing research but will often complete a well-scoped task if given it by someone who understands their limitations. Now your bottleneck is human research leads (presumably doing agent foundations). That’s a rather small resource, so your speedup isn’t massive but only moderate, and you’re on a time limit and didn’t put much effort into getting a head start.
> You’re describing current systems and their properties, then implying that these properties will stay the same as we push goal-directedness up to human level. But you haven’t said anything about why the goal-directedness doesn’t affect all the nice tool-like properties.
I think the goal-directedness framing has been unhelpful for predicting AI progress (especially LLM progress), will probably keep being so at least in the near term, and is plausibly net-negative when it comes to alignment research progress. E.g., where exactly would you place the goal-directedness in Sakana’s AI agent? If I really had to pick, I’d probably say something like ‘the system prompt’, but system prompts are pretty transparent, so as long as that holds, it seems like we’ll be in ‘pretty easy’ worlds w.r.t. alignment. I still think something like control and other safety/alignment measures are important, but the fact that currently-shaped scaffolds are pretty transparent seems to me like a very important and often neglected point.
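To make the “goal lives in the system prompt” point concrete, here is a minimal sketch of how a currently-shaped scaffold typically injects its objective. This is an illustrative toy, not Sakana’s actual code, and `call_llm` is a hypothetical placeholder for whatever model API the scaffold uses.

```python
# Illustrative toy scaffold (not Sakana's actual implementation): the closest
# thing to a "goal" is a plain-text system prompt, re-sent on every model call,
# which whoever runs the scaffold can read, log, and diff at will.

SYSTEM_PROMPT = """You are an automated ML researcher.
Goal: propose an experiment, run it, and write up the results."""

def call_llm(system_prompt: str, user_message: str) -> str:
    """Hypothetical placeholder for whatever model API the scaffold uses."""
    raise NotImplementedError

def run_agent_step(task: str, transcript: list[str]) -> str:
    # The "goal-directedness" enters here, in clear text; auditing it amounts
    # to reading (or diffing) SYSTEM_PROMPT, not interpreting model internals.
    context = "\n".join(transcript + [task])
    return call_llm(SYSTEM_PROMPT, context)
```

The transparency claim in the comment above is about this kind of setup: whatever steering the scaffold does sits in inspectable text rather than hidden inside the weights.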
> I’m not completely sure what happens when you try this, but there seem to be two main options. Either you’ve got a small civilization of goal-directed human-level agents, who have their own goals and need to be convinced to solve someone else’s problems. And then, to solve those problems, they need to be given freedom and time to learn and experiment, gaining sixty thousand lifetimes’ worth of skills along the way.
If by goal-directed you mean something like ‘context-independent goal-directedness’ (e.g. changing the system prompt doesn’t affect the behavior much), then this isn’t what I expect SOTA systems to look like, at least in the next 5 years.
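One rough way to cash out ‘context-independent goal-directedness’ as a measurable property, under the reading above, is to check how much behavior shifts when only the system prompt changes. A toy sketch, with `call_llm` and `behavioral_distance` as hypothetical placeholders rather than any existing library’s API:

```python
# Toy sketch: measure how sensitive an agent's behavior is to its system prompt.
# Low sensitivity would suggest the goal is "in the weights" (context-independent);
# high sensitivity suggests the prompt is doing the steering.

def call_llm(system_prompt: str, user_message: str) -> str:
    """Hypothetical placeholder for the model API."""
    raise NotImplementedError

def behavioral_distance(a: str, b: str) -> float:
    """Hypothetical placeholder: any divergence metric over two outputs."""
    raise NotImplementedError

def prompt_sensitivity(system_prompts: list[str], probe_inputs: list[str]) -> float:
    """Average pairwise behavioral divergence across different system prompts."""
    scores = []
    for probe in probe_inputs:
        outputs = [call_llm(p, probe) for p in system_prompts]
        for i in range(len(outputs)):
            for j in range(i + 1, len(outputs)):
                scores.append(behavioral_distance(outputs[i], outputs[j]))
    return sum(scores) / len(scores)
```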
> Or you’ve got a large collection of not-quite-agents that aren’t really capable of directing research but will often complete a well-scoped task if given it by someone who understands their limitations. Now your bottleneck is human research leads (presumably doing agent foundations). That’s a rather small resource, so your speedup isn’t massive but only moderate, and you’re on a time limit and didn’t put much effort into getting a head start.
I am indeed at least somewhat worried about the humans in the loop being a potential bottleneck. But I expect their role to often look more like (AI-assisted) reviewing, rather than necessarily setting (detailed) research directions. Well-scoped tasks seem great whenever they’re feasible, and indeed I expect this to be a factor in which tasks get differentially automated soon (together with, e.g., short task horizons or tasks requiring less compute, so that solutions can be iterated on more cheaply).