Thinking about it a little more, there may be a good reason to consider how humans pursue mid-horizon goals.
I think I do make a goal of answering Paul’s question. It isn’t a subgoal derived from my primary values (food, status, etc.); backward-chaining from those would be too complex. Instead, the goal is adopted based on a vague estimate of the value (total future reward) of that action in context. I wrote about this in Human preferences as RL critic values—implications for alignment, but I’m not sure how clear that brief post was.
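To make that concrete with a purely hypothetical sketch (mine, not anything from the linked post): the goal gets adopted because a learned critic assigns it a high estimated value in the current context, not because I backward-chained from food or status. The contexts, candidate actions, and numbers below are made-up placeholders standing in for a trained value function.

```python
# Hypothetical critic: maps (context, candidate action) -> rough estimate of
# total future reward. In a real agent this would be a learned value function,
# not a lookup table; the numbers here are illustrative only.
critic_estimate = {
    ("reading Paul's comment", "answer Paul's question"): 0.7,
    ("reading Paul's comment", "get a snack"): 0.4,
    ("reading Paul's comment", "check the forum for status"): 0.5,
}

def adopt_goal(context, candidates):
    # No backward-chaining from primary values (food, status, ...):
    # just adopt whichever candidate the critic rates highest in this context.
    return max(candidates, key=lambda a: critic_estimate.get((context, a), 0.0))

print(adopt_goal("reading Paul's comment",
                 ["answer Paul's question", "get a snack",
                  "check the forum for status"]))
# -> answer Paul's question
```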
I was addressing a different part of Paul’s comment than the original question. I mentioned that I didn’t have an answer to the question of whether one can make long-range plans without wanting anything. I did try an answer in a separate top-level response:
it doesn’t matter much whether a system can pursue long-horizon tasks without wanting, because agency is useful for long-horizon tasks, and it’s not terribly complicated to implement. So AGI will likely have it built in, whether or not it would emerge from adequate non-agentic training. I think people will rapidly agentize any oracle system. It’s useful to have a system that does things for you. And to do anything more complicated than answer one email, the user will be giving it a goal that may include instrumental subgoals.
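To gesture at how little scaffolding “agentizing an oracle” takes, here is a minimal, purely hypothetical sketch: a loop that hands the oracle a user-supplied goal, lets it propose next steps (including instrumental subgoals), and executes them. `ask_oracle` and `execute` are assumed stand-ins for an LLM call and a tool layer, not any real API.

```python
def ask_oracle(prompt: str) -> str:
    """Stand-in for a call to an oracle/LLM that only answers questions."""
    raise NotImplementedError  # an API call in a real scaffold

def execute(step: str) -> str:
    """Stand-in for a tool layer that carries out a proposed step."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 10) -> None:
    # The user supplies a goal; the scaffold repeatedly asks the oracle for
    # the next step or subgoal and executes it, which is all it takes to turn
    # a question-answerer into something that pursues the goal.
    history = []
    for _ in range(max_steps):
        step = ask_oracle(
            f"Goal: {goal}\nProgress so far: {history}\n"
            "What is the single next step (or subgoal)? Reply DONE if finished."
        )
        if step.strip() == "DONE":
            return
        history.append((step, execute(step)))
```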
The possibility of emergent wanting might still be important in an agent scaffolded around a foundation model.
Perhaps I’m confused about the scenarios you’re considering here. I’m less worried about LLMs achieving AGI and developing emergent agency, because we’ll probably give them agency before that happens.