I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.
Hm, I’m sufficiently surprised at this claim that I’m not sure that I understand what you mean. I’ll attempt a response on the assumption that I do understand; apologies if I don’t:
I think of tools as agents with oddly shaped utility functions. They tend to be conditional in nature.
A common form is a mapping between inputs and outputs that isn’t swayed by anything outside the context of that mapping (which I’ll term “external world states”). You can view a calculator as a coherent agent, but you can’t usefully describe it as a coherent agent with a utility function over world states external to the calculator’s process.
You could use a calculator within a larger system that is describable as a maximizer over a utility function that includes unconditional terms for external world states, but that doesn’t change the nature of the calculator. Draw the box around the calculator within the system? Pretty obviously a tool. Draw the box around the whole system? Not a tool.
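To make the box-drawing concrete, here’s a toy sketch (the names and the little maximizer are purely illustrative, not from any real system):

```python
# The "calculator": a pure mapping from inputs to outputs. Nothing in it refers to
# world states outside that mapping, so describing it with a utility function over
# external world states buys you nothing.
def calculator(expression: str) -> float:
    # Toy arithmetic evaluator; it only ever sees the expression it's handed.
    return float(eval(expression, {"__builtins__": {}}, {}))

# Draw the box around the whole system instead, and you get something that *is*
# usefully described as maximizing an external quantity, with the calculator inside.
def greedy_maximizer(candidate_actions: dict[str, str]) -> str:
    # Picks whichever action the calculator scores highest.
    return max(candidate_actions, key=lambda a: calculator(candidate_actions[a]))

print(greedy_maximizer({"hold": "100 * 1.00", "invest": "100 * 1.07"}))  # -> "invest"
```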
I’ve been using the following two requirements to point at a maximally[1] tool-like set of agents. Together, they compose what I’ve been calling goal agnosticism:
The agent cannot be usefully described[2] as having unconditional preferences about external world states.
Any uniformly random sampling of behavior from the agent has a negligible probability of being a strong and incorrigible optimizer.
Note that this isn’t the same thing as a definition for “tool.” An idle rock uselessly obeys this definition; tools tend to be useful for something. This definition is meant to capture the distinction between things that feel like tools and those that feel like “proper” agents.
To phrase it another way, the intuitive degree of “toolness” falls on a spectrum: the less an agent exhibits unconditional preferences about external world states through its instrumental behavior, the more tool-like it feels.
Notably, most pretrained LLMs with the usual autoregressive predictive loss and a diverse training set are heavily constrained into fitting this definition. Anything equivalent to RL agents trained with sparse/distant rewards is not. RLHF bakes a peculiarly shaped condition into the model; I wouldn’t be surprised if the result no longer strictly obeys the definition, but it’s close enough along the spectrum that it still feels intuitive to call it a tool.
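For concreteness, here’s roughly what I mean by “the usual autoregressive predictive loss.” This is a minimal sketch with an assumed `model(tokens) -> logits` interface; the point is just that the objective only ever scores the conditional (context → next token) mapping:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    """Standard autoregressive cross-entropy: the model is graded solely on its
    conditional prediction of the next token given the prefix it was handed.
    No term in this objective refers to external world states."""
    logits = model(tokens[:, :-1])            # (batch, seq_len - 1, vocab)
    targets = tokens[:, 1:]                   # ground-truth next tokens
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```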
Further, just like in the case of the calculator, you can easily build a system around a goal agnostic “tool” LLM that is not, itself, goal agnostic. Even prompting is enough to elicit a new agent-in-effect that is not necessarily goal agnostic. The ability for a goal agnostic agent to yield non-goal agnostic agents does not break the underlying agent’s properties.[3]
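As a sketch of what I mean by “even prompting is enough”: wrap a goal agnostic predictor in a loop like the one below (all of the callables are stand-ins, not a real API), and the loop-plus-prompt is an agent-in-effect with a goal, even though the predictor underneath is still just doing conditional prediction:

```python
from typing import Callable

def run_agent_in_effect(
    goal_prompt: str,                      # e.g. "You are an agent whose goal is ..."
    predict: Callable[[str], str],         # goal agnostic predictor: context -> next action text
    observe: Callable[[], str],            # reads something from the world
    act: Callable[[str], None],            # pushes on the world
    steps: int = 10,
) -> None:
    context = goal_prompt
    for _ in range(steps):
        context += "\nObservation: " + observe()
        action = predict(context)          # still just conditional prediction...
        act(action)                        # ...but the surrounding system now pursues the prompted goal
        context += "\nAction: " + action
```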
For one critical axis in the toolishness basis, anyway.
Tricky stuff like having a bunch of terms regarding external world states that just so happen to always cancel doesn’t count.
This does indeed sound kind of useless, but I promise the distinction does actually end up mattering quite a lot! That discussion goes beyond the scope of this post. The FAQ goes into more depth.