I am trying to wrap my head around the high-level implications of this statement. I can come up with two interpretations:
What LLMs are doing is similar to what people do as they go about their day. When I walk down the street, I am simultaneously using visual and other input to assess the state of the world around me (“that looks like a car”), running a world model based on that assessment (“the car is coming this way”), and then using some other internal mechanism to decide what to do (“I’d better move to the sidewalk”).
What LLMs are doing is harder than what people do. When I converse with someone, I have some internal state, and I run some process in my head – based on that state – to generate my side of the conversation. When an LLM converses with someone, instead of maintaining internal state, needs to maintain a probability distribution over possible states, make next-token predictions according to that distribution, and simultaneously update the distribution.
(2) seems more technically correct, but my intuition dislikes the conclusion, for reasons I am struggling to articulate. …aha, I think this may be what is bothering me: I have glossed over the distinction between input and output tokens. When an LLM is processing input tokens, it is working to synchronize its state to the state of the generator. Once it switches to output mode, there is no functional benefit to continuing to synchronize state (what is it synchronizing to?), so ideally we’d move to a simpler neural net that does not carry the weight of needing to maintain and update a probability distribution of possible states. (Glossing over the fact that LLMs as used in practice sometimes need to repeatedly transition between input and output modes.) LLMs need the capability to ease themselves into any conversation without knowing the complete history of the participant they are emulating, while people have (in principle) access to their own complete history and so don’t need to be able to jump into a random point in their life and synchronize state on the fly.
So the implication is that the computational task faced by an LLM which can emulate Einstein is harder than the computational task of being Einstein… is that right? If so, that in turn leads to the question of whether there are alternative modalities for AI which have the advantages of LLMs (lots of high-quality training data) but don’t impose this extra burden. It also raises the question of how substantial this burden is in practice, in particular for leading-edge models.
You are drawing a distinction between agents that maintain a probability distribution over possible states and those that don’t and you’re putting humans in the latter category. It seems clear to me that all agents are always doing what you describe in (2), which I think clears up what you don’t like about it.
It also seems like humans spend varying amounts of energy on updating probability distributions vs. predicting within a specific model, but I would guess that LLMs can learn to do the same on their own.
As I go about my day, I need to maintain a probability distribution over states of the world. If an LLM tries to imitate me (i.e. repeatedly predict my next output token), it needs to maintain a probability distribution, not just over states of the world, but also over my internal state (i.e. the state of the agent whose outputs it is predicting). I don’t need to keep track of multiple states that I myself might be in, but the LLM does. Seems like that makes its task more difficult?
Or to put an entirely different frame on the the whole thing: the job of a traditional agent, such as you or me, is to make intelligent decisions. An LLM’s job is to make the exact same intelligent decision that a certain specific actor being imitated would make. Seems harder?
I am trying to wrap my head around the high-level implications of this statement. I can come up with two interpretations:
What LLMs are doing is similar to what people do as they go about their day. When I walk down the street, I am simultaneously using visual and other input to assess the state of the world around me (“that looks like a car”), running a world model based on that assessment (“the car is coming this way”), and then using some other internal mechanism to decide what to do (“I’d better move to the sidewalk”).
What LLMs are doing is harder than what people do. When I converse with someone, I have some internal state, and I run some process in my head – based on that state – to generate my side of the conversation. When an LLM converses with someone, instead of maintaining internal state, needs to maintain a probability distribution over possible states, make next-token predictions according to that distribution, and simultaneously update the distribution.
(2) seems more technically correct, but my intuition dislikes the conclusion, for reasons I am struggling to articulate. …aha, I think this may be what is bothering me: I have glossed over the distinction between input and output tokens. When an LLM is processing input tokens, it is working to synchronize its state to the state of the generator. Once it switches to output mode, there is no functional benefit to continuing to synchronize state (what is it synchronizing to?), so ideally we’d move to a simpler neural net that does not carry the weight of needing to maintain and update a probability distribution of possible states. (Glossing over the fact that LLMs as used in practice sometimes need to repeatedly transition between input and output modes.) LLMs need the capability to ease themselves into any conversation without knowing the complete history of the participant they are emulating, while people have (in principle) access to their own complete history and so don’t need to be able to jump into a random point in their life and synchronize state on the fly.
So the implication is that the computational task faced by an LLM which can emulate Einstein is harder than the computational task of being Einstein… is that right? If so, that in turn leads to the question of whether there are alternative modalities for AI which have the advantages of LLMs (lots of high-quality training data) but don’t impose this extra burden. It also raises the question of how substantial this burden is in practice, in particular for leading-edge models.
You are drawing a distinction between agents that maintain a probability distribution over possible states and those that don’t and you’re putting humans in the latter category. It seems clear to me that all agents are always doing what you describe in (2), which I think clears up what you don’t like about it.
It also seems like humans spend varying amounts of energy on updating probability distributions vs. predicting within a specific model, but I would guess that LLMs can learn to do the same on their own.
As I go about my day, I need to maintain a probability distribution over states of the world. If an LLM tries to imitate me (i.e. repeatedly predict my next output token), it needs to maintain a probability distribution, not just over states of the world, but also over my internal state (i.e. the state of the agent whose outputs it is predicting). I don’t need to keep track of multiple states that I myself might be in, but the LLM does. Seems like that makes its task more difficult?
Or to put an entirely different frame on the the whole thing: the job of a traditional agent, such as you or me, is to make intelligent decisions. An LLM’s job is to make the exact same intelligent decision that a certain specific actor being imitated would make. Seems harder?