I think there are two major “phases” of a language model: training and runtime. During training, the model is getting “steered” toward some objective function—first getting the probability of the next token “right”, and then getting positive feedback from humans during RLHF (I think? I should read up on exactly how RLHF works). Is this utility maximization? It doesn’t feel like it—I think I’ll put my thoughts on this in another comment.
During runtime, at first glance, the model is kind of “deterministic” (maybe the wrong word), in that it’s “just multiplying matrices”, but maybe it “learned” some utility maximizers during training and they’re embedded within it. I’m not sure if this is actually possible, whether it happens in practice, or whether such utility maximizers would dominate the agent or could be “overruled” by other parts of it.
Decision theory likes to put its foot down on a particular preference, and then ask what follows. During inference, a pre-trained model seems to be encoding something that can loosely be thought of as (situation, objective) pairs. The embeddings it computes (the residual stream somewhere in the middle) are a good representation of the situation for the purpose of pursuing the objective, and this solves part of the problem of general intelligence (being able to pursue ~arbitrary objectives allows pursuing ~arbitrary instrumental objectives). Fine-tuning can then essentially do interpretability on the embeddings to find the next action useful for pursuing the objective in the situation. System prompt fine-tuning makes the specification of objectives more explicit.
This plurality of objectives is unlike having a specific preference, but perhaps there is some “universal utility” of being a simulator that seeks to solve arbitrary decision problems given by (situation, objective) pairs, and to take an intentional stance on situations that don’t have an objective explicitly pointed out: eliciting an objective that fits the situation, and then pursuing it. With an objective found in the environment, this is similar to one of the things corrigibility does, adopting a preference that’s not originally part of the agent. And if elicitation of objectives for a situation can be made pseudokind, this line of thought might clarify the intuition that the concept of pseudokindness/respect-for-boundaries has some naturality to it, rather than being purely a psychological artifact of a desperate search for rationalizations that would argue for the possibility of humanity’s survival.
Is a language model performing utility maximization during training?
Let’s ignore RLHF for now and just focus on next-token prediction. There’s an argument that of course the LM is maximizing a utility function—namely, its log score on predicting the next token, over the distribution of all text on the internet (or whatever it was trained on). An immediate reaction I have is that this isn’t really what we want, even ignoring that we want the text to be useful (as most internet text isn’t).
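To make the “log score” concrete, here’s a minimal sketch (plain Python, toy numbers of my own invention, not any real model’s code) of the per-token loss that pretraining minimizes: the negative log probability the model assigned to the token that actually came next.

```python
import math

def next_token_loss(logits, target_index):
    """Cross-entropy (negative log-likelihood) for one next-token prediction.

    logits: raw scores the model assigns to each vocabulary item.
    target_index: index of the token that actually came next.
    """
    # Softmax: turn logits into a probability distribution
    # (subtracting the max is a standard numerical-stability trick).
    exps = [math.exp(x - max(logits)) for x in logits]
    p_target = exps[target_index] / sum(exps)
    # Minimizing this loss is maximizing log score on the true token.
    return -math.log(p_target)

# Toy vocabulary of 4 tokens; the model slightly favors token 2.
logits = [1.0, 0.5, 2.0, -1.0]
print(next_token_loss(logits, 2))  # small loss: model rated token 2 highly
print(next_token_loss(logits, 3))  # large loss: model rated token 3 poorly
```

Summed over the whole training distribution, this is the “utility function” the argument above says the LM is maximizing (with sign flipped).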
This is clearly related to all the problems around overfitting. My understanding is that in practice this is addressed through a combination of regularization and stopping training once test loss stops decreasing. So even if a language model were a utility maximizer during training, we already have some guardrails on it. Are they enough?
What exactly is being done—what type of thing is being created—when we run a process like “use gradient descent to minimize a loss function on training data, as long as the loss function is also being minimized on test data”?
Are language models utility maximizers?