Is a language model performing utility maximization during training?
Let’s ignore RLHF for now and just focus on next-token prediction. There’s an argument that of course the LM is maximizing a utility function—namely, its log score on predicting the next token, over the distribution of all text on the internet (or whatever it was trained on). An immediate reaction I have is that this isn’t really what we want, even setting aside that we want the text to be useful (as most internet text isn’t).
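To make "log score" concrete, here is a toy sketch of the next-token objective: the loss on one prediction is the negative log probability the model assigned to the token that actually came next. The vocabulary and probabilities below are made up for illustration.

```python
import math

def next_token_loss(probs: dict[str, float], actual_next: str) -> float:
    # Negative log probability of the observed next token:
    # confident correct predictions give small loss, confident
    # wrong ones give large loss.
    return -math.log(probs[actual_next])

# Hypothetical model output for "the quick brown ___":
probs = {"fox": 0.6, "dog": 0.3, "car": 0.1}
loss_good = next_token_loss(probs, "fox")  # high probability -> small loss
loss_bad = next_token_loss(probs, "car")   # low probability -> large loss
```

Training "maximizes utility" only in the sense of minimizing the average of this quantity over the training distribution.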
This is clearly related to all the problems around overfitting. My understanding is that in practice, this is addressed through a combination of regularization and early stopping: halting training once loss on held-out test data stops decreasing. So even if a language model were a utility maximizer during training, we already have some guardrails on it. Are they enough?
What exactly is being done—what type of thing is being created—when we run a process like “use gradient descent to minimize a loss function on training data, as long as the loss function is also being minimized on test data”?
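That process can be sketched as a loop. This is a hedged, minimal sketch, not a real training loop: `train_step` and `eval_loss` are hypothetical stand-ins for a gradient-descent update and a held-out loss evaluation.

```python
def train_with_early_stopping(train_step, eval_loss, max_steps=1000, patience=3):
    # Keep minimizing training loss, but stop once held-out (test) loss
    # fails to improve for `patience` consecutive checks.
    best = float("inf")
    bad_checks = 0
    for step in range(max_steps):
        train_step()           # one gradient-descent update on training data
        current = eval_loss()  # loss on held-out test data
        if current < best:
            best = current
            bad_checks = 0
        else:
            bad_checks += 1
            if bad_checks >= patience:  # test loss has stopped decreasing
                return step + 1  # steps actually taken
    return max_steps
```

The odd thing the question points at: the object this loop produces is not "the minimizer of the training loss" but "whatever the parameters happened to be when the held-out loss plateaued," which is a different kind of thing.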