I’m confused about the distributional generalization thing. Why is that different from minimizing log loss? The loss function (for the base network, not RL-finetuning) is computed based on the logits, not on the temperature-0 sample, right? So a calibrated probability distribution should minimize loss.
The paper explains it better than I can, but essentially: if I give you an imbalanced labeling problem, where 60% are A and 40% are B, and I remove all the actual features and just replace them with noise, the Bayes-optimal thing to do is output A every time, but in fact large neural networks will learn to output A 60% of the time and B 40% of the time even in that setting.
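To make the gap concrete, here is a minimal sketch of the arithmetic for that 60/40 setup. It is not taken from the paper; the function names and the assumption that predictions are independent of the true label (because the features are pure noise) are mine. It shows that a calibrated probability of 0.6 does minimize expected log loss, while the hard-label outputs can still differ: always picking the likelier label gets 60% accuracy, whereas emitting A 60% of the time gets only 52%.

```python
import math

P_A = 0.6  # assumed marginal frequency of label A; features carry no information


def expected_log_loss(q, p=P_A):
    """Expected log loss of always predicting P(A) = q when the true rate is p."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))


def accuracy_argmax(p=P_A):
    """Accuracy of always outputting the more likely label (Bayes-optimal for 0-1 loss)."""
    return max(p, 1 - p)


def accuracy_sampling(p=P_A):
    """Accuracy of outputting A with probability p, independently of the true label."""
    return p * p + (1 - p) * (1 - p)


# Search a coarse grid of constant predictions for the log-loss minimizer.
grid = [i / 100 for i in range(1, 100)]
best_q = min(grid, key=expected_log_loss)

print(f"log-loss minimizer on the grid: q = {best_q:.2f}")
print(f"accuracy, always output A:      {accuracy_argmax():.2f}")
print(f"accuracy, output A 60% of time: {accuracy_sampling():.2f}")
```

So the two claims are compatible: the logits can be calibrated and log-loss-optimal, and the surprising part is only about what the hard classifications look like on noise inputs.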
I’m skeptical of all of those proposed markers of agentic behavior. Being able to predict what an agent would say, when prompted, is different from being an agent in the sense that causes concern (although it certainly lets some actor build an agent using the predictive model as a prior on policies). What we’d see if an LLM were “secretly” an agent is that it would deviate from being a predictive model, in ways that systematically steered towards some goal. Just outputting “I want money” is weaksauce evidence for agency, especially if it’s the sort of thing a predictive model would output and also doesn’t actually steer the world towards some goal we could impute to the network.
Yes, I agree: these markers mostly don’t test whether the model is a predictor (though that’s not entirely true; I do think the delta in markers of agency between different training regimes is a useful datapoint there). Primarily, however, what they do test is, if it is a predictor, how agentic is the thing that it is predicting. And I think that’s extremely important, since we really want to avoid predictive models that are simulating potentially malign agents.
Thanks for the reply, that makes sense.