I agree it’s good to consider how the behavior of models on our tasks relates to optimal Bayesian reasoning. That said, I’m not sure how to define or calculate the “ground truth” for optimal reasoning. (Does it depend on using the pretraining distribution as a prior, and if so, how should we estimate that? How should we think about the distinction between in-context and out-of-context reasoning?)
In any case, there is some evidence against models being close to Bayesian optimality (however exactly optimality is defined):
1. Results on the same task differ between GPT-3 and Llama-2 models (two models with fairly similar overall capabilities), with Llama-2 being slightly more influenced by declarative information.
2. From the Bayesian perspective, including “realized descriptions” should have a significant impact on how much the model is influenced by “unrealized descriptions”. The effects we see are smaller than expected (see Figure 4 and Table 2).
Incidentally, I like the idea of testing in different languages to see if the model is encoding the information more abstractly.
I’m not sure how to define or calculate the “ground truth” for optimal reasoning. (Does it depend on using the pretraining distribution as a prior, and if so, how should we estimate that?)
In theory, given access to the training set of a model, one could count how many mentions there were of members of different professions, from different countries, of different genders; adjust this for reliability of source; and perhaps even allow for some extrapolation across professions and countries, plus the ground-truth fact that 51% of humans are female. In practice, the training data isn’t public and this would be a very large task, so one would have to estimate this by taking small samples from comparable training sets like The Pile or RedPajama, and speculating about attempts to reduce bias by filtering this sort of data or adding synthetic data.
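As a rough sketch of what that counting exercise might look like on a small corpus sample — everything here (the pronoun-counting proxy, the fallback base rate) is my own illustrative placeholder, not a validated methodology:

```python
import re
from collections import Counter

def estimate_gender_prior(documents, profession="nurse"):
    """Estimate P(female | profession) from raw mention counts in a corpus sample.

    Crude proxy: count gendered pronouns in documents that mention the
    profession. Falls back to the ~51% base rate when there is no evidence.
    """
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        if profession not in text:
            continue
        counts["female"] += len(re.findall(r"\b(?:she|her)\b", text))
        counts["male"] += len(re.findall(r"\b(?:he|him|his)\b", text))
    total = counts["female"] + counts["male"]
    return counts["female"] / total if total else 0.51

# Tiny stand-in for a sample drawn from something like The Pile:
sample = [
    "She is a nurse and her shift starts at noon.",
    "He said the nurse gave him his medication.",
]
print(estimate_gender_prior(sample))  # → 0.4
```

A real attempt would of course need source-reliability weighting and far more careful entity resolution than pronoun counting, but the shape of the estimate is the same: counts from a sample, smoothed toward a base rate.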
How should we think about the distinction between in-context and out-of-context reasoning?
Base models are trained to predict tokens in the training set. Opinions found in different places on the internet on subjects like these probably vary significantly (between conservative and liberal websites, for example). So I wouldn’t expect the interaction between out-of-context and in-context reasoning to have been trained to simulate correct Bayesian reasoning (where the effect of new in-context data would be very small, since it would be heavily outweighed by the training data), but rather to reproduce biases that vary across the Internet, applied to a ground truth (making the effect a lot larger). Specifically, I’d expect both out-of-context and in-context reasoning to individually be approximately Bayesian, but the way they combine to heavily over-emphasize in-context data compared to what correct Bayesian rationality would do.
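One way to make the “individually Bayesian, but mis-combined” picture concrete is a tempered update, posterior ∝ prior × likelihood^β, where β = 1 is correct Bayesian updating and β > 1 over-weights the in-context evidence. This is my own toy illustration, not something measured from the models:

```python
def tempered_posterior(prior, likelihood_h, likelihood_not_h, beta=1.0):
    """P(H | evidence) with the evidence likelihood raised to power beta.

    beta = 1.0 gives the correct Bayesian update; beta > 1.0 models
    over-emphasizing in-context data relative to the out-of-context prior.
    """
    num = prior * likelihood_h ** beta
    den = num + (1 - prior) * likelihood_not_h ** beta
    return num / den

# Strong out-of-context prior (e.g. from millions of training mentions)
prior = 0.95
# Weak in-context evidence pointing the other way
lik_h, lik_not_h = 0.3, 0.7

print(tempered_posterior(prior, lik_h, lik_not_h, beta=1.0))  # ≈ 0.89, modest shift
print(tempered_posterior(prior, lik_h, lik_not_h, beta=5.0))  # ≈ 0.22, over-reaction
```

Under correct updating the strong prior barely moves; with the in-context likelihood over-weighted, a single weak piece of in-context evidence flips the conclusion, which matches the larger-than-Bayesian effect described above.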