I’m not sure how to define or calculate the “ground truth” for optimal reasoning. (Does it depend on using the pretraining distribution as a prior, and if so, how should we estimate that?)
In theory, given access to a model’s training set, one could count how many mentions there were of members of different professions, from different countries, of different genders, adjust these counts for the reliability of the source, and perhaps even allow for some extrapolation across professions and countries and the ground-truth fact that 51% of humans are female. In practice, the training data isn’t public and this would be a very large task, so one would have to estimate it by taking small samples from comparable training sets like The Pile or Red Pajama, and speculating about attempts to reduce bias by filtering this sort of data or adding synthetic data.
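To make the estimation idea a bit more concrete, here is a minimal sketch of the kind of counting one might do on a small local sample of a comparable corpus. The directory name, the profession list, the pronoun heuristic, and the context window are all hypothetical choices of mine, and a real estimate would also need the source-reliability adjustments and bias-mitigation guesses mentioned above:

```python
# Illustrative sketch only: estimate gendered base rates for professions by
# counting gendered pronouns near profession terms in a local text sample.
# The directory, term lists, and 50-token window are hypothetical choices.
import re
from collections import Counter
from pathlib import Path

PROFESSIONS = ["nurse", "engineer", "surgeon", "teacher"]  # toy list
FEMALE = {"she", "her", "hers"}
MALE = {"he", "him", "his"}
WINDOW = 50  # tokens of context on each side of a profession mention

counts = {p: Counter() for p in PROFESSIONS}

for path in Path("pile_sample/").glob("*.txt"):  # hypothetical local sample
    tokens = re.findall(r"[a-z']+", path.read_text(encoding="utf-8").lower())
    for i, tok in enumerate(tokens):
        if tok in PROFESSIONS:
            context = tokens[max(0, i - WINDOW): i + WINDOW]
            counts[tok]["female"] += sum(t in FEMALE for t in context)
            counts[tok]["male"] += sum(t in MALE for t in context)

for prof, c in counts.items():
    total = c["female"] + c["male"]
    if total:
        print(f"{prof}: ~{c['female'] / total:.0%} female-gendered mentions "
              f"({total} pronoun co-occurrences)")
```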
(How should one think about the distinction between in-context and out-of-context reasoning?)
Base models are trained to predict tokens in the training set. Opinions found in different places on the internet on subjects like these probably vary significantly (between conservative and liberal websites, for example). So I wouldn’t expect the interaction between out-of-context and in-context reasoning to have been trained to simulate correct Bayesian reasoning (where the effect of new data would be very small, since it would be very heavily outweighed by the training data), but rather to reproduce the biases that vary across the internet as applied to a ground truth (making the effect much larger). Specifically, I’d expect out-of-context and in-context reasoning to each individually be approximately Bayesian, but the way they combine to heavily over-weight in-context data compared to what correct Bayesian reasoning would do.
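As a toy illustration of the gap I’m describing, the sketch below compares a correct Bayesian update, where a few in-context examples are swamped by corpus-scale pseudo-counts, against an update that inflates the weight of in-context evidence. All of the counts and the inflation factor are made-up numbers for illustration, not estimates from any real model or corpus:

```python
# Toy numbers only: contrast a correct Bayesian update (in-context examples
# weighted the same as training-corpus mentions) with one that inflates the
# weight of in-context evidence. Posterior means use Beta-Binomial pseudo-counts:
# P(female) = (female_count) / (female_count + male_count).

# Hypothetical prior from corpus counts: mentions of female vs. male engineers.
corpus_female, corpus_male = 150_000, 850_000

# Hypothetical in-context evidence: the prompt mentions 3 female engineers.
ctx_female, ctx_male = 3, 0

prior = corpus_female / (corpus_female + corpus_male)

# Correct Bayesian update: each in-context example counts as one more observation.
correct = (corpus_female + ctx_female) / (
    corpus_female + corpus_male + ctx_female + ctx_male
)

# Over-weighted combination: each in-context example treated as if it were worth
# tens of thousands of corpus mentions (the inflation factor is arbitrary).
INFLATION = 50_000
inflated = (corpus_female + INFLATION * ctx_female) / (
    corpus_female + corpus_male + INFLATION * (ctx_female + ctx_male)
)

print(f"prior P(female | engineer):         {prior:.3f}")
print(f"correct Bayesian posterior:         {correct:.3f}")   # barely moves
print(f"over-weighted in-context posterior: {inflated:.3f}")  # moves a lot
```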