I agree that there are ways to explain the results, and these points from Steven and Thane make sense.
I will note that the models are significantly more reliable at learning in-distribution (i.e., predicting the training set) than they are at generalizing to the evaluations that involve verbalizing the latent state (and answering downstream questions about it). So learning to predict the training set (or inputs very similar to training inputs) does not automatically result in generalization to the verbalized evaluations.
We do see improvement in reliability with GPT-4 over GPT-3.5, but we don’t have enough information to draw any firm conclusions about scaling.