More generally, John Miller and colleagues have found that training performance is an excellent predictor of test performance across a wide variety of tasks and architectures, even when the test set looks fairly different from the training set.
Counterdatapoint to [training performance being an excellent predictor of test performance]: in this paper, GPT-3 was fine-tuned to multiply “small” (e.g., 3-digit by 3-digit) numbers, but that ability didn’t generalize to multiplying bigger numbers.