More generally, John Miller and colleagues have found training performance is an excellent predictor of test performance, even when the test set looks fairly different from the training set, across a wide variety of tasks and architectures.
Seems like Figure 1 from Miller et al. is a plot of test performance vs. "out of distribution" test performance. One might expect plots of training performance vs. "out of distribution" test performance to have more spread.
I doubt there would be much difference, and I think the alignment-relevant comparison is between in-distribution but out-of-sample performance and out-of-distribution performance. We can easily do i.i.d. splits of our data; that's not a problem. You might think it's a problem to directly test the model in scenarios where it could legitimately execute a takeover if it wanted to.
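For concreteness, a minimal sketch of such an i.i.d. split (the article list and the 10% test fraction below are just placeholders, not anything from Miller et al.):

```python
import random

def iid_split(examples, test_fraction=0.1, seed=0):
    """Shuffle the examples and hold out a uniformly random fraction as a test set."""
    rng = random.Random(seed)
    shuffled = list(examples)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Hypothetical usage: `articles` is any list of training examples.
articles = [f"article {i}" for i in range(1000)]
train, test = iid_split(articles, test_fraction=0.1)
```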
Taking i.i.d. samples can actually be hard. Suppose you train an LLM on news articles, and each important real-world event has 10 basically identical articles written about it. Then a random split of the articles will leave the network being tested mostly on the same newsworthy events that were in the training data.
It will pass that test even if it's hopeless at predicting new events and can only generate new articles about the events it has already seen.
When data duplication is extensive, making a meaningful train/test split is hard.
If the data were perfect copy-and-paste duplicates, they could be filtered out, but often things are rephrased a bit.
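One workaround, sketched below under the assumption that a cheap text-similarity measure is good enough to catch rephrasings: group articles whose word n-gram overlap is high (treating them as the same underlying event), then split train/test at the group level rather than the article level. The shingle size and the 0.5 threshold here are illustrative, not tuned values.

```python
def word_ngrams(text, n=3):
    """Set of word n-grams ('shingles') used as a cheap fingerprint of an article."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap of two shingle sets; 1.0 means identical n-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_near_duplicates(articles, threshold=0.5):
    """Greedily put each article into the first existing group it overlaps with.

    Splitting train/test by group instead of by article keeps rephrased copies
    of the same event on one side of the split.
    """
    groups, fingerprints = [], []
    for text in articles:
        shingles = word_ngrams(text)
        for i, fp in enumerate(fingerprints):
            if jaccard(shingles, fp) >= threshold:
                groups[i].append(text)
                fingerprints[i] = fp | shingles  # widen the group's fingerprint
                break
        else:
            groups.append([text])
            fingerprints.append(shingles)
    return groups
```

Feeding whole groups, rather than individual articles, into a random split keeps every rephrasing of an event on the same side. The greedy pairwise comparison here is O(n²) and only meant to show the idea; at web scale one would use something like MinHash/LSH deduplication instead.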
Fair enough for the alignment comparison; I was just hoping you could maybe correct the quoted paragraph to say "performance on the hold-out data" or something similar.
(The reason to expect more spread would be that training performance can't detect overfitting, but performance on the hold-out data can. I'm guessing some of the nets trained in Miller et al. did indeed overfit, specifically the ones with lower performance.)
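(For concreteness, a toy sketch of that point, with `score` standing in for whatever metric is being plotted: the training score alone can look fine for an overfit model, and only the gap against hold-out data reveals the overfitting.)

```python
def train_holdout_gap(model, train_set, holdout_set, score):
    """Return (train score, hold-out score, gap between them).

    An overfit net keeps a high training score while its hold-out score drops,
    so a large positive gap is exactly the signal that a training-performance
    axis cannot show but a hold-out axis can.
    """
    train_score = score(model, train_set)
    holdout_score = score(model, holdout_set)
    return train_score, holdout_score, train_score - holdout_score
```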
I would actually like this to be done sometime in the future, but I'm okay with focusing on other things for now.
(Specifically, the training performance vs. out-of-distribution test performance experiment, especially on more realistic neural nets.)
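If someone does run it, a minimal sketch of that experiment (the `models` collection, the three evaluation sets, and the `evaluate` function are all hypothetical placeholders): score each trained model on its training set, an i.i.d. hold-out set, and an out-of-distribution set, then check whether training performance tracks OOD performance more loosely than hold-out performance does.

```python
import statistics  # statistics.correlation requires Python 3.10+

def spread_comparison(models, train_set, holdout_set, ood_set, evaluate):
    """Score each model on train / hold-out / OOD data and compare correlations.

    `evaluate(model, dataset)` is assumed to return a scalar performance number.
    Returns (corr(train, ood), corr(holdout, ood)); if the first is clearly
    lower, the train-vs-OOD plot has the extra spread conjectured above.
    """
    rows = [(evaluate(m, train_set), evaluate(m, holdout_set), evaluate(m, ood_set))
            for m in models]
    train_scores, holdout_scores, ood_scores = zip(*rows)
    return (statistics.correlation(train_scores, ood_scores),
            statistics.correlation(holdout_scores, ood_scores))
```

Scatter plots of the same three columns would give the train-vs-OOD counterpart of the Figure 1 plot discussed above.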