So, on ~28% of cases (70% * 40%), the strong student is wrong by “overfitting to weak supervision”.
Attributing all of these errors to overfitting implies that, if there were no overfitting, the strong student would get 100% accuracy on the subset where the weak model is wrong. But we have no reason to expect that. Instead, these errors are some mixture of overfitting and “just being dumb.”
Note that we should expect the strong and weak models to make somewhat correlated errors even when both are trained on gold labels, i.e. in the hypothetical case where overfitting to weak supervision is not possible. (The task examples vary in difficulty, the two models have various traits in common that could lead to shared “quirks,” etc.)
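To make that concrete, here is a toy simulation (a sketch with made-up numbers, not anything from the paper). Each example gets a shared difficulty, and each model errs independently given that difficulty, so neither model ever sees the other's labels. The function name and the `weak_scale` / `strong_scale` parameters are hypothetical, chosen only for illustration.

```python
import random

def simulate_gold_trained_errors(n_items=100_000, weak_scale=0.6,
                                 strong_scale=0.3, seed=0):
    """Toy model: both models are 'trained on gold labels', so overfitting
    to weak supervision is impossible by construction. Errors are independent
    given an example's difficulty, but that difficulty is shared."""
    rng = random.Random(seed)
    weak_wrong = strong_wrong = both_wrong = 0
    for _ in range(n_items):
        difficulty = rng.random()  # shared per-example difficulty in [0, 1)
        w = rng.random() < weak_scale * difficulty    # weak model errs
        s = rng.random() < strong_scale * difficulty  # strong model errs
        weak_wrong += w
        strong_wrong += s
        both_wrong += w and s
    p_strong_wrong = strong_wrong / n_items
    p_strong_wrong_given_weak_wrong = both_wrong / weak_wrong
    return p_strong_wrong, p_strong_wrong_given_weak_wrong

marginal, conditional = simulate_gold_trained_errors()
print(f"P(strong wrong)              = {marginal:.3f}")
print(f"P(strong wrong | weak wrong) = {conditional:.3f}")
```

Under these assumed parameters the strong model's error rate comes out at about 0.15 overall and about 0.20 on the subset where the weak model is wrong. So the strong model is noticeably worse on the weak model's errors even though no overfitting is possible here.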
And indeed, when the weak and strong models use similar amounts of compute, they make very similar predictions. We see this in the upper-leftmost points on each line, which are especially noticeable in Fig 8c. In this regime, the hypothetical “what if we trained the strong model on gold labels?” is ~equivalent to the weak model, so ~none of the strong model's errors here can be attributed to “overfitting to weak supervision.”
As the compute ratio grows, the errors become both less frequent and less correlated. That’s the main trend we see in 8b and 8c. This reflects the strong model growing more capable, and thus making fewer “just being dumb” errors.
Fig 8 doesn’t provide enough information to determine how much the strong model is being held back by weak supervision at higher ratios, because it doesn’t show strong-trained-on-gold performance. (Fig. 3 does, though.)
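For reference, one way to quantify "held back" once you do have the strong-trained-on-gold number is a gap-recovered ratio: the fraction of the weak-to-ceiling accuracy gap that the weakly supervised student closes. A minimal sketch follows; the function name and the example accuracies are hypothetical, not values read off any figure.

```python
def performance_gap_recovered(weak_acc, weak_to_strong_acc, strong_gold_acc):
    """Fraction of the (strong-on-gold minus weak) accuracy gap that the
    weakly supervised strong student recovers. 1.0 means weak supervision
    cost nothing; 0.0 means the student is no better than its supervisor."""
    gap = strong_gold_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong-on-gold accuracy must exceed weak accuracy")
    return (weak_to_strong_acc - weak_acc) / gap

# Hypothetical accuracies, for illustration only.
print(performance_gap_recovered(weak_acc=0.60,
                                weak_to_strong_acc=0.70,
                                strong_gold_acc=0.80))
```

The point is that the third argument is exactly the quantity Fig 8 leaves out, which is why agreement rates alone can't answer the question.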
IMO the strongest reason to be skeptical of (the relevance of) these results is in Appendix E, where they show that the strong model overfits a lot when it can easily predict the weak errors.