if your model gets more sample-efficient as it gets larger & n gets larger, it’s because it’s increasingly approaching a Bayes-optimal learner and so it gets more out of the more data, but then when you hit the Bayes-limit, how are you going to learn more from each datapoint? You have to switch over to a different and inferior scaling law. You can’t squeeze blood from a stone; once you approach the intrinsic entropy, there’s not much to learn.
I found this confusing. It sort of seems like you’re assuming that a Bayes-optimal learner achieves the Bayes error rate (are you?), which seems wrong to me.
What do you mean “the Bayes-limit”? At first, I assumed you were talking about the Bayes error rate (https://en.wikipedia.org/wiki/Bayes_error_rate), but that is (roughly) the error you could expect to achieve with infinite data, and we’re still talking about finite data.
What do you mean “Bayes-optimal learner”? I assume you just mean something that applies Bayes’ rule exactly (so its predictions depend on the prior and the data).
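For concreteness, here is the distinction I have in mind (my reading, not necessarily yours):

```latex
% My working definitions (assumptions on my part, not necessarily yours):

% Bayes error rate: the irreducible error of a predictor that knows the true
% conditional p(y | x) -- effectively the infinite-data limit.
\mathrm{Err}_{\mathrm{Bayes}} \;=\; \mathbb{E}_{x}\!\left[\, 1 - \max_{y}\, p(y \mid x) \,\right]

% Bayes-optimal learner: exact Bayesian updating on the finite dataset
% \mathcal{D}_n actually seen, predicting with the posterior predictive.
p(y \mid x, \mathcal{D}_n) \;=\; \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D}_n)\, \mathrm{d}\theta
```

On that reading, a Bayes-optimal learner’s error is generally above the Bayes error rate at finite n and only approaches it as n goes to infinity, which is why conflating the two (if that’s what’s happening) confuses me.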
I’m confused by you talking about “approach[ing] the intrinsic entropy”… it seems like the figure in the OP shows L(C) approaching L(D). But is L(D) supposed to represent the intrinsic entropy? Should we trust it as an estimate of the intrinsic entropy?
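For reference, if the curves in the OP’s figure are the power-law fits from Kaplan et al. (2020) (my assumption about what is being plotted), they look roughly like this:

```latex
% Pure power-law fits (no additive constant), as reported in Kaplan et al. 2020:
L(D) \;\approx\; \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) \;\approx\; \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}
% with fitted exponents of roughly \alpha_D \approx 0.095 and \alpha_C^{\min} \approx 0.050.
% Since there is no additive term, L(D) -> 0 as D -> \infty instead of
% flattening out at an entropy floor.
```

Since the fitted L(D) has no irreducible-loss term, it isn’t obvious to me that its value (or its intersection with L(C)) should be trusted as an estimate of the intrinsic entropy of text, rather than just as an extrapolation of the range of D that was actually measured.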
I also don’t see how active learning is supposed to help (unless you’re talking about actively generating data)… I thought the whole point you were trying to make is that once you reach the Bayes error rate there’s literally nothing you can do to keep improving without more data. You talk about using active learning to throw out data-points… but I thought the problem was not having enough data? So how is throwing out data supposed to help with that?
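Just to make my confusion about the active-learning point concrete: the only version of “active learning to throw out data-points” I can picture is something like the sketch below (all names and numbers are hypothetical, purely for illustration), where you pick, from a larger pool, the k examples the current model is most uncertain about and spend a fixed training budget on those.

```python
# A minimal sketch of what I *imagine* is meant by "active learning to throw
# out data-points": under a fixed training budget, score a larger candidate
# pool by the current model's uncertainty and keep only the most informative
# examples. Toy logistic-regression setup so it runs stand-alone; nothing here
# comes from the OP.

import numpy as np

rng = np.random.default_rng(0)

def predict_proba(w, X):
    """Sigmoid predictions of a linear model."""
    return 1.0 / (1.0 + np.exp(-X @ w))

def entropy(p):
    """Predictive entropy, a standard uncertainty score for active learning."""
    eps = 1e-12
    return -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

# Toy pool of 10,000 labelled points in 5 dimensions.
X_pool = rng.normal(size=(10_000, 5))
true_w = rng.normal(size=5)
y_pool = (predict_proba(true_w, X_pool) > 0.5).astype(float)

# Current (weak) model, e.g. from a small warm-up run.
w = rng.normal(size=5) * 0.1

# "Throw out data-points": keep only the k examples the model is most
# uncertain about, and spend the fixed training budget on those.
k = 1_000
scores = entropy(predict_proba(w, X_pool))
keep = np.argsort(scores)[-k:]          # indices of the k most informative points
X_train, y_train = X_pool[keep], y_pool[keep]

print(f"kept {len(keep)} of {len(X_pool)} points for training")
```

That seems to buy something when compute or labelling is the scarce resource and the pool is bigger than what you can afford to train on; but if the pool itself is the bottleneck, discarding points doesn’t obviously help, which is the part I’m stuck on.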