The lottery ticket hypothesis, as I (vaguely) understand it, is that artificial neural networks tend to work in the following way: When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as not to interfere.
[EDIT: This understanding goes beyond what the original paper proved; it draws from things proved (or allegedly proved) in later papers. See thread below. EDIT EDIT: Daniel Filan has now convinced me that my understanding of the LTH as expressed above was importantly wrong, or at least importantly goes-beyond-the-evidence.]
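For reference, what the original paper (Frankle & Carbin 2019) actually demonstrates is narrower: train the full network, prune the smallest-magnitude weights, rewind the surviving weights to their initial values, and retrain; the resulting sparse "winning ticket" often trains to accuracy comparable to the full network. Below is a minimal sketch of that procedure; the toy data, tiny architecture, one-shot 80% pruning, and re-masking after every step are my own illustrative simplifications, not the paper's setup.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy binary-classification data, standing in for a real dataset.
X = torch.randn(512, 20)
y = (X[:, 0] + X[:, 1] > 0).long()

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
init_state = copy.deepcopy(model.state_dict())  # save the original initialization

def train(net, steps=300, mask_fn=None):
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(net(X), y).backward()
        opt.step()
        if mask_fn is not None:
            mask_fn(net)  # keep pruned weights at zero (a simplification of freezing them)
    return (net(X).argmax(dim=1) == y).float().mean().item()

# 1. Train the full network.
full_acc = train(model)

# 2. Keep only the largest-magnitude 20% of each weight matrix (by trained magnitude).
masks = {}
for name, p in model.named_parameters():
    if p.dim() > 1:  # prune weight matrices, leave biases alone
        k = int(0.8 * p.numel())
        threshold = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = (p.detach().abs() > threshold).float()

def apply_masks(net):
    with torch.no_grad():
        for name, p in net.named_parameters():
            if name in masks:
                p.mul_(masks[name])

# 3. Rewind the surviving weights to their initial values and zero out the rest.
model.load_state_dict(init_state)
apply_masks(model)

# 4. Retrain the sparse "winning ticket".
ticket_acc = train(model, mask_fn=apply_masks)

print(f"full network accuracy:   {full_acc:.3f}")
print(f"winning ticket accuracy: {ticket_acc:.3f}")
```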
By the scaling hypothesis I mean that in the next five years, many other architectures besides the transformer will also be shown to get substantially better as they get bigger. I’m also interested in defining it differently, as whatever Gwern is talking about.
The implication depends on the distribution of lottery-ticket quality. If a bigger network is, roughly, a bigger pile of tickets, then scaling up amounts to drawing more tickets, and what matters is how fast the best ticket improves with the number of draws. If the distribution is short-tailed, then the rewards of scaling will be relatively small; bigger would still get better, but very slowly. A long-tailed distribution, on the other hand, would suggest continued returns to getting more lottery tickets.
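To see why the tail matters, here's a toy simulation under the (strong) assumption that a bigger network is just a bigger pile of independent tickets and that performance tracks the best ticket drawn. The specific distributions (a normal for short-tailed, a Pareto with tail index 1.5 for long-tailed) are arbitrary placeholders, not a claim about the real distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 20                                   # average over a few draws to reduce noise
network_sizes = [10**k for k in range(2, 7)]  # number of tickets, standing in for network size

for n in network_sizes:
    # Short-tailed ticket quality: the best of n draws grows only like sqrt(2 ln n).
    best_short = np.mean([rng.normal(size=n).max() for _ in range(trials)])
    # Long-tailed ticket quality (Pareto, tail index 1.5): the best draw grows like n^(1/1.5).
    best_long = np.mean([rng.pareto(1.5, size=n).max() for _ in range(trials)])
    print(f"tickets = {n:>9,}   best (short-tailed) = {best_short:5.2f}   "
          f"best (long-tailed) = {best_long:12.1f}")
```

In the short-tailed case the best ticket improves only slightly with each extra order of magnitude of "network size"; in the long-tailed case it keeps improving by large factors.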
I ask a question here about what’s true in practice.