If a researcher was given 1000X more data, 1000X CPU power, would he switch to a brute-force approach? I did not see the connection between “data and computation power” and the brute-force models.
A simple toy model: we roll a pair of dice many, many times. If we have a sufficiently large amount of data and computational power, then we can brute-force fit the distribution of outcomes—i.e. we can count how many times each pair of numbers is rolled, estimate the distribution of outcomes based solely on that, and get a very good fit to the distribution.
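Here's a minimal sketch of what that brute-force fit might look like (hypothetical code, not from the original; it just counts outcomes with numpy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate rolling a pair of fair six-sided dice a huge number of times.
n_rolls = 1_000_000
rolls = rng.integers(1, 7, size=(n_rolls, 2))

# Brute-force fit: count every (die 1, die 2) outcome and normalize.
counts = np.zeros((6, 6))
np.add.at(counts, (rolls[:, 0] - 1, rolls[:, 1] - 1), 1)
joint_estimate = counts / n_rolls

# With this much data the estimate is very close to the true joint
# (1/36, roughly 0.028, in every cell), with no structural assumptions at all.
print(joint_estimate.round(3))
```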
By contrast, if we have only a small amount of data/compute, we need to be more efficient in order to get a good estimate of the distribution. We need a prior which accounts for the fact that there are two dice whose outcomes are probably roughly independent, or that the dice are probably roughly symmetric. Leveraging that model structure is more work for the programmer—we need to code that structure into the model, and check that it’s correct, and so forth—but it lets us get good results with less data/compute.
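A sketch of the structured alternative, under the assumption that the two dice are independent and roughly symmetric (the pseudocount `alpha` stands in for the symmetry prior and is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# This time we only have a small amount of data.
n_rolls = 60
rolls = rng.integers(1, 7, size=(n_rolls, 2))  # two fair dice

# Structured model: assume the dice are independent, so we estimate two
# 6-entry marginals (12 numbers) instead of a 36-entry joint.
# A "roughly symmetric" prior enters as pseudocounts (Laplace smoothing),
# pulling each marginal toward uniform when data is scarce.
alpha = 1.0  # pseudocount; purely illustrative
p_die1 = (np.bincount(rolls[:, 0] - 1, minlength=6) + alpha) / (n_rolls + 6 * alpha)
p_die2 = (np.bincount(rolls[:, 1] - 1, minlength=6) + alpha) / (n_rolls + 6 * alpha)

# Reconstruct the joint from the marginals via the independence assumption.
joint_estimate = np.outer(p_die1, p_die2)
print(joint_estimate.round(3))
```

With far fewer parameters to estimate, this model gets a decent fit from a few dozen rolls, where the brute-force count over 36 cells would still be noisy.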
So naturally, given more data/compute, people will avoid that extra modelling/programming work and lean towards more brute-force models—especially if they’re just measuring success by fit to their data.
But then, the distribution shifts—maybe one of the dice is swapped out for a weighted die. Because our brute-force model has no internal structure, it doesn't have a way to re-use its information. It doesn't have a model of “two dice”, it just has a model of “distribution of outcomes”—there's no notion of some outcomes corresponding to the same face on one of the two dice. But the more principled model does have that internal structure, so it can naturally re-use the still-valid subcomponents of the model when one subcomponent changes.
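A sketch of how the structured model handles the shift, assuming we know which die was swapped (the weights below are arbitrary, chosen just to illustrate a biased die):

```python
import numpy as np

rng = np.random.default_rng(2)

# Distribution shift: die 1 is unchanged, die 2 is swapped for a die
# weighted toward 6.
weights = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
n_new = 30  # only a little post-shift data
new_rolls = np.column_stack([
    rng.integers(1, 7, size=n_new),                       # same old die 1
    rng.choice(np.arange(1, 7), size=n_new, p=weights),   # swapped die 2
])

# The structured model keeps its previously-learned marginal for die 1
# and re-fits only the marginal for the swapped die.
p_die1_old = np.full(6, 1 / 6)  # still-valid subcomponent
p_die2_new = np.bincount(new_rolls[:, 1] - 1, minlength=6) / n_new

joint_estimate = np.outer(p_die1_old, p_die2_new)
print(joint_estimate.round(3))

# The brute-force joint model has no such decomposition: it would have to
# re-estimate all 36 cells from the small post-shift sample alone.
```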
Conversely, additional data/compute doesn’t really help us make our models more principled—that’s mainly a problem of modelling/programming which currently needs to be handled by humans. To the extent that generalizability is the limiting factor to usefulness of models, additional data/compute alone doesn’t help much—and indeed, despite the flagship applications in vision and language, most of today’s brute-force-ish deep learning models do generalize very poorly.