One way or another, we’d try to use the most relevant dataset first.
Otherwise known as “underfitting”...
Maybe you do this, but me, and many people in ML, do our best to avoid ever doing that. Transfer learning powers the best and highest-performing models. Even in pure supervised learning, you train on the largest dataset possible, and then finetune. And that works much better than training on just the target task. You cannot throw a stick in ML today without observing this basic paradigm.
I know, let’s take a dataset of 2d images of cars and their 3d rendering and train the model on that first.
There are GAN papers, among others, which do pretty much this for inferring models & depth maps.
But that’s just because the hard part, the training, is already done.
No. You don’t do it ‘just’ to save computation. You do it because it learns superior representations and generalizes better on less data. That finetuning is a lot cheaper is merely convenient.
Given that your motivating analogy to machine learning is comprehensively wrong, perhaps you should rethink this essay.
TL;DR: Please provide references so that I can give a more cohesive reply. See the papers below, plus my reasoning and explanation of why you are basically wrong, and/or confusing things that work in RL with things that work in SL, and/or confusing techniques used to train with scarce data for ones that would also work when the data is large enough that compute becomes the bottleneck (which is the case I'm arguing for, i.e. that compute should first be thrown at the most relevant data).
Maybe you do this, but me, and many people in ML, do our best to avoid ever doing that. Transfer learning powers the best and highest-performing models. Even in pure supervised learning, you train on the largest dataset possible, and then finetune. And that works much better than training on just the target task. You cannot throw a stick in ML today without observing this basic paradigm.
I would ask for a citation on that.
Nowhere in the ML literature have I heard of people training models on datasets other than the one they want to solve as a more efficient alternative to training on that dataset itself. Of course, given extra time once you have converged on your own data, training on related data can be helpful; but my point is just that training on the actual data is the first approach one takes (obviously, depending on the size of the problem, you might start with weight transfer directly).
People transfer weights all the time, but that’s because it shortens training time.
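To be concrete about the kind of weight transfer I mean, here is a minimal PyTorch sketch (assuming a reasonably recent torchvision; the 10-class head and `target_loader` are hypothetical placeholders, not anything from the article):

```python
import torch
from torch import nn
from torchvision import models

# Start from weights learned on a large, less related dataset (ImageNet).
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Swap the classification head for the problem-specific one (hypothetical 10 classes).
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# Freeze everything except the new head, which is what makes finetuning cheap.
for name, param in backbone.named_parameters():
    param.requires_grad = name.startswith("fc.")

optimizer = torch.optim.Adam(
    (p for p in backbone.parameters() if p.requires_grad), lr=1e-3
)
criterion = nn.CrossEntropyLoss()

def finetune_one_epoch(model, target_loader):
    # target_loader is a hypothetical DataLoader over the problem-specific dataset.
    model.train()
    for images, labels in target_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```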
New examples of unrelated (or less-related) data do not make a model converge faster on validation data, assuming you could instead create a new example of problem-specific data.
In theory it could make the model generalize better, but when I say “in theory” I mean it in the layman's sense, since research on this topic is hard and there is precious little of it in supervised learning.
The most rigorous research on this topic seems to be in RL, e.g. https://arxiv.org/pdf/1909.01331.pdf, and even there it's nowhere near clear cut.
Out of the research that seems to apply better to SL, I find this paper to be the most rigorous and up to date: https://openreview.net/pdf?id=ryfMLoCqtQ … and its findings, like those of literally any other paper on the subject by a respected team or university, boil down to:
“Sometimes it helps with generalization on the kind of data not present in the training set, and sometimes it just results in a shittier model; it depends a lot on the SNR of the data the model was trained on relative to the data you are training it for now.”
There are GAN papers, among others, which do pretty much this for inferring models & depth maps.
Again, links to papers please. My bet is that the GAN papers do this:
a) Because they lack 3d renderings of the objects they want to create.
b) Because they lack 3d renderings of most of the objects they want to create.
c) Because they are trying to showcase an approach that generalizes to classes of data not available at training time (i.e. showing that a car 3d-rendering model can generalize to producing 3d renderings of glasses, not that it can outperform one specifically trained to generate 3d renderings of glasses).
If one can achieve better results with unrelated data than with related data in similar compute time (i.e. training until either model has converged on a validation set, or for a predefined period of time), or even if one can achieve better results by training on unrelated data *first* and then on related data rather than vice versa… I will eat my metaphorical hat and retract this whole article. (Provided both models use appropriate regularization, or at least that the relevant-data model does; otherwise I can see a hypothetical where a bit of high-noise data serves as a form of regularization, but even this I would think highly unlikely.)
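To make the experiment I would accept concrete, here is a rough sketch of the comparison I have in mind, with a wall-clock budget as a crude compute control. Everything here (the tiny model, the synthetic loaders, the one-minute budget) is a made-up stand-in, not a claim about any particular result:

```python
import time
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins so the sketch runs on its own; in practice these would be
# the problem-specific dataset, the less-related dataset, and a held-out split.
def random_loader(n=512, dim=32, classes=10):
    x = torch.randn(n, dim)
    y = torch.randint(0, classes, (n,))
    return DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

related_loader = random_loader()
unrelated_loader = random_loader()
val_loader = random_loader()

def make_model(dim=32, classes=10):
    return nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, classes))

def train_for(model, loader, seconds, lr=1e-3):
    """Train on `loader` for a fixed wall-clock budget (crude compute matching)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    deadline = time.time() + seconds
    model.train()
    while time.time() < deadline:
        for x, y in loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
            if time.time() >= deadline:
                break
    return model

@torch.no_grad()
def accuracy(model, loader):
    model.eval()
    correct = total = 0
    for x, y in loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

budget = 60.0  # seconds per strategy; a real test would use a much larger budget

# Strategy A: the whole budget goes into the problem-specific data.
model_a = train_for(make_model(), related_loader, seconds=budget)

# Strategy B: pretrain on less-related data first, then finetune on the target data.
model_b = train_for(make_model(), unrelated_loader, seconds=budget / 2)
model_b = train_for(model_b, related_loader, seconds=budget / 2)

print("A (related data only):     ", accuracy(model_a, val_loader))
print("B (pretrain then finetune):", accuracy(model_b, val_loader))
```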
No. You don’t do it ‘just’ to save computation. You do it because it learns superior representations and generalizes better on less data. That finetuning is a lot cheaper is merely convenient.
Again, see my answers above, and please provide relevant citations if you wish to claim the contrary. It seems to me that what you are saying here goes against common sense: given a choice between problem-specific data and less-related data, your claim is that at some point using the less-related data is superior.
A charitable reading of this is that introducing noise into the training data helps generalization (see e.g. techniques involving adding noise to the training data, L2 regularization, and dropout). That seems somewhat true, but far from true on many tasks, and I invite you to experiment with it and realize it doesn't really apply to everything, nor are the effect sizes large, unless you are specifically focusing on adversarial examples or on datasets where the training set covers only a minute portion of the potential data.
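For clarity, these are the regularizers I'm referring to, in a minimal sketch (layer sizes and the noise level are arbitrary choices of mine):

```python
import torch
from torch import nn

# Dropout baked into the model.
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 10),
)

# L2 regularization via weight decay on the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(x, y, noise_std=0.1):
    # Gaussian noise injected into the inputs, the third technique mentioned above.
    x_noisy = x + noise_std * torch.randn_like(x)
    optimizer.zero_grad()
    loss = criterion(model(x_noisy), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call on a random batch, just to show the expected shapes.
train_step(torch.randn(8, 32), torch.randint(0, 10, (8,)))
```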