I had an idea when reading it that I think is pretty interesting. You mention that both models that grok a small amount of data repeated many times and models trained on a great deal of data end up highly general. Repeated data during training is also mentioned as a significant negative for large models. These observations seem very much in tension.
My idea is this: split the training data into two parts, one vastly larger than the other. First, train a model on the small part many times, in a way designed to make it grok the task, such as with weight decay. Second, train it on the rest of the very large amount of data. Third, compare it to a model trained on both parts without repeats and see how they differ. (I don’t know if people have done this. I’m definitely a layman when it comes to such things.)
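Concretely, I imagine something like this (a throwaway PyTorch sketch of the setup I have in mind; the toy data, model, and hyperparameters are placeholders I made up, not anything from the post):

```python
# Throwaway sketch of the two-phase experiment (toy data, model and
# hyperparameters are made-up placeholders).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, ConcatDataset, random_split

# Toy stand-in for a real task: classify whether a vector's entries sum to > 0.
X = torch.randn(100_000, 32)
y = (X.sum(dim=1) > 0).long()
small, large = random_split(TensorDataset(X, y), [1_000, 99_000])

def make_model():
    return nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 2))

def train(model, dataset, epochs, weight_decay=0.0):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(dataset, batch_size=256, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    return model

# Phase 1: many repeats over the small split, with strong weight decay to push towards grokking.
grokked = train(make_model(), small, epochs=1_000, weight_decay=1.0)
# Phase 2: a single pass over the large split.
grokked = train(grokked, large, epochs=1)

# Baseline: one pass over everything, no repeats, to compare against.
baseline = train(make_model(), ConcatDataset([small, large]), epochs=1)
```

The interesting comparison would then be held-out loss or accuracy for the two-phase model versus the baseline.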
Models are often ‘fine-tuned’ on things later. You could see this as sort of the opposite, where we tune it for the task first, and then train it.
To be clear, the paper I cite on data quality focuses on how repeated data is bad for generalisation. From the model’s perspective, the only thing it cares about is train loss (and maybe simplicity), and repeated data is great for train loss! The model doesn’t care whether it generalises, only whether generalisation is a “more efficient” solution. Grokking happens when there’s just enough data that the model marginally prefers the correct solution, but there’s no reason to expect the amount of repeated data that screws models up to be exactly the amount at which the correct solution wins out.
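To make the “more efficient” intuition slightly more concrete (this is my framing of the standard weight-decay story, not something taken from the data-quality paper): with weight decay the model is effectively trading off train loss against parameter norm, roughly

\[
\mathcal{L}_{\text{total}}(\theta) = \mathcal{L}_{\text{train}}(\theta) + \lambda \lVert \theta \rVert_2^2 .
\]

Memorisation keeps train loss low but its norm cost grows with the number of distinct points memorised, while a generalising circuit pays a roughly fixed norm cost, so it only wins once there’s enough data for that tradeoff to flip.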
Though the fact that larger models are messed up by fewer repeated data points is fascinating—I don’t know if this is a problem with my hypothesis, or just a statement about the relative complexity of different circuits in larger vs smaller models.
Your experiment idea is interesting; I’m not sure what I’d expect to happen! I’d love to see someone try it, and am not aware of anyone who has (the paper I cite is vaguely similar: there they train the model on the repeated data and unrepeated data shuffled together, and compare it to a model trained on just the unrepeated data).
Though I do think that if this is a real task, there wouldn’t be a single amount of data that leads to general grokking; rather, the amount of data needed to grok would vary heavily between different circuits.
What do you consider a real task? There are all sorts of small but important tasks that are unlikely to need a large neural network or excessive amounts of data. If you can split a complicated task into a bunch of simple ones, but can’t actually solve it with the knowledge and approaches you have, then each piece could be simple enough for a neural network to reach a generalised understanding, and simple enough for this phenomenon to be obvious, couldn’t it? Yet such a network could then be composed with the networks for the other pieces to get a valuable result. (This would perhaps allow rapid prototyping of things like programs by people who aren’t experts on the task, for instance.)
I don’t have any real examples at the moment, so I could be wrong. It might be interesting to test on something we do understand, like sorting lists. That would have the advantage of being simple enough that you might even be able to pull off the trick of reverse engineering what algorithm gets used, too. It would be trivial to create the data as well. The disadvantage is that it probably wouldn’t lead anywhere useful in itself.
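Something like this would do for the data (list length and value range are arbitrary choices, just for illustration):

```python
import random

def sorting_example(length=8, max_value=64):
    """One (input, target) pair: a random list and its sorted version."""
    xs = [random.randrange(max_value) for _ in range(length)]
    return xs, sorted(xs)

# A small dataset of input/output pairs for the toy sorting task.
dataset = [sorting_example() for _ in range(10_000)]
```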