Repeated data during training is also mentioned as a significant negative for large models. Those two claims seem very much in tension.
To be clear, the paper I cite on data quality focuses on how repeated data is bad for generalisation. From the model’s perspective, the only thing it cares about is train loss (and maybe simplicity), and repeated data is great for train loss! The model doesn’t care whether it generalises, only whether generalisation is a “more efficient” solution. Grokking happens when the amount of data is such that the model marginally prefers the correct solution, but there’s no reason to expect that the point at which repeated data screws over models is exactly the amount of data at which the correct solution wins out.
Though the fact that larger models are messed up by fewer repeated data points is fascinating—I don’t know if this is a problem with my hypothesis, or just a statement about the relative complexity of different circuits in larger vs smaller models.
Your experiment idea is interesting, I’m not sure what I’d expect to happen! I’d love to see someone try it, and am not aware of anyone who has (the paper I cite is vaguely similar—there they train the model on the repeated data and unrepeated data shuffled together, and compare it to a model trained on just the unrepeated data).
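(For concreteness, here’s a rough sketch of that kind of data mix in Python. The dataset sizes, repeat counts, and names are my own illustrative choices, not the paper’s.)

```python
import random

def make_mixed_dataset(clean_examples, repeated_examples, n_repeats, seed=0):
    """Shuffle once-seen examples together with a heavily repeated subset."""
    mixed = list(clean_examples) + list(repeated_examples) * n_repeats
    random.Random(seed).shuffle(mixed)
    return mixed

# Illustrative sizes only: 10k unique examples plus 100 examples repeated 1000x each.
clean = [f"unique_{i}" for i in range(10_000)]
repeated = [f"repeated_{i}" for i in range(100)]

mixed_train_set = make_mixed_dataset(clean, repeated, n_repeats=1_000)
baseline_train_set = list(clean)  # control: trained on just the unrepeated data

print(len(mixed_train_set), len(baseline_train_set))  # 110000 10000
```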
Though I do think that if this is a real task there wouldn’t be one amount of data that leads to general grokking; rather, the amount of data needed to grok would vary heavily between different circuits.
What do you consider a real task? There are all sorts of small but important tasks that are unlikely to need a large neural network or excessive amounts of data. If you can split a complicated task into a bunch of simple ones, but can’t actually solve it with the knowledge and approaches you have, you could end up with tasks simple enough that a neural network’s generalised understanding of them makes this phenomenon obvious, couldn’t you? Yet each could then be composed with the networks for the other tasks to get a valuable result. (This would perhaps allow rapid prototyping of things like programs by people who aren’t experts on the task, for instance.)
I don’t have any real examples at the moment, so I could be wrong. It might be interesting to test on something we do understand, like sorting lists. This would have the advantage of being simple enough that you might even be able to pull off the trick of reverse engineering what algorithm is used, too. It’s also trivial to create the data. The disadvantage is that it probably wouldn’t lead anywhere useful in itself.
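(As a sketch of how trivially that data could be generated, assuming the task is framed as “unsorted list in, sorted list out”; the list length and value range below are arbitrary choices.)

```python
import random

def make_sorting_example(rng, length=8, max_value=64):
    """One (unsorted list, sorted list) pair; the model maps the first to the second."""
    xs = [rng.randrange(max_value) for _ in range(length)]
    return xs, sorted(xs)

def make_dataset(n_examples, seed=0):
    rng = random.Random(seed)
    return [make_sorting_example(rng) for _ in range(n_examples)]

train = make_dataset(50_000)
print(train[0])  # e.g. ([48, 15, ...], [15, 48, ...]) for this seed
```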