In this toy model, is it really the case that the datapoint feature solutions are “more memorizing, less generalizing” than the axis-aligned feature solutions? I don’t feel totally convinced of this.
Well, empirically in this setup, (1) does generalize and get a lower test loss than (2). In fact, it’s the only version that does better than random. 🙂
But I think what you're maybe saying is that from the neural network's perspective, (2) is a very reasonable hypothesis when T < N, regardless of what is true in this specific setup. And you could perhaps imagine other data-generating processes that would look similar for small datasets but generalize differently. I think there's something to that, and it depends a lot on your intuitions about what natural data is like.
Some important intuitions for me are:
Many natural language features are extremely sparse. For example, it seems likely that LLMs have features for particular people, particular street intersections, specific restaurants… Each of these features occurs very rarely (many are probably present in fewer than 1 in 10 million tokens).
Simultaneously, there are an enormous number of features (see above!).
While the datasets aren't actually small, repeated data points effectively put much of the training set in the small-data regime (see Adam's repeated data experiment).
Thus, my intuition is that a setup directionally like this one, with a large number of extremely sparse features, is quite relevant for studying how representations change with dataset size; there's a minimal sketch of that kind of experiment below. But that's all just based on intuition!
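To make that concrete, here's a minimal sketch of the kind of experiment I have in mind. This is not the exact setup from this post: the feature count, sparsity level, architecture, and training details are all illustrative guesses. The idea is just to generate data from many extremely sparse features, train a small tied-weight ReLU autoencoder on T points, and watch test loss as T varies.

```python
# A minimal sketch, not the exact setup from this post: the feature count,
# sparsity level, architecture, and training details are all illustrative.
import torch

def sample_sparse_data(num_points, n_features, p_active=0.01, seed=0):
    # Each feature is independently active with small probability p_active,
    # with a uniform magnitude when active: "many extremely sparse features".
    g = torch.Generator().manual_seed(seed)
    mask = torch.rand(num_points, n_features, generator=g) < p_active
    return mask * torch.rand(num_points, n_features, generator=g)

def train_autoencoder(X, hidden_dim, steps=3000, lr=1e-3):
    # Tied-weight ReLU autoencoder: x_hat = ReLU(W^T W x + b).
    n_features = X.shape[1]
    W = (0.1 * torch.randn(hidden_dim, n_features)).requires_grad_()
    b = torch.zeros(n_features, requires_grad=True)
    opt = torch.optim.Adam([W, b], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((torch.relu(X @ W.T @ W + b) - X) ** 2).mean()
        loss.backward()
        opt.step()
    return W.detach(), b.detach()

N, m = 1000, 32                          # many features, few hidden dims
X_test = sample_sparse_data(10000, N, seed=1)
for T in [16, 64, 256, 1024, 4096]:      # sweep the dataset size
    W, b = train_autoencoder(sample_sparse_data(T, N, seed=2), m)
    test_loss = ((torch.relu(X_test @ W.T @ W + b) - X_test) ** 2).mean()
    print(f"T={T:5d}  test loss={test_loss.item():.5f}")
```

The interesting thing to look at is how the columns of W change as T grows: whether they align with datapoints at small T and with individual features at large T, and where the test loss crossover happens.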
(By the way, I think there is a very deep observation here about the duality of (1) vs (2) and T < N. See the observations about duality in https://arxiv.org/pdf/2210.16859.pdf.)
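One rough numerical way to gesture at that duality (this is my reading, not necessarily the paper's formulation): the T × N data matrix and its transpose share the same singular values, so directions in feature space and directions in datapoint space trade places under transposition, and when T < N there are at most T of them, making the datapoint basis the economical one.

```python
# Hedged sketch: the T x N data matrix and its transpose have identical
# singular spectra, so feature-space and datapoint-space directions are
# exchanged by transposition; when T < N there are at most T of them.
import numpy as np

rng = np.random.default_rng(0)
T, N = 8, 100                                          # T < N regime
X = (rng.random((T, N)) < 0.05) * rng.random((T, N))   # sparse data

U, S, Vt = np.linalg.svd(X, full_matrices=False)       # X   = U S V^T
U2, S2, V2t = np.linalg.svd(X.T, full_matrices=False)  # X^T = V S U^T

print(np.allclose(S, S2))   # True: identical spectrum
print(len(S))               # 8: at most T nonzero singular values
```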