also this quote from the abstract is great:

“Together our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than inductive biases that create fundamental generalization capabilities.”

i used to call this something like “tackling the OOD generalization problem by simply making the distribution so wide that it encompasses anything you might want to use it on”