It’s worth noting that Jesse is mostly following the traditional “approximation, generalization, optimization” error decomposition from learning theory here—where “generalization” specifically refers to finite-sample generalization (gap between train/test loss), rather than something like OOD generalization. So e.g. a failure of transformers to solve recursive problems would be a failure of approximation, rather than a failure of generalization. Unless I misunderstood you?
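For concreteness, here is one standard way of writing that decomposition (my notation, not necessarily the exact formulation Jesse uses): let $R$ be the population risk, $R^*$ the Bayes risk, $h^*_{\mathcal{H}} = \arg\min_{h \in \mathcal{H}} R(h)$ the best hypothesis in the class, $\hat{h}_S$ the empirical risk minimizer on the sample $S$, and $\tilde{h}$ the hypothesis the optimizer actually returns. Then

$$R(\tilde{h}) - R^* = \underbrace{R(\tilde{h}) - R(\hat{h}_S)}_{\text{optimization}} + \underbrace{R(\hat{h}_S) - R(h^*_{\mathcal{H}})}_{\text{generalization (estimation)}} + \underbrace{R(h^*_{\mathcal{H}}) - R^*}_{\text{approximation}}.$$

The middle term is the finite-sample piece that the train/test gap controls, which is why a transformer that simply cannot express the recursive solution shows up in the approximation term rather than the generalization term.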
Ok, I understand now. You haven’t misunderstood me. I’m not sure what to do with my comment above now.
Thanks for raising that, it’s a good point. I’d appreciate it if you also cross-posted this to the approximation post here.
I’ll cross-post it soon.
I actually did it: https://www.lesswrong.com/posts/gq9GR6duzcuxyxZtD/?commentId=feuGTuRRAi6r6DRRK