I don’t know how much the difficulty of crossing the theory-practice gap has deviated from your expectations since then.
It’s been pretty on par.
But I would indeed be worried that a lot of the difficulty is going to be in getting any good results for deep learning, and that finding additional theoretical/conceptual results in other settings doesn’t constitute much progress on that.
Amusingly, I tend to worry more about the opposite failure mode: findings on today’s nets won’t generalize to tomorrow’s nets (even without another transformers-level paradigm shift), and therefore leveraging evidence from other places is the only way to do work which will actually be relevant.
(More accurately, I worry that the relevance or use-cases of findings on today’s nets won’t generalize to tomorrow’s nets. Central example: if we go from a GPT-style LLM to a much bigger o1/o3-style model which is effectively simulating a whole society talking to each other, then the relationship between the tokens and the real-world effects of the system changes a lot. So even if work on the GPT-style models tells us something about the o1/o3-style models, its relevance is potentially very different.)
I assume that was some other type of experiment involving image generators? (and the notion of “working well” there isn’t directly comparable to what you tried now?)
Yeah, that was on a little MNIST net. And the degree of success I saw in that earlier experiment was actually about on par with what we saw in our more recent experiments; our bar was just quite a lot higher this time around. This time we were aiming for things like “move one person’s head” rather than “move any stuff in any natural way at all”.