2 years ago, you wrote:

theoretical progress has been considerably faster than expected, while crossing the theory-practice gap has been mildly slower than expected. (Note that “theory progressing faster than expected, practice slower” is a potential red flag for theory coming decoupled from reality, though in this case the difference from expectations is small enough that I’m not too worried. Yet.)
I don’t know how much the difficulty of crossing the theory-practice gap has deviated from your expectations since then. But I would indeed be worried that a lot of the difficulty is going to be in getting any good results for deep learning, and that finding additional theoretical/conceptual results in other settings doesn’t constitute much progress on that. (But kudos for apparently working on image generator nets again!)
As a sidenote, your update from 2 years ago also mentioned:
I tried to calculate “local” natural abstractions (in a certain sense) in a generative image net, and that worked quite well.
I assume that was some other type of experiment involving image generators? (and the notion of “working well” there isn’t directly comparable to what you tried now?)
I don’t know how much the difficulty of crossing the theory-practice gap has deviated from your expectations since then.
It’s been pretty on par.
But I would indeed be worried that a lot of the difficulty is going to be in getting any good results for deep learning, and that finding additional theoretical/conceptual results in other settings doesn’t constitute much progress on that.
Amusingly, I tend to worry more about the opposite failure mode: findings on today’s nets won’t generalize to tomorrow’s nets (even without another transformers-level paradigm shift), and therefore leveraging evidence from other places is the only way to do work which will actually be relevant.
(More accurately, I worry that the relevance or use-cases of findings on today’s nets won’t generalize to tomorrow’s nets. Central example: if we go from a GPT-style LLM to a much bigger o1/o3-style model which is effectively simulating a whole society talking to each other, then the relationship between the tokens and the real-world effects of the system changes a lot. So even if work on the GPT-style models tells us something about the o1/o3-style models, its relevance is potentially very different.)
I assume that was some other type of experiment involving image generators? (and the notion of “working well” there isn’t directly comparable to what you tried now?)
Yeah, that was on a little MNIST net. And the degree of success I saw in that earlier experiment was actually about on par with what we saw in our more recent experiments; our bar was just quite a lot higher this time around. This time we were aiming for things like “move one person’s head” rather than “move any stuff in any natural way at all”.
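To make the kind of experiment being discussed a bit more concrete, here is a minimal, hypothetical sketch of one way to probe “local” directions of natural variation in a small generative image net. Everything in it is an assumption for illustration: the toy decoder, the Jacobian/SVD probe, and the perturbation scale are not the method actually used in either set of experiments.

```python
import torch

# Toy stand-in for a small trained generative image net (e.g. an MNIST decoder).
# The weights here are random, so the images are meaningless; the point is only
# to show the mechanics of probing "local" directions of natural variation.
decoder = torch.nn.Sequential(
    torch.nn.Linear(16, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 28 * 28),
    torch.nn.Sigmoid(),
)

z = torch.randn(16)  # latent point around which we look at local structure

# Jacobian of the image with respect to the latent point; its top right singular
# vectors are the latent directions producing the largest coherent image changes.
jac = torch.autograd.functional.jacobian(decoder, z)   # shape (784, 16)
_, _, vh = torch.linalg.svd(jac, full_matrices=False)
top_direction = vh[0]                                   # strongest local direction

# Nudge the latent along that direction and compare the decoded images; "working
# well" would mean the change looks like a natural edit rather than unstructured noise.
base_img = decoder(z).reshape(28, 28)
edited_img = decoder(z + 0.5 * top_direction).reshape(28, 28)
print((edited_img - base_img).abs().mean().item())
```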