While I’ve softened my position on this in the last year, I want to give a big +1 to this response, especially these two points:
It’s genuinely hard to come up with ideas that help capabilities a lot. I think you are severely underestimating how hard it is, and how much insight is required. I think one issue here is that most papers on arXiv are garbage and don’t actually make any progress, but those papers are not the ones that are pushing AGI forward anyways.
[..]
High level ideas are generally not that valuable in and of themselves. People generally learn to ignore ideas unless they have strong empirical evidence of correctness (or endorsement of highly respected researchers) because there are simply too many ideas. The valuable thing is not the idea itself, but the knowledge of which ideas are actually correct.
(emphasis added)
I think it’s often challenging to even understand where the frontier is, because it’s so far away and so many things are secret. And if you’re not at a scaling lab and also don’t keep up with the frontier of the literature, it’s natural to overestimate the novelty of your insights. Then, if you’re too scared to investigate your insights, you might continue to think that your ideas are better than they are. Meanwhile, as an AI Safety researcher, not only is there a lot less distance to the frontier of whatever subfield you’re in, but you’ll probably also spend most of your time doing work that keeps you on that frontier.
Random insights can be valuable, but the history of deep learning is full of random insights that were right but for arguably the wrong reasons (batch/layernorm, Adam, arguably the algorithm that would later be rebranded as PPO), as well as brilliant insights that turned out to be basically useless (e.g. consider a lot of the Bayesian neural network stuff, but there are really too many examples to list) if not harmful in the long run (e.g. lots of “clever” or not-so-clever ways of adding inductive bias). Part of the reason is that people don’t get taught the history of the field, so they don’t see all the oh-so-clever ideas that didn’t work, or how many of the “insights” were invented post hoc. So if you’re new to deep learning, you might get the impression that insights were more causally responsible for capabilities advancements than they actually were. Insofar as good alignment work requires deconfusion and rationality to generate good insights, and capabilities work does not, you should expect the insights you get from improving rationality or doing deconfusion to be more impactful for alignment than for capabilities.
I mean, if you actually do come up with a better initialization scheme, a trick that improves GPU utilization, or some other sort of cheap algorithmic trick to improve training AND check that it’s correct through some small/medium-scale empirical experiments, then sure, please reconsider publishing that. But it’s hard to do that incidentally. Even if you do come up with some insight while doing, say, mech interp, going out of your way to test your capability ideas should be a really obvious “you’re basically doing capabilities” sign, no? And maybe you should be doing the safety work you claim to want to do instead?
Is there anything you recommend for understanding the history of the field?