In practice I think if 10% of researchers are focused on safety, and none of them worry at all about capabilities externalities, you should expect them to accelerate overall progress by <1%.
I agree with your other points, but on this one:
It looks to me that some of the highest-value ideas come from safety folk. On my model there are some key things that are unusually concentrated among people concerned with AI safety, like any ability to actually visualize AGI, and a tendency to seek system designs more interesting than “stack more layers”.
Your early work on human feedback, extrapolated forward by others, seems like a prime example here, at least of a design idea that took off and is looking quite relevant to capabilities progress? And it continues to mostly be pushed forward by safety folk afaict.
I anticipate that the mechanistic interpretability folk may become another example of this, by inspiring and enabling other researchers to invent better architectures (e.g. https://arxiv.org/abs/2212.14052).
Maybe the RL with world models stuff (https://worldmodels.github.io/) is a counterexample, in which non-“safety” folk are successfully pushing the envelope in a non-standard way. I think they might be in our orbit though.
I agree that safety people have lots of ideas more interesting than “stack more layers”, but they mostly seem irrelevant to progress. People working in AI capabilities also have plenty of such ideas, and one of the most surprising and persistent inefficiencies of the field is how consistently it overweights clever ideas relative to just spending the money to stack more layers. (I think this is largely down to sociological and institutional factors.)
Indeed, to the extent that AI safety people have plausibly accelerated AI capabilities, I think it’s almost entirely by correcting that inefficiency faster than might have happened otherwise, especially via OpenAI’s training of GPT-3. But that wasn’t a case of safety people incidentally benefiting capabilities as a byproduct of their work; it was a case of some people who care about safety deliberately doing something they thought would be a big capabilities advance. I think such deliberate efforts are much more plausible as a source of acceleration!
(I would describe RLHF as pretty prototypical: “Don’t be clever, just stack layers and optimize the thing you care about.” I feel like people on LW are being overly mystical about it.)
tbc, I don’t feel very concerned by safety-focused folk who are off working on their own ideas. I think the more damaging things are (1) trying to garner prestige with leading labs and the AI field by making transformative ideas work (which I think is a large factor in ongoing RLHF efforts?); and (2) trying to “wake up” the AI field into a state of doing much more varied stuff than “stack layers”.