Can you give some historical examples of work that lowered the amount-of-serial-research-left-till-doom? And examples of work that didn’t? I ask because an advance in alignment is often a direct advance in capabilities, and I’m a little confused about the spectrum of possibilities.
Here’s an example of my confusion. Clearly interpretability work is mostly good, right? Exploring semantic superposition and other current advances seems clearly beneficial to publish, in spite of the fact that it advances capabilities. If we progress to the point where we can interpret the algorithms that a smallish NN is using, that still seems fine. But if interpretability research progresses to the point where we can decode the algorithms an NN is running, then the techniques that allow that level of interpretability are quite dangerous. For example, if we find that large NNs have some kind of proto-general search which could easily be amplified into a general agent, then it would be pretty bad if every AGI organization could find this out just by applying standard interpretability tool X. Or is that kind of work still worth publishing, because powerful interpretability would make alignment way easier and that outweighs the risk of reducing serial research time till doom?
I don’t know Nate’s response, but his take on agent-foundations-ish research in A note about differential technological development (and the fact that he and MIRI have been broadly pro-interpretability-work to date) might help clarify how he thinks about cases like this.
[...]
I feel relatively confident that a large percentage of people who do capabilities work at OpenAI, FAIR, DeepMind, Anthropic, etc. with justifications like “well, I’m helping with alignment some too” or “well, alignment will be easier when we get to the brink” (more often EA-adjacent than centrally “EA”, I think) are currently producing costs that outweigh the benefits.
Some relatively niche and theoretical agent-foundations-ish research directions might yield capabilities advances too, and I feel much more positive about those cases. I’m guessing it won’t work, but it’s the kind of research that seems positive-EV to me and that I’d like to see a larger network of researchers tackling, provided that they avoid publishing large advances that are especially likely to shorten AGI timelines.
The main reasons I feel more positive about the agent-foundations-ish cases I know about are:
- The alignment progress in these cases appears to me to be much more serial, compared to the vast majority of alignment work the field outputs today.
- I’m more optimistic about the total amount of alignment progress we’d see in the worlds where agent-foundations-ish research so wildly exceeded my expectations that it ended up boosting capabilities. Better understanding optimization in this way really would seem to me to take a significant bite out of the capabilities generalization problem, unlike most alignment work I’m aware of.
- The kind of people working on agent-foundations-y work aren’t publishing new ML results that break SotA. Thus I consider it more likely that they’d avoid publicly breaking SotA on a bunch of AGI-relevant benchmarks given the opportunity, and more likely that they’d only direct their attention to this kind of intervention if it seemed helpful for humanity’s future prospects.
  (Footnote: On the other hand, weirder research is more likely to shorten timelines a lot, if it shortens them at all. More mainstream research progress is less likely to have a large counterfactual impact, because it’s more likely that someone else has the same idea a few months or years later. “Low probability of shortening timelines a lot” and “higher probability of shortening timelines a smaller amount” both matter here, so I advocate that both niche and mainstream researchers be cautious and deliberate about publishing potentially timelines-shortening work.)
- Relatedly, the energy and attention of ML is elsewhere, so if they do achieve a surprising AGI-relevant breakthrough and accidentally leak bits about it publicly, I put less probability on safety-unconscious ML researchers rushing to incorporate it.
I’m giving this example not to say “everyone should go do agent-foundations-y work exclusively now!”. I think it’s a neglected set of research directions that deserves far more effort, but I’m far too pessimistic about it to want humanity to put all its eggs in that basket.
Rather, my hope is that this example clarifies that I’m not saying “doing alignment research is bad” or even “all alignment research that poses a risk of advancing capabilities is bad”.