And this part is what Robin Hanson predicted about a decade ago. If I remember it correctly, he wrote that AI Safety was a low-status thing, and therefore everyone associated with it was low-status. And if AI Safety ever becomes a high-status thing, then the people in the field will not want to be associated with their low-status predecessors. So instead of referencing them, an alternative history will be established, in which someone high-status is credited with creating the field from scratch (perhaps drawing some inspiration from high-status people in adjacent fields).
As someone who writes these kinds of papers, I make an effort to cite the original inspirations when possible. And although I agree with Robin's theory broadly, there are also some mechanical reasons why Yudkowsky in particular is hard to cite.
As a reader, the most valuable things about the academic paper style are:
1) Having a clear, short summary (the abstract)
2) Stating the claimed contributions explicitly
3) Using standard jargon, or if not, noting so explicitly
4) Including a related work section that contrasts one's own position against others'
5) Being explicit about what evidence you're marshalling and where it comes from
6) Listing main claims explicitly
7) The best papers also include a "limitations" or "why I might be wrong" section
Yudkowsky mostly doesn’t do these things. That doesn’t mean he doesn’t deserve credit for making a clear and accessible case for many foundational aspects of AI safety. It’s just that in any particular context, it’s hard to say what, exactly, his claims or contributions were.
In this setting, maybe the most appropriate citation would be something like "as illustrated in many thought experiments by Yudkowsky [cite particular sections of the Sequences and HPMOR], it's dangerous to rely on any protocol for detecting scheming by agents more intelligent than oneself". But that's a pretty broad claim. Maybe I'm being unfair, but it's not clear to me what exactly Yudkowsky's work says about the workability of these schemes beyond "there be dragons here".