(FWIW I think Chris Olah’s work is approximately irrelevant to alignment and indeed this is basically fully explained by the motivational dimension)
Whose work is relevant, according to you?
Lots of people’s work:
Paul’s work (ELK more than RLHF, though RLHF was useful for seeing what happens when you throw RL at LLMs, which is kind of similar to how I get some value out of Chris’s work)
Eliezer’s work
Nate’s work
Holden’s writing on Cold Takes
Ajeya’s work
Wentworth’s work
The debate stuff
Redwood’s work
Bostrom’s work
Evan’s work
Scott and Abram’s work
There is of course still huge variance in how relevant these different people’s work is and how much it goes for the throat of the problem, but all of these seem more relevant to AI Alignment/AI-not-kill-everyoneism than Chris’s work (which, again, I found interesting, but not super interesting).
Do you mean Evan Hubinger, Evan R. Murphy, or a different Evan? (I would be surprised and humbled if it were me, though my priors on that are low.)
Hubinger
Definitely not trying to put words in Habryka’s mouth, but I did want to make a concrete prediction to test my understanding of his position; I expect he will say that:
the only relevant work is work that directly tackles what Nate Soares described as “the hard bits of the alignment challenge” (and Habryka basically agrees with Soares about what those hard bits are)
nobody is fully on the ball yet
but agent foundations-like research by MIRI-aligned or formerly MIRI-aligned people (Vanessa Kosoy, Abram Demski, etc.) is the one that’s most relevant, in theory
however, in practice, even that is kinda irrelevant because timelines are short and that work is progressing too slowly to be useful even for deconfusion purposes
Edit: I was wrong.
To clarify: are you saying that since you perceive Chris Olah as mostly intrinsically caring about understanding neural networks (instead of mostly caring about alignment), you conclude that his work is irrelevant to alignment?
No, I have detailed inside-view models of the alignment problem, and under those models I consider Chris Olah’s work to be interesting but close to irrelevant (or about as relevant as the work of top capability researchers, whose work, to be clear, does have some relevance, since understanding how to make systems better is relevant to understanding how AGI will behave, but whose relevance is pretty limited).