By contrast, some lines of research where I’ve seen compelling critiques (and haven’t seen compelling defences) of their core intuitions, and therefore don’t recommend to people:
The first critique of natural abstractions says:

Concluding thoughts on relevance to alignment: While we’ve made critical remarks on several of the details, we also want to reiterate that overall, we think (natural) abstractions are an important direction for alignment and it’s good that someone is working on them! In particular, the fact that there are at least four distinct stories for how abstractions could help with alignment is promising.
The second says:
I think this is a fine dream. It’s a dream I developed independently at MIRI a number of years ago, in interaction with others. A big reason why I slogged through a review of John’s work is because he seemed to be attempting to pursue a pathway that appeals to me personally, and I had some hope that he would be able to go farther than I could have.
Neither of them seemed, to me, to be critiques of the “core intuitions”; rather, the opposite: both suggested that the core intuitions seemed promising and that the weaknesses were elsewhere. That suggests that natural abstractions might be a better-than-average target for incoming researchers, not a worse one.
I have some other disagreements, but those are model-level disagreements; that piece of advice in particular seems misguided even under your own models. I think I agree with the overall structure and most of the prioritization (though I would put scalable oversight lower, or focus on those bits that Joe points out are the actual deciding factors for whether that entire class of approaches is worthwhile; that seems more like “alignment theory with respect to scalable oversight”).
Good point. Will edit.