Thank you for writing this up – I think this post, along with the others in the series, does a good job of describing ARC’s big-picture alignment plan, common objections, our usual responses, and why you find those responses uncompelling.
In my personal opinion (not necessarily shared by everyone at ARC), the best case for our research agenda comes neither from the specific big-picture plan you are critiquing here, nor from “something good falling out of it along the way” (although that is part of my motivation), but instead from some intermediate goal along the lines of “a formal framework for heuristic arguments that is well-developed enough that we can convincingly apply it to neural networks”. If we can achieve that, it seems quite likely to me that it will be useful for something, for essentially the same reason we would expect exhaustive mechanistic interpretability to be useful for something (and probably quite a lot). On this view, fleshing out the LPE (low probability estimation) and MAD (mechanistic anomaly detection) applications is important as a proof of concept and for refining our plans, but those applications remain subject to revision.
This isn’t meant to downplay your objections too much. The ones that loom largest in my mind are false positives in MAD, small estimates being “lost in the noise” for LPE, and the whole minefield of empirical regularities (all of which you do justice to). Paul still seems to think we can resolve all of these issues, so hopefully we will get to the bottom of them at some point, although in the short term we are focused on the more limited dream of heuristic arguments for neural networks (and the instances of LPE we think they ought to enable).
A couple of your objections apply even to this more limited dream, though, especially the ones under “Explaining everything” and “When and what do we explain?”. But your arguments there seem to boil down to “that seems incredibly daunting and ambitious”, which I basically agree with. I still come down on the side of thinking it is a promising target, but I do think ARC’s top priority should be to come up with concrete cruxes here and put them to the test, which is our primary research focus at the moment.
I thought about this a bit more (and discussed it with others) and decided that you are basically right that we can’t avoid the question of empirical regularities for any realistic alignment application, if only because any realistic model with potential alignment challenges will be trained on empirical data. The only potential application we came up with that might avoid this is LPE for a formalized distribution and a formalized catastrophe event, but we didn’t find it especially compelling, for several reasons.[1]
To me, the challenges of dealing with empirical regularities do not seem bigger than the challenges of formal heuristic explanations themselves, but the former should become much more concrete once we have a notion of heuristic explanations to work with, so it seems easier to resolve them in that order. But I have moved in your direction, and it does seem worth our while to address both sets of challenges in parallel to some extent.
[1] Reasons include: (a) the model is still trained on empirical data, so we would need to explain only the things relevant to the formal event, rather than everything relevant to the model’s loss; (b) we would also need to hope that empirical regularities aren’t required to explain purely formal events, which remains unclear; and (c) the restriction to formal distributions and events limits the value of the application.