First, I think this class of work is critical for deconfusion, which in turn is critical if we need a theory for far more powerful AI systems, rather than for very smart but still fundamentally human-level systems.
Secondly, concretely, it seems that very few other approaches to safety have the potential to provide enough fundamental understanding to allow us to make strong statements about models before they are fully trained. This seems like a critical issue if we are concerned about very strong models that could pose risks during testing, or possibly even during training. And as far as I’m aware, nothing in the interpretability and auditing spaces has a real claim to be able to make clear statements about those risks, other than perhaps to suggest interim testing during model training—which could work, if a huge amount of such work is done, but seems very unlikely to happen.
Edit to add: Given the votes on this, what specifically do people disagree with?
I don’t strongly disagree, but I do weakly disagree on some points, so I guess I’ll answer.
Re first: if you buy into automated alignment work by human-level AGI, then trying to align ASI now seems less worthwhile. The strongest counterargument I see is that “human-level AGI” is impossible to get with our current understanding, as it will be superhuman at some things and weirdly bad at others.
Re second: disagreements might be nitpicking over “few other approaches” vs. “few currently pursued approaches”. There are probably a bunch of things that would allow fundamental understanding if they panned out (various agent foundations agendas, provably safe AI agendas like davidad’s), though one can argue they won’t apply to deep learning or are less promising to explore than SLT.
In addition to the point that current models are already strongly superhuman in most ways, I think that even if you buy the idea that we’ll be able to do automated alignment of ASI, you’ll still need some reliable approach to “manual” alignment of current systems. We’re already far past the point where we can robustly verify LLMs’ claims or reasoning outside of narrow domains like programming and math.
But on point two, I strongly agree that agent foundations and Davidad’s agendas are also worth pursuing. (And in a sane world, we would have tens or hundreds of millions of dollars in funding for each, every year.) Instead, it looks like we have Davidad’s ARIA funding, Jaan Tallinn and LTFF funding some agent foundations and SLT work, and that’s basically it. And MIRI abandoned agent foundations, while OpenPhil isn’t, it seems, putting money or effort into them.