Sam Bowman literally outlines the exact plan Eliezer Yudkowsky constantly warns not to use, and which the Underpants Gnomes know well.
1. Preparation (You are Here)
2. Making the AI Systems Do Our Homework (?????)
3. Life after TAI (Profit)
His tasks for chapter 1 start off with ‘not missing the boat on capabilities.’ Then, he says, we must solve near-term alignment of early TAI, render it ‘reliably harmless,’ so we can use it. I am not even convinced that ‘harmless’ intelligence is a thing if you want to be able to use it for anything that requires the intelligence, but here he says the plan is safeguards that would work even if the AIs tried to cause harm. Ok, sure, but obviously that won’t work if they are sufficiently capable and you want to actually use them properly.
I do love what he calls ‘the LeCun test,’ which is to design sufficiently robust safety policies (a Safety and Security Protocol, what Anthropic calls an RSP) that if someone who thinks AGI safety concerns are bullshit is put in charge of that policy at another lab, that would still protect us, at minimum by failing in a highly visible way before it doomed us.
The plan then involves solving interpretability, implementing sufficient cybersecurity, and developing proper legible evaluations for higher capability levels (what they call ASL-4 and ASL-5) that can also be used by third parties. It also involves doing generally good things like improving societal resilience, building adaptive infrastructure, creating well-calibrated forecasts and producing smoking gun demos of emerging risks. All that certainly helps, though I’m not sure it counts as a ‘checklist’ per se. Importantly, the list includes ‘preparing to pause or de-deploy.’
He opens part 2 of the plan (‘chapter 2’) by saying lots of the things in part 1 will still not be complete. Okie dokie. There is more talk of concern about AI welfare, which I continue to be confused about, and a welcome emphasis on true cybersecurity, but beyond that this is simply more ways to say ‘properly and carefully do the safety work.’ What I do not see here is an actual plan for how to do that, or an explanation of why this checklist would be sufficient.
Then part 3 is basically ‘profit,’ and boils down to making good decisions to the extent the government or AIs are not dictating your decisions. He notes that the most important decisions are likely already made once TAI arrives – if you are still in any position to steer outcomes, that is a sign you did a great job earlier. Or perhaps you did such a great job that step 3 can indeed be ‘profit.’
The worry is that this is essentially saying ‘we do our jobs, solve alignment, it all works out.’ That doesn’t really tell us how to solve alignment, and has the implicit assumption that this is a ‘do your job’ or ‘row the boat’ (or even ‘play like a champion today’) situation. Whereas I see a very different style of problem. You do still have to execute, or you automatically lose. And if we execute on Bowman’s plan, we will be in a vastly better position than if we do not do that. But there is no script.
Commentary by Zvi in one of his AI posts, copied over since it seems nice to have it available for people reading this post:
I’ve now seen this meme overused to such a degree that I find it hard to take seriously anything written after. To me it just comes across as unserious if somebody apparently cannot imagine how this might happen, even after obvious (to me, at least) early demos/prototypes have been published, e.g. https://sakana.ai/ai-scientist/, Discovering Preference Optimization Algorithms with and for Large Language Models, A Multimodal Automated Interpretability Agent.
On a positive note, though, at least they didn’t also bring up the ‘Godzilla strategies’ meme.
For what it’s worth, as someone in basically the position you describe—I struggle to imagine automated alignment working, mostly because of Godzilla-ish concerns—demos like these do not strike me as cruxy. I’m not sure what the cruxes are, exactly, but I’m guessing they’re more about things like e.g. relative enthusiasm about prosaic alignment, relative likelihood of sharp left turn-type problems, etc., than about whether early automated demos are likely to work on early systems.
Maybe you want to call these concerns unserious too, but regardless I do think it’s worth bearing in mind that early results like these might seem like stronger/more relevant evidence to people whose prior is that scaled-up versions of them would be meaningfully helpful for aligning a superintelligence.