One somewhat obvious thing to do with ~human-level systems with “initial loose alignment” is to use them to automate alignment research (e.g. the superalignment plan). I think this kind of two-step plan is currently the best we have, and probably by quite some margin. There are many more details on why I believe this in these slides and in this AI Safety Camp ’24 proposal.
I made time to go through your slides. You appear to be talking about LLMs, not language model agents. That’s what I’m addressing.
If you can align language model agents, using them to align a different type of AGI would be somewhat beside the point in most scenarios (unless they progressed so slowly that another type of AGI overtook them before they could pull off a pivotal act using LMA AGI).
I don’t see a barrier to LMAs achieving full, agentic AGI. And I think they’ll be so useful and interesting that they’ll inevitably be built, and fairly efficiently at that.
I don’t quite understand why others don’t agree that this will happen. Perhaps I’ll write a question post asking why.
Agreed, I do mostly discuss LLMs, but I think there’s significant overlap between aligning LLMs and aligning LMAs.
I also agree that LMAs could scale all the way, but once you get ~human-level automated alignment research, its likely applicability to other types of systems (beyond LLMs and LMAs) should still be a nice bonus.