This is a fantastic post. Big upvote.
I couldn’t agree more with your opening and ending thesis, which you put ever so gently:

the current portfolio is over-indexed on work which treats “transformative AI” as a black box
It seems obvious to me that trying to figure out alignment without talking about AGI designs is going to be highly confusing. It also seems likely to stop short of a decent estimate of the difficulty. It’s hard to judge whether a plan is likely to fail when there’s no actual plan to judge. And it seems like any actual plan for alignment would reference a way AGI might use knowledge and make decisions.
WRT the language model agent route, you’ve probably seen my posts, which are broadly in agreement with your take:
Capabilities and alignment of LLM cognitive architectures
Internal independent review for language model agent alignment
The second post focuses more on the range of alignment techniques applicable to LMAs/LMCAs. I wind up rather optimistic, particularly when the target of alignment is corrigibility or DWIM-and-check (do what I mean, and check).
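For concreteness, here is roughly the shape I have in mind for DWIM-and-check in an LMA loop, as a minimal Python sketch: the agent restates its interpretation of the instruction and proposes a plan, a second call checks that plan before anything is executed, and flagged plans go back to the human. The names and prompts here (dwim_and_check, llm, execute) are placeholder assumptions for illustration, not an interface from either post.

```python
# Minimal sketch of a DWIM-and-check loop for a language model agent.
# `llm` and `execute` are hypothetical callables supplied by whoever builds
# the agent; nothing here is an API from the posts linked above.
from typing import Callable


def dwim_and_check(
    instruction: str,
    llm: Callable[[str], str],       # LLM call: prompt in, text out
    execute: Callable[[str], str],   # carries out an approved plan
) -> str:
    # 1. Have the model restate what it thinks the user means and propose a plan.
    proposal = llm(
        f"User instruction: {instruction}\n"
        "Restate what the user most plausibly means, then propose a concrete plan."
    )

    # 2. Independent check: a second call (ideally a fresh context or a different
    #    model) reviews the plan against the instruction before anything runs.
    verdict = llm(
        f"Instruction: {instruction}\nProposed plan: {proposal}\n"
        "Does this plan match the user's likely intent, without irreversible or "
        "high-impact side effects? Answer APPROVE or FLAG, with a reason."
    )

    # 3. Act only on approval; otherwise pause and surface the concern to a human.
    if verdict.strip().upper().startswith("APPROVE"):
        return execute(proposal)
    return f"Paused for human review: {verdict}"
```

The same shape works whether the checker is a separate model or an internal review step in the agent’s own script.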
It seems like even if LMAs achieve AGI, they might progress only slowly beyond the roughly human level of the LLMs they’re built on, since that training comes from human-generated data. That could be a really good thing. I want to think about this more.
I’m unsure how much to publish on possible routes. Right now it seems to me that advancing progress on LMAs is actually a good thing, since they’re more transparent and directable than any other AGI approach I can think of. But I don’t trust my own judgment when there’s been so little discussion from the hardcore alignment-is-hard crowd.
It boggles my mind that posts like this, forecasting real routes to AGI and alignment, don’t get more attention and discussion. What exactly are people hoping for as alignment solutions if not work like this?
Again, great post, keep it up.