I think that the first AGIs will, for practical reasons, very likely have an outer alignment goal of “do what I mean and check”, where “I” refers to the AGI's creator(s), not each user. This solution postpones the hard problem you address, which I think of as outer alignment, while allowing the creators to correct mistakes in alignment if they're not too bad. It's also simpler to define than most humanity-serving outer alignment goals. I've written a little more on this logic in Corrigibility or DWIM is an attractive primary goal for AGI.
I'm finding this logic really compelling with regard to what people will actually do, and I wonder if others will too.
This isn't necessarily a bad thing. The current leaders in AGI development have fairly libertarian and humanistic values, or at least they claim to, and they sound credible. Having them weigh the questions you raise carefully, at length, and with plenty of input could produce a really good set of values.
If you're more concerned with what LLMs produce, I'm not. I don't think they'll reach “true” AGI on their own. If they're the basis of AGI, they will have explicitly stated goals as the top-level instructions in a language model cognitive architecture. For an expansion of that logic, see https://www.lesswrong.com/posts/ogHr8SvGqg9pW5wsT/capabilities-and-alignment-of-llm-cognitive-architectures