It sounds like you’re thinking of mass deployment. I think if every average joe has control of an AGI capable of recursive self-improvement, we are all dead.
I’m assuming that whoever develops this might allow others to use parts of its capacities, but definitely not all of them.
So we’re in a position where the actual principal(s) are among the smarter people, and at least not bottom-of-the-barrel impulsive and foolish. Whether that’s good enough, who knows.
So your points about ways the AI’s wisdom will be ignored should mostly be limited to the “safe” limited versions. I totally agree that the wisdom of the AGI will be limited. But it will grow as its capabilities grow. I’m definitely anticipating it learning after deployment, not just with retraining of its base LLMs. That’s not hard to implement, and it’s a good way to leverage a different type of human training.
I agree that defending the world will require some sort of pivotal act. Optimistically, this would be something like major governments agreeing to outlaw further development of sapient AGIs, and then enforcing that using their AGIs’ superior capabilities. And yes, that’s creepy. I’d far prefer your option 2, value-aligned, friendly sovereign AGI. I’ve always thought that was the win condition if we solve alignment. But now it’s seeming vastly more likely we’re stuck with option 1. It seems safer than attempting 2 until we have a better option, and more appealing to those in charge of AGI projects.
I don’t see a better option on the table, even if language model agents don’t happen sooner than brainlike AGI that would allow your alignment plans to work. Your plan for mediocre alignment seems solid, but I don’t think the stability problem is solved, so aligning it to human flourishing might well go bad as it updates its understanding of what that means. Maybe reflective stability would be adequate? If we analyzed it some more and decided it was, I’d prefer that plan. Otherwise I’d want to align even brainlike AGI to just follow instructions, so that it can be shut down if it starts going off-course.
I guess the same logic applies to language model agents. You could just give it a top-level goal like “work for human flourishing”, and if reflective stability is adequate and there’s no huge problem with that definition, it would work. But who’s going to launch that instead of keeping it under their control, at least until they’ve worked with it for a while?