Yes. Good point that LLMs are sort of value-aligned as it stands.
I think of that alignment as far too weak to belong in the same category as what I’m describing. I’d be shocked if that sort of RL alignment is sufficient to create durable alignment in smarter-than-human scaffolded agent systems built on those foundation models.
When they achieve “coherence” or begin reflection and self-modification, I’d be surprised if their implicit values, once refined into explicit values, are good enough to create a good future without further tweaking. And we won’t be able to do that tweaking once they’re smart enough to escape our control.