Currently, an open source value-aligned model can be easily modified into a merely intent-aligned model. The alignment isn't 'sticky'; it's easy to remove without substantially impacting capabilities.
So unless this changes, the hope of peace through value-aligned models routes through hoping that the people in charge of them are sufficiently ethical and value-aligned themselves not to turn the model into a purely intent-aligned one.
Yes. Good point that LLMs are sort of value aligned as it stands.
I think of that alignment as far too weak to put it in the same category as what I’m speaking of. I’d be shocked if that sort of RL alignment is sufficient to create durable alignment in smarter-than-human scaffolded agent systems using those foundation models.
When they achieve "coherence" or reflection and self-modification, I'd be surprised if their implicit values, once refined into explicit values, are good enough to create a good future without further tweaking, and we won't be able to do that tweaking once they're smart enough to escape our control.
Agreed, "sticky" alignment is a big issue; see my reply above to Seth Herd's comment. Thanks.