Currently, an open source value-aligned model can be easily modified into a merely intent-aligned model. The alignment isn't 'sticky'; it's easy to remove without substantially impacting capabilities.
So unless this changes, the hope of peace through value-aligned models routes through hoping that the people in charge of them are sufficiently ethical and value-aligned themselves not to turn the model into a purely intent-aligned one.
Yes. Good point that LLMs are sort of value aligned as it stands.
I think of that alignment as far too weak to put it in the same category as what I’m speaking of. I’d be shocked if that sort of RL alignment is sufficient to create durable alignment in smarter-than-human scaffolded agent systems using those foundation models.
When they achieve "coherence" or reflection and self-modification, I'd be surprised if their implicit values, once refined into explicit values, are good enough to create a good future without further tweaking, and we won't be able to do that tweaking once they're smart enough to escape our control.
Agreed, "sticky" alignment is a big issue; see my reply above to Seth Herd's comment. Thanks.