I guess I didn’t address RSI in enough detail. The general idea is to have a human in the loop during RSI, and to talk extensively with the current version of your AGI about how this next improvement could disrupt its alignment before you launch it.
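To make that concrete, here's a rough sketch of the kind of gating loop I have in mind. Every name in it (Proposal, Model, gated_rsi_step, human_approves) is hypothetical scaffolding invented for this comment, not a claim about how any real lab's pipeline works.

```python
# A minimal, purely illustrative sketch of the gating idea above.
# Every name here (Proposal, Model, gated_rsi_step, ...) is hypothetical
# scaffolding invented for this comment, not any real system's API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Proposal:
    """One candidate self-improvement, in whatever form it takes."""
    description: str
    change: str


class Model:
    """Stand-in interface for the current AGI version."""
    def propose_improvement(self) -> Proposal: ...
    def discuss_alignment_risk(self, proposal: Proposal) -> str: ...
    def apply(self, proposal: Proposal) -> "Model": ...


def gated_rsi_step(
    model: Model,
    human_approves: Callable[[Proposal, str], bool],
) -> Model:
    """One RSI step with a human in the loop."""
    # The current model proposes its own next improvement.
    proposal = model.propose_improvement()
    # Ask the current version how this change could disrupt its alignment,
    # before anything is launched.
    risk_report = model.discuss_alignment_risk(proposal)
    # A human reviews the proposal plus the model's own risk analysis;
    # nothing ships without explicit sign-off.
    if not human_approves(proposal, risk_report):
        return model  # rejected: keep running the current version
    return model.apply(proposal)  # launch only after approval
```

The point isn't the code itself, just that every proposed improvement passes through the current version's own risk assessment and an explicit human decision before launch.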
WRT “I don’t want this attempted in any light-cone I inhabit”, well, neither do I. But we’re not in charge of the light cone.
All we can do is convince the people who currently very much ARE on the road to attempting exactly this to not do it—and saying “it’s way too risky and I refuse to think about how you might actually pull it off” is not going to do that.
Or else we can try to make it work if it is attempted.
Both paths to survival involve thinking carefully about how alignment could succeed or fail on our current trajectory.
WRT “I don’t want this attempted in any light-cone I inhabit”, well, neither do I. But we’re not in charge of the light cone.
That really is a true and relevant fact, isn’t it? 😭
It seems like aligning humans really is much more of a bottleneck right now than aligning machines, and not because we are at all on track to align machines.
I think you are correct about the need to be pragmatic. My fear is that there may not be anywhere on the scale from “too pragmatic, failed to actually align ASI” to “too idealistic, failed to engage with actual decision makers running ASI projects” where we get good outcomes. It’s stressful.