Unless we still have adequate mech interp or natural-language train-of-thought monitoring to detect deceptive alignment
We die (don’t let your AGI fuck this step up!:)
22 chained independent alignment attempts do sound like too many. Hubinger specified that he wasn't thinking of daisy-chaining like that, but of having one trusted agent that keeps itself aligned as it grows smarter.
We die (don’t fuck this step up!:)
The endgame is to use intent alignment as a stepping-stone to value alignment, then let something more competent and compassionate than us monkeys handle things from there on out.