Unless we still have adequate mech interp or natural-language train-of-thought monitoring to detect deceptive alignment
We die (don’t let your AGI fuck this step up!:)
22 chained independent alignment attempts do sound like too many. Hubinger specified that he wasn't thinking of daisy-chaining like that, but of having one trusted agent that keeps itself aligned as it grows smarter.
We die (don’t fuck this step up!:)
The endgame is to use intent alignment as a stepping-stone to value alignment, then let something more competent and compassionate than us monkeys handle things from there on out.