I’d say “If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human ‘moral experts’ are going to disagree about], then we’ve already mostly-won” is an accurate correlation, but doesn’t stand up to optimization pressure. We can’t mostly-win just by fine-tuning a language model to do moral discourse. I’d guess you agree?
English sentences don’t have to hold up to optimization pressure; our AI designs do. If I say “I’m hungry for pizza after I work out”, you could reply “that doesn’t hold up to optimization pressure—I can imagine universes where you’re not hungry for pizza.” It’s like… okay, but that misses the point? There’s an implicit notion here of “if you told me that we had built AGI and it got hung up on exotic moral questions, I would expect that we had mostly won.”
Perhaps this notion isn’t obvious to all readers, and maybe it is worth spelling out, but as a writer I find myself somewhat exhausted by the need to include this kind of disclaimer.
Furthermore, what would be optimized in this situation? Is there a dissatisfaction genie that optimizes outcomes against realizations technically permitted by our English sentences? I think it would be more accurate to say “this seems true in the main, although I can imagine situations where it’s not.” Maybe this is what you meant, in which case I agree.