If AIs are only human-level good at staying aligned, they might undergo value shifts that seem obvious only in hindsight, the same way our shifts relative to humans 500 years ago seem obvious to us now, and that leave them similarly misaligned. This would still represent significant progress over where we are today, but it isn’t what I’d like to shoot for.
And of course a major reason humans are “human-level good at staying aligned” is that we can’t edit our own source code or add extra grey matter. This won’t be true for AGI, so “just copy a human design into silicon” probably fails.