However, wouldn’t it research further and correct itself (and, before that, take care not to do anything uncorrectable)?
Check out the Cake or Death value loading problem, as Stuart Armstrong puts it.
There’s a rough similarity to the ‘resist blackmail’ problem: the AI needs to be able to tell the difference between someone delivering bad news and someone doing bad things. If the AI is mistaken about what is right, we want to be able to correct it without it interpreting us as villains out to destroy potential utility.
(Also, “correctable” isn’t really a natural category at the low level of physical reality, since the passage of time means nothing is ever truly correctable, only approximately undone.)
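To make the Cake or Death failure mode concrete, here’s a minimal toy sketch in Python. The payoffs, the 50/50 prior, and the two-agent framing are illustrative assumptions of mine, not Armstrong’s exact formalism: a proper value learner treats the programmers’ answer as evidence about a fixed fact, while a naive value-loading setup scores that answer under its current estimate, so a correction toward the harder-to-satisfy value reads as a straight utility loss rather than as news.

```python
# Toy model of Stuart Armstrong's "Cake or Death" value loading problem.
# All numbers and the two-agent framing are illustrative assumptions.

P_CAKE = 0.5                      # prior that "cake" is the right value
OUTPUT = {"cake": 1, "death": 3}  # units producible per unit of effort
                                  # (deaths stipulated to be cheaper)

def proper_value_learner() -> str:
    """Treats the programmers' answer as evidence about a fixed fact.

    Asking is free information: afterwards the agent acts on the true
    value, so asking weakly dominates acting blindly.
    """
    eu_ask = P_CAKE * OUTPUT["cake"] + (1 - P_CAKE) * OUTPUT["death"]  # 2.0
    # Acting blindly on the cheaper option risks producing three units
    # of the *wrong* thing, which count for nothing.
    eu_act_blind = (1 - P_CAKE) * OUTPUT["death"]  # 1.5
    return "ask" if eu_ask >= eu_act_blind else "act blindly"

def naive_value_loader() -> str:
    """Utility is defined by whatever the agent ends up being told.

    Such an agent scores the answer under its *current* estimate, so a
    correction toward the harder-to-satisfy value ("cake") looks like a
    utility loss: bad news is indistinguishable from an attack.
    """
    eu_if_told_cake = OUTPUT["cake"]    # forced onto the expensive goal
    eu_if_told_death = OUTPUT["death"]  # keeps the cheap goal
    eu_ask = P_CAKE * eu_if_told_cake + (1 - P_CAKE) * eu_if_told_death  # 2.0
    # If it never asks, nothing pins its utility down, so it is free to
    # act on the cheaper interpretation and count that as full value.
    eu_avoid = OUTPUT["death"]  # 3.0
    return "ask" if eu_ask >= eu_avoid else "avoid asking"

print("proper value learner:", proper_value_learner())  # -> ask
print("naive value loader:  ", naive_value_loader())    # -> avoid asking
```

The contrast is the point made above: under the naive setup the agent can’t tell bad news apart from an attack on its utility, so it prefers not to be corrected at all.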