The thing underlying the intuition is something more like: we have a method of feedback that humans understand, that works fairly well, and that is adapted to the way values are stored in human brains. If we try to have humans give feedback in ways that are not adapted to that, I expect information to be lost. The fact that it “feels natural” is a proxy for “the method of feedback to machines is adapted to the way humans normally give feedback to other humans,” without which I am at least concerned about information loss (not claiming it’s inevitable). I don’t inherently care about the “feeling” of naturalness.
Regarding no Safe Natural Intelligence: I agree that there is no such thing, but this is not really a strong argument against? It doesn’t make me suddenly feel comfortable about “unnatural” (I need a better term) methods for humans to provide feedback to AI agents. The fact that there are bad people doesn’t negate the idea that the only source of information about what is good seems to be human brains, and that we need to extract this information in a way that is adapted to how those brains normally express it.
Maybe I should have called it “human-adapted methods of human feedback” or something.
Regarding no Safe Natural Intelligence: I agree that there is no such thing, but this is not really a strong argument against?
I think it’s a pretty strong argument. There are no humans I’d trust with the massively expanded capabilities that AI will bring, so I have to believe that the training methods for humans are insufficient.
We WANT divergence from “business as usual” human beliefs and actions, and one of the ways to get there is through different specifications and training mechanisms. The hard part is that we don’t yet know how to specify precisely how we want it to differ.