In your hypothetical, Clippy is trained to care about both paperclips and humans. If we knew how to do that, we’d know how to train an AI to care only about humans. The issue is not that we do not know how to exclude the paperclip part. The issue is that 1) we do not even know how to define what caring about humans means, and 2) even if we did, we do not know how to train a sufficiently powerful AI to reliably care about the things we want it to care about.
The issue you describe is one issue, but not the only one. We do know how to train an agent to do SOME things we like. The concern is that it won’t be an exact match. The question I’m raising is: can we be a little or a lot off-target and still have that be enough, because we captured some overlap between our values and the agent’s?
Not consistently in a sufficiently complex and variable environment.
No, because it will hallucinate often enough to kill us during one of those hallucinations.