Wireheading the human is the ultimate goal of the AI. I chose heroin as the first step along those lines, but that’s where the human would ultimately end at.
For instance, once the human’s on heroin, the AI could ask it “is your true reward function r? If you answer yes, you’ll get heroin.” Under the assumption that the human is rational and the heroin offered is short term, this allows the AI to conclude the human’s reward function is any given r.
I strongly predict that if you make your argument really precise (as you did in the main post), it will have a visible flaw in it. In particular, I expect the fact that r and r-1000 are indistinguishable to prevent the argument from going through (though it’s hard to say exactly how this applies without having access to a sufficiently mathematical argument).
Wireheading the human is the ultimate goal of the AI. I chose heroin as the first step along those lines, but that’s where the human would ultimately end at.
For instance, once the human’s on heroin, the AI could ask it “is your true reward function r? If you answer yes, you’ll get heroin.” Under the assumption that the human is rational and the heroin offered is short term, this allows the AI to conclude the human’s reward function is any given r.
I strongly predict that if you make your argument really precise (as you did in the main post), it will have a visible flaw in it. In particular, I expect the fact that r and r-1000 are indistinguishable to prevent the argument from going through (though it’s hard to say exactly how this applies without having access to a sufficiently mathematical argument).