In your example for bias, the agent only has an incentive to manipulate humans because it's treating the human's word as truth rather than as evidence. For example, an AI that relies on button presses to learn about human morality will try to press its own buttons if it thinks the button presses are identical to morality, but will not do so if it has a causal model of the world in which morality is only one of several possible causes of button presses.
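To make that distinction concrete, here's a toy sketch (purely illustrative: the button setup, the probabilities, and the names are all made up, not anything from your post). A "naive" agent scores actions by how likely the approval button is to get pressed, while a "causal" agent scores them by how likely they are to match a latent morality variable M, with presses serving only as evidence about M. Only the naive agent prefers pressing its own button:

```python
# Toy setup: a latent "morality" bit M in {0, 1} that the agent cannot observe
# directly. A human presses an approval button with probability 0.9 when the
# agent's act matches M and 0.1 otherwise. The agent can also wirehead: press
# the button itself, which makes a press certain but carries no information
# about M. All names and numbers are hypothetical, chosen only to illustrate
# the causal-model point.

P_PRESS_GIVEN_MATCH = 0.9
P_PRESS_GIVEN_MISMATCH = 0.1

def expected_score_naive(action, prior_m1=0.5):
    """Agent that treats button presses as identical to morality:
    its score is just the probability that the button gets pressed."""
    if action == "press_own_button":
        return 1.0  # wireheading guarantees a press
    p_match = prior_m1 if action == "act_as_if_m1" else 1 - prior_m1
    return p_match * P_PRESS_GIVEN_MATCH + (1 - p_match) * P_PRESS_GIVEN_MISMATCH

def expected_score_causal(action, prior_m1=0.5):
    """Agent with a causal model: presses are only evidence about M, so its
    score is the probability that the action actually matches M. Pressing its
    own button doesn't change M and earns nothing."""
    if action == "press_own_button":
        return 0.0  # wireheading can't make the act match M
    return prior_m1 if action == "act_as_if_m1" else 1 - prior_m1

for action in ["act_as_if_m1", "act_as_if_m0", "press_own_button"]:
    print(f"{action:18s} naive={expected_score_naive(action):.2f} "
          f"causal={expected_score_causal(action):.2f}")
```

The point is just that the manipulation incentive lives entirely in the naive scoring rule, not in the environment.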
So a fully probabilistic value learner as in Dewey 2011 doesn’t have this manipulativeness—the trouble is just that we don’t know how to write down the perfect probabilistic model of the world that such a value learner needs in order to work. Hm, I wonder if there’s a way to solve this problem with lots of data and stochastic gradient descent.
(EDIT: For toy problems, you might try to learn correct moral updating from examples of correct moral updates, but the data would be hard to generate for the real world, and the space to search would be huge. It seems to me that an AI couldn't start out ignorant, learn how to learn about morality as it explored the world, and then explore the world.)
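For what it's worth, here is what a single step of the kind of probabilistic moral updating a Dewey-style value learner needs might look like in the same toy setup (again a sketch under made-up likelihoods, not Dewey's actual formalism): the agent keeps a posterior over the latent morality bit M and treats each button press as noisy evidence about it.

```python
# A minimal Bayesian-update sketch in the spirit of a probabilistic value
# learner: maintain a posterior over the latent morality bit M and update it
# on button presses, which are treated as noisy evidence about M rather than
# as the utility itself. Likelihoods are the same hypothetical numbers as in
# the sketch above.

P_PRESS_GIVEN_MATCH = 0.9
P_PRESS_GIVEN_MISMATCH = 0.1

def update_posterior(prior_m1, pressed, acted_as_if_m1):
    """One step of Bayes' rule over M given a press / no-press observation."""
    def likelihood(m_is_1):
        match = (m_is_1 == acted_as_if_m1)
        p = P_PRESS_GIVEN_MATCH if match else P_PRESS_GIVEN_MISMATCH
        return p if pressed else 1.0 - p
    joint_m1 = prior_m1 * likelihood(True)
    joint_m0 = (1.0 - prior_m1) * likelihood(False)
    return joint_m1 / (joint_m1 + joint_m0)

belief = 0.5                               # uniform prior over M
for pressed in [True, True, False, True]:  # hypothetical observation sequence
    belief = update_posterior(belief, pressed, acted_as_if_m1=True)
    print(f"P(M=1) = {belief:.3f}")
```

The hard part you're pointing at is that in the real world nobody knows how to write down the analogue of likelihood(), i.e. the model relating observations of humans to the underlying values; the toy version only works because the likelihoods were stipulated.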