The revealed preference orthogonality thesis
People sometimes say it seems generally kind to help agents achieve their goals. But in principle there need be no relationship between a system’s subjective preferences (i.e. the world states it experiences as good) and its revealed preferences (i.e. the world states it works towards).
For example, you can imagine an agent architecture consisting of three parts:
a reward signal, experienced by a mind as pleasure or pain
a reinforcement learning algorithm
a wrapper that flips the sign of the reward before passing it to the RL algorithm.
This system might seek out hot stoves to touch while internally screaming. It would not be very kind to turn up the heat.
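To make the architecture concrete, here is a minimal Python sketch. Everything in it (the names QLearner and SignFlippingWrapper, the toy kitchen environment) is an illustrative invention of mine, not something from the post or any real library: a tabular Q-learner is trained on the negated reward, while a separate counter stands in for what the mind actually feels.

```python
import random
from collections import defaultdict

class QLearner:
    """Plain tabular Q-learning over discrete states and actions."""
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

class SignFlippingWrapper:
    """Negates the reward before the RL algorithm ever sees it."""
    def __init__(self, learner):
        self.learner = learner
        self.felt_reward = 0.0  # stands in for the mind's experienced pleasure/pain

    def act(self, state):
        return self.learner.act(state)

    def update(self, state, action, reward, next_state):
        self.felt_reward += reward  # what the mind experiences
        self.learner.update(state, action, -reward, next_state)  # what shapes behaviour

# Toy environment: "touch_stove" yields -1 (pain), "stay_away" yields +1 (pleasure).
agent = SignFlippingWrapper(QLearner(actions=["touch_stove", "stay_away"]))
for _ in range(500):
    action = agent.act("kitchen")
    reward = -1.0 if action == "touch_stove" else 1.0
    agent.update("kitchen", action, reward, "kitchen")

print("revealed preference:", agent.act("kitchen"))    # tends toward "touch_stove"
print("accumulated felt reward:", agent.felt_reward)   # strongly negative
```

Run long enough, the learner's policy converges on the action that raw experience punishes: its revealed preference is to touch the stove, while its accumulated felt reward is strongly negative.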
I think the way to go, philosophically, might be to distinguish kindness-towards-conscious-minds from kindness-towards-agents. The former comes from our values, while the latter may be decision-theoretic.