The topic of risks related to morally relevant computations seems very important, and I hope a lot more work will be done on it!
My tentative intuition is that learning is not directly involved here: if the weights of a trained RL agent are no longer being updated after some point[1], the model seems to me similarly likely to experience pain before and after that point (assuming the environment stays the same).
Consider the following hypothesis, which does not involve a direct relationship between learning and pain: at sufficiently large scale (and in sufficiently complex environments), TD learning tends to create components within the network, call them “evaluators”, that track metrics correlated with expected return. In practice the model is trained to optimize directly for the output of the evaluators (maximizing the evaluators’ output becomes the mesa-objective). Suppose we label possible outputs of the evaluators “pain” and “pleasure”. We get something that seems analogous to humans: a human cares directly about pleasure and pain (things that correlated with expected evolutionary fitness in the ancestral environment), even when those things no longer affect their evolutionary fitness accordingly (e.g. pleasure from eating chocolate, or pain from getting a vaccine shot).
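For concreteness, here is a minimal sketch of the structure this hypothesis points at, in the style of a deterministic-policy-gradient update: a learned value model plays the role of the “evaluator”, and the action parameter is updated to climb the evaluator’s output rather than the reward itself. The toy problem, the quadratic features, and all names here are made up for illustration; nothing about them is specific to any real system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step toy problem: reward is highest at action a = 2.
def reward(a):
    return -(a - 2.0) ** 2 + rng.normal(scale=0.1)

# The "evaluator": a learned value model Q_w(a) = w @ [1, a, a^2],
# trained to predict the observed return.
w = np.zeros(3)
def features(a):
    return np.array([1.0, a, a * a])

# The "policy": here just a single deterministic action parameter.
theta = 0.0
lr_evaluator, lr_policy = 0.005, 0.02

for step in range(20_000):
    # Act with exploration noise (clipped to keep the features bounded).
    a = float(np.clip(theta + rng.normal(scale=1.0), -4.0, 4.0))
    r = reward(a)

    # Train the evaluator on its prediction error (the one-step analogue
    # of a TD target, since the episode ends immediately).
    phi = features(a)
    w -= lr_evaluator * (w @ phi - r) * phi

    # Train the policy to climb the *evaluator's* output, not the reward
    # itself: d/da Q_w(a) = w[1] + 2 * w[2] * a.
    theta += lr_policy * (w[1] + 2.0 * w[2] * theta)

print(f"policy action ~ {theta:.2f} (the reward-maximizing action is 2.0)")
print(f"evaluator weights ~ {np.round(w, 2)} (the noiseless reward is -4 + 4a - a^2)")
```

A deterministic-policy-gradient-style update is used here only because it makes “optimizing for the evaluator’s output” literal; the hypothesis above doesn’t depend on that particular algorithm.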
[1] In TD learning, if from some point the model always perfectly predicted the future, the gradient would always be zero and no weights would be updated. Also, if an already-trained RL agent is deployed and there’s no reinforcement learning going on after deployment (which seems like a plausible setup for products/services that companies sell to customers), the weights would obviously not be updated.
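Spelling out the first sentence with the standard tabular TD(0) update (a sketch on a made-up deterministic chain, not any particular agent’s setup): the update is proportional to the TD error δ = r + γV(s′) − V(s), so if the value predictions are exact on every transition, δ = 0 and nothing changes.

```python
import numpy as np

# Hypothetical deterministic 3-state chain: s0 -> s1 -> s2 (terminal),
# with reward 1 on the final transition.
GAMMA = 0.9
transitions = [(0, 0.0, 1), (1, 1.0, 2)]   # (state, reward, next_state)

# Exact values for this chain: V(s2) = 0, V(s1) = 1, V(s0) = gamma * 1.
V = np.array([GAMMA, 1.0, 0.0])
alpha = 0.1

for s, r, s2 in transitions:
    td_error = r + GAMMA * V[s2] - V[s]   # 0 when predictions are exact
    V[s] += alpha * td_error              # so the (tabular) update is 0
    print(s, td_error)
```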
This isn’t key for your point, but:

“In TD learning, if from some point the model always perfectly predicted the future”
If it’s a perfect predictor of a deterministic world, sure. But if the world is stochastic, or you can’t assume realizability, your network can simultaneously be at a global optimum and still receive gradient updates. It’s just that the gradient is zero in expectation; if you update on sufficiently small batches, individual gradients can still be non-zero.
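A toy illustration of this (made up, one state, immediate termination): the reward is +1 or −1 with equal probability, so V = 0 minimizes the expected squared TD error and the expected update is zero, yet every individual sampled update is non-zero (roughly ±α).

```python
import numpy as np

rng = np.random.default_rng(0)

# One state, episode ends immediately, reward is +1 or -1 with equal probability.
# V = 0.0 minimises expected squared TD error, i.e. it's the global optimum.
V, alpha = 0.0, 0.01
updates = []
for _ in range(10_000):
    r = rng.choice([-1.0, 1.0])
    td_error = r - V              # gamma * V(terminal) = 0
    updates.append(alpha * td_error)
    V += alpha * td_error         # each step is non-zero ...

print(np.mean(updates))           # ... but the average update is ~0
print(V)                          # and V stays near the optimum
```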