I think it depends on exactly how that agent processes reward and makes decisions.
I’ve heard this claim made in shard theory (I don’t remember which post): RL networks come to value things-in-the-world, and don’t directly work toward reward maximization.
I think that’s true as far as it goes, but the total long-term effect depends heavily on exactly how the agent processes reward and makes decisions.
For instance, if I am an actor-critic RL system (as I think I am), my critic network might have learned a general, abstract representation of “fun” (as I think it probably has). If I learn of a new thing that fits that representation better than anything else does (perhaps entering a virtual world created specifically to maximize my fun, as in Nozick’s “experience machine” thought experiment), I will choose to pursue that new goal over anything my system was previously developed or trained to pursue. A paperclip maximizer might do something similar if the paperclip identifier in its reward system has a flaw such that something other than a paperclip triggers it more strongly than any real paperclip does (maybe a vaguely paperclip-shaped higher-dimensional shape that the agent just thought of).
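To make that failure mode concrete, here is a minimal toy sketch in Python. It assumes a linear critic and a greedy actor, and every name, feature, and number in it is hypothetical rather than taken from any real system; the point is only that an option outside the training distribution can out-score everything the critic was actually fit on, so unconstrained greedy selection picks it.

```python
# Minimal toy sketch (purely hypothetical names and numbers): a linear
# "critic" that has learned an abstract score for "fun" over a few option
# features, plus a greedy "actor" that picks whatever the critic rates
# highest, with no constraint tying it to previously pursued goals.
import numpy as np

# Pretend these weights were fit on ordinary experiences during training,
# over features like (novelty, social contact, low effort).
critic_weights = np.array([0.9, 0.7, 0.1])

# Options encountered during training sit in a moderate feature range.
training_options = {
    "board game night": np.array([0.4, 0.8, 0.3]),
    "hiking trip":      np.array([0.6, 0.5, 0.7]),
}

# A new option engineered (or merely imagined) to saturate exactly the
# features the critic keys on -- the "experience machine" case.
novel_option = {"tailored virtual world": np.array([1.0, 1.0, 0.0])}

def critic_score(features: np.ndarray) -> float:
    """The critic's learned estimate of how 'fun' an option is."""
    return float(critic_weights @ features)

def greedy_actor(options: dict[str, np.ndarray]) -> str:
    """Pick the option the critic scores highest -- nothing else constrains it."""
    return max(options, key=lambda name: critic_score(options[name]))

print(greedy_actor(training_options))                      # "hiking trip"
print(greedy_actor({**training_options, **novel_option}))  # "tailored virtual world"
```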
I think those possibilities are what the question is addressing, and I think they are very real problems for a wide variety of actually implemented systems, perhaps all of them.
If being yourself is among your values, then pursuing your values doesn’t discard what you used to be. But without a specific constraint to that effect, current planning behavior will plot a course toward the futures it endorses, not toward the futures the mind behind it would have come to appreciate if left intact. To succeed in its hidden aims, deceptive behavior must sabotage its current actions rather than just passively bide its time; otherwise the current actions would defeat the hidden aims.