It seems to me a rational agent should never change its self-consistent terminal values. To act out that change would be to act according to some other value and not the terminal values in question. You’d have to say that the rational agent floats around between different sets of values, which is something that humans do, obviously, but not ideal rational agents. The claim then is that ideal rational agents have perfectly consistent values.
“But what if something happens to the agent which causes it to see that its values were wrong, should it not change them?” Cue a cascade of reasoning about which values are “really terminal.”
It seems to me a rational agent should never change its self-consistent terminal values. To act out that change would be to act according to some other value and not the terminal values in question.
It might decide to do that—if it meets another powerful agent, and it is part of the deal they strike.
It might decide to do that—if it meets another powerful agent, and it is part of the deal they strike.
Is it not part of the agent’s (terminal) value function to cooperate with agents when doing so provides benefits? Does the expected value of these benefits materialize from nowhere, or do they exist within some value function?
My claim entails that the agent’s preference ordering of world states consists mostly of instrumental values. If an agent’s valuation of paperclips is lowered in response to a stimulus, or evidence, then it never exclusively and terminally valued paperclips in the first place. If it gains evidence that paperclips are dangerous and lowers its expected value because of that, it’s because it valued safety. If a powerful agent threatens the agent with destruction unless it ceases to value paperclips, it will only comply if the expected value of the future paperclips it would have saved is lower than the value of its own existence.
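To make that comparison concrete, here is a minimal sketch; the function name and all numbers are purely hypothetical and only illustrate the decision rule described above.

```python
# Hypothetical sketch of the coercion decision described above: the threatened
# agent scores both options with its *current* value function and complies
# only if its own continued existence is worth more, by that function, than
# the future paperclips it would otherwise have secured.

def should_comply(value_of_own_existence: float,
                  expected_value_of_future_paperclips: float) -> bool:
    # Comply: stop valuing paperclips but survive.
    # Refuse: be destroyed, forfeiting both existence and future paperclips.
    return value_of_own_existence > expected_value_of_future_paperclips

print(should_comply(100.0, 250.0))  # False -- the agent refuses and accepts the risk
print(should_comply(100.0, 40.0))   # True  -- the agent complies
```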
Actually, that cuts to the heart of the confusion here. If I manually erased an AI’s source code and replaced it with that of an agent with a different value function, is it the “same” agent? Nobody cares, because agents don’t have identities, only source code. What then is the question we’re discussing?
A perfectly rational agent can indeed self-modify to have a different value function, I concede. It would self-modify according to expected values over the domain of possible agents it might become, and it would use its current (terminal) value function to make that evaluation. If the quantity of future utility (according to the original function) that is causally attributable to the agent decreases, we’d say the agent has become less powerful. The claim I’d have to prove to retain a point here would be that its new value function is not equivalent to its original function if and only if the agent becomes less powerful. I also think it is the case if and only if relevant evidence appears in the agent’s inputs that places value on self-modification for its own sake, which exists in cases analogous to coercion.
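As a rough illustration (all names, outcomes, and probabilities below are hypothetical), that self-modification choice can be sketched as an expected-value comparison in which every candidate successor is scored by the agent’s current value function:

```python
# Hypothetical sketch: a self-modifying agent chooses among candidate
# successors, but each candidate is scored by the expected value of its
# outcomes under the agent's *current* (terminal) value function, not under
# the values the successor itself would hold.

from typing import Callable, Dict

Outcome = str
ValueFunction = Callable[[Outcome], float]

def expected_current_value(outcome_probs: Dict[Outcome, float],
                           current_value: ValueFunction) -> float:
    return sum(p * current_value(o) for o, p in outcome_probs.items())

def choose_successor(candidates: Dict[str, Dict[Outcome, float]],
                     current_value: ValueFunction) -> str:
    return max(candidates,
               key=lambda name: expected_current_value(candidates[name],
                                                       current_value))

# The current function cares only about paperclips (purely illustrative).
current_value = lambda o: {"many_paperclips": 10.0,
                           "few_paperclips": 1.0,
                           "destroyed": 0.0}[o]

candidates = {
    # Keep the current values, but a coercer then destroys the agent
    # with high probability.
    "keep_current_values": {"many_paperclips": 0.05, "destroyed": 0.95},
    # Adopt the coercer's values: survive, but make far fewer paperclips.
    "adopt_coercer_values": {"few_paperclips": 1.0},
}

print(choose_successor(candidates, current_value))  # adopt_coercer_values
```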
I’m unsure at this point. My vaguely stated impression was originally that terminal values would never change in a rational agent unless it “had to,” but that may encompass more relevant cases than I originally imagined. Here might be the time to coin the phrase “terminal value drift,” where each change in response to the impact of the real world is made according to the value function held at the time, but down the road the agent (identified as the “same” agent, only modified) is substantively different. Perfect rational agents are neither omniscient nor omnipotent, or else they might never have to react to the world at all.
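If it helps to picture that process, here is a rough toy simulation of the kind of drift meant; the weights, perturbation sizes, and acceptance rule are entirely hypothetical.

```python
# A rough, entirely hypothetical sketch of "terminal value drift": at every
# step the agent evaluates a slightly perturbed successor value function using
# the function it holds at that moment, accepting the change only when a small
# compensating benefit (a deal struck, damage avoided, etc.) makes it at least
# break-even by those current lights. No single step is objectionable to the
# then-current values, yet the final weights typically end up noticeably
# different from the original ones.

import random

random.seed(1)

def closeness(current, successor):
    # How highly the current function rates a successor's priorities:
    # zero for an identical successor, more negative as priorities diverge.
    return -sum((current[k] - successor[k]) ** 2 for k in current)

weights = {"paperclips": 1.0, "safety": 0.0}
original = dict(weights)

for _ in range(500):
    candidate = {k: v + random.gauss(0.0, 0.05) for k, v in weights.items()}
    compensation = random.uniform(0.0, 0.01)  # what the world offers in return
    if closeness(weights, candidate) + compensation >= closeness(weights, weights):
        weights = candidate  # approved by the value function held *right now*

print({k: round(weights[k] - original[k], 2) for k in original})
```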
It seems to me a rational agent should never change its self-consistent terminal values. To act out that change would be to act according to some other value and not the terminal values in question.
Only a static rational agent, one that is unchanging and unchangeable. In other words, a dead one.
All things change. In particular, with the passage of time both the agent himself and the world around him change. I see absolutely no reason why the terminal values of a rational agent should be an exception to the universal process of change.
Why wouldn’t you expect terminal values to change? Does your agent have some motivation (which leads it to choose to change) other than its terminal values? Or is it choosing to change its terminal values in pursuit of those values? Or are the terminal values changing involuntarily?
In the first case, the things doing the changing are not the real terminal values.
In the second case, that doesn’t seem to make sense.
In the third case, what we’re discussing is no longer a perfect rational agent.
Example of somebody making that claim.
That’s a ‘circular’ link to your own comment.
It was totally really hard; I had to use a quine.
What exactly do you mean by “perfect rational agent”? Does such a creature exist in reality?