Such “value-seeking” behavior doesn’t seem irrational to me, even though I don’t know how to account for it in terms of rationality.
I would say it is part of checking for reflective consistency. Ideally, there shouldn’t be arguments that change your (terminal) values, so if there are, you want to know about them so you can figure out what is wrong and how to fix it.
I don’t think that explanation makes sense. Suppose an AI thinks it might have a security hole in its network stack, so that if someone sends it a certain packet, it would become that person’s slave. It would try to fix that security hole, without actually seeking to have such a packet sent to itself.
We humans know that there are arguments out there that can change our values, but instead of hardening our minds against them, some of us actually try to have such arguments sent to us.
In the deontological view of values this is puzzling, but in the consequentialist view it isn’t: we welcome arguments that can change our instrumental values, but not our terminal values (A.K.A. happiness/pleasure/eudaimonia/etc.). In fact I contend that it doesn’t even make sense to talk about changing our terminal values.
It is indeed a puzzling phenomenon. My explanation is that the human mind is something like a coalition of different sub-agents, many of which are more like animals or insects than rational agents. In any given context, they will pull the overall strategy in different directions. The overall result is an agent with context-dependent preferences, i.e. irrational behavior. Many people just live with this.
Some people, however, try to develop a “life philosophy” that shapes the disparate urges of the different mental subcomponents into an overall strategy that reflects a consistent overall policy.
A moral “argument” might be a hypothetical that attempts to put your mind into a new configuration of the relative power of its sub-agents, so that you can re-assess the overall deal.
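As a rough illustration of this “coalition” picture, here is a toy sketch (my own construction; the sub-agents, options, contexts, and weights are all invented for the example) in which each sub-agent has fixed preferences, but its influence depends on the context, so the aggregate agent’s ranking of the very same options flips from one context to another:

```python
# Toy model: a "mind" as a weighted coalition of sub-agents. Each sub-agent
# has fixed utilities over the options; the context only changes how much
# influence each sub-agent gets. The aggregate choice can then flip between
# contexts even though no individual sub-agent changed its mind.

SUBAGENT_UTILITIES = {
    # utility each hypothetical sub-agent assigns to each option
    "hunger":   {"work_late": 0.1, "go_home": 0.9},
    "ambition": {"work_late": 0.9, "go_home": 0.2},
    "comfort":  {"work_late": 0.2, "go_home": 0.8},
}

CONTEXT_WEIGHTS = {
    # how strongly each sub-agent is activated in a given context
    "deadline_tomorrow": {"hunger": 0.2, "ambition": 1.0, "comfort": 0.3},
    "friday_evening":    {"hunger": 0.9, "ambition": 0.3, "comfort": 1.0},
}

def coalition_choice(context: str) -> str:
    """Pick the option with the highest context-weighted sum of sub-agent utilities."""
    weights = CONTEXT_WEIGHTS[context]

    def aggregate(option: str) -> float:
        return sum(weights[name] * utils[option]
                   for name, utils in SUBAGENT_UTILITIES.items())

    return max(["work_late", "go_home"], key=aggregate)

if __name__ == "__main__":
    for context in CONTEXT_WEIGHTS:
        print(context, "->", coalition_choice(context))
    # Prints work_late in one context and go_home in the other: the same set
    # of options is ranked differently depending on which sub-agents dominate.
```

Nothing about the individual sub-agents changes between the two runs; only their relative weight does, which is why the overall behavior looks inconsistent when you try to describe it as a single agent with one preference ordering over the options.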
Congratulations, you just reinvented [a portion of] PCT. ;-)
[Clarification: PCT models the mind as a massive array of simple control circuits that act to correct errors in isolated perceptions, with consciousness acting as a conflict-resolver to manage things when two controllers send conflicting commands to the same sub-controller. At a fairly high level, a controller might be responsible for a complex value: like correcting hits to self-esteem, or compensating for failings in one’s aesthetic appreciation of one’s work. Such high-level controllers would thus appear somewhat anthropomorphically agent-like, despite simply being something that detects a discrepancy between a target and an actual value, and sets subgoals in an attempt to rectify the detected discrepancy. Anything that we consider of value potentially has an independent “agent” (simple controller) responsible for it in this way, but the hierarchy of control does not necessarily correspond to how we would abstractly prefer to rank our values—which is where the potential for irrationality and other failings lies.]
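To make the hierarchy a bit more concrete, here is a minimal toy sketch (my own illustration, not code from the PCT literature; the controller names, gains, and numbers are invented) of the core loop described above: a controller compares a perceived value against a reference, and a higher-level controller corrects its error not by acting directly but by adjusting the reference of a lower-level controller.

```python
# Minimal sketch of a two-level control hierarchy: each controller compares a
# perception to a reference (target) value; a higher-level controller's output
# is not an action but an adjustment to its child's reference, i.e. a subgoal.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Controller:
    name: str
    reference: float                       # target value for the controlled perception
    gain: float = 1.0                      # how strongly it responds to error
    child: Optional["Controller"] = None   # lower-level controller it re-targets

    def step(self, perception: float) -> float:
        """Compute the correction for the current error and, if there is a
        lower-level controller, push the correction down as a new subgoal."""
        error = self.reference - perception
        correction = self.gain * error
        if self.child is not None:
            self.child.reference += correction  # set a subgoal, don't act directly
        return correction

# A crude example in the spirit of the clarification above: a "self-esteem"
# controller that, when its perceived value falls below target, raises the
# reference of a lower-level "hours of rewarding work" controller.
work_hours = Controller(name="rewarding_work_hours", reference=2.0, gain=0.5)
self_esteem = Controller(name="self_esteem", reference=0.8, gain=0.3, child=work_hours)

self_esteem.step(perception=0.5)
print(work_hours.reference)  # child's target nudged up by 0.3 * (0.8 - 0.5)
```

This is only the skeleton of the error-correction idea; it leaves out the conflict-resolution part, where two higher-level controllers send incompatible references to the same sub-controller.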
It does seem that something in this region has to be correct.