I am imagining a flat plain of possible normative systems (goals / preferences / inclinations / whatever), with red zones sprinkled around marking those normative systems which are dangerous. CRT (as I understand it) says that there is a basin with consequentialism at its bottom, such that there is a systematic force pushing systems towards that. I’m imagining that there’s no systematic force.
So in my view (flat plain), a good AI system is one that starts in a safe place on this plain, and then doesn’t move at all … because if you move in any direction, you could randomly step into a red area. This is why I don’t like misaligned subsystems—it’s a step in some direction, any direction, away from the top-level normative system. Then “Inner optimizers / daemons” is a special case of “misaligned subsystem”, in which the random step happened to be into a red zone. Again, CRT says (as I understand it) that a misaligned subsystem is more likely than chance to be an inner optimizer, whereas I think a misaligned subsystem can be an inner optimizer but I don’t specify the probability of that happening.
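To make the flat-plain picture concrete, here is a minimal sketch of an unbiased random walk over a grid with red cells sprinkled at random. Everything in it (the torus-shaped grid, the 5% red density, the function names) is an assumption chosen for illustration, not something taken from CRT. It only shows that, with no systematic force at all, the chance of having stepped into a red zone grows with the number of steps taken.

```python
import random

def make_red_zones(size, density, rng):
    """Sprinkle red cells uniformly at random over a size x size torus."""
    start = (size // 2, size // 2)
    red = {(x, y) for x in range(size) for y in range(size)
           if rng.random() < density}
    red.discard(start)  # the system starts in a safe place
    return red, start

def fraction_entering_red(steps, trials=2000, size=41, density=0.05, seed=0):
    """Fraction of unbiased walks that enter a red cell within `steps` moves."""
    rng = random.Random(seed)
    red, start = make_red_zones(size, density, rng)
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # no preferred direction
    entered = 0
    for _ in range(trials):
        x, y = start
        for _ in range(steps):
            dx, dy = rng.choice(moves)
            x, y = (x + dx) % size, (y + dy) % size
            if (x, y) in red:
                entered += 1
                break
    return entered / trials
```

A system that never moves (`steps=0`) never leaves its safe starting point, so the fraction is 0.0; larger `steps` values should give larger fractions.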
Leaving aside what other people have said, it’s an interesting question: are there relations between the normative system at the top level and the normative systems of its subsystems? There’s obviously good reason to expect that consequentialist systems will tend to create consequentialist subsystems, and that deontological systems will tend to create deontological subsystems, etc. I can kinda imagine cases where a top-level consequentialist would sometimes create a deontological subsystem, because it’s (I imagine) computationally simpler to execute behaviors than to seek goals, and sub-sub-...-subsystems need to be very simple. The reverse seems less likely to me. Why would a top-level deontologist spawn a consequentialist subsystem? Probably there are reasons...? Well, I’m struggling a bit to concretely imagine a deontological advanced AI...
We can ask similar questions at the top level. I think about normative system drift (with goal drift being a special case), buffeted by a system learning new things and/or reprogramming itself and/or getting bit-flips from cosmic rays etc. Is there any reason to expect the drift to systematically move in a certain direction? I don’t see any reason, other than entropy considerations (e.g. preferring systems that can be implemented in many different ways). Paul Christiano talks about a “broad basin of attraction” towards corrigibility but I don’t understand the argument, or else I don’t believe it. I feel like, once you get to a meta-enough level, there stops being any meta-normative system pushing the normative system in any particular direction.
So maybe the stronger version of not-CRT is: “there is no systematic force of any kind whatsoever on an AI’s top-level normative system, with the exceptions of (1) entropic forces, and (2) programmers literally shutting down the AI, editing the raw source code, and trying again”. I (currently) would endorse this statement. (This is also a stronger form of orthogonality, I guess.)
RE “Is there any reason to expect the drift to systematically move in a certain direction?”
For bit-flips, evolution should select among multiple systems for those that get lucky and get bit-flipped towards higher fitness, but not directly push a given system in that direction.
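The difference between selecting among systems and pushing a single system can be sketched with a toy simulation. The integer fitness score, the ±1 model of a bit-flip, and the rule that the fitter half of the population gets copied are all simplifying assumptions of this sketch.

```python
import random

def mutate(fitness, rng):
    """An unbiased bit-flip: fitness moves up or down with equal probability."""
    return fitness + rng.choice([-1, 1])

def mean(values):
    return sum(values) / len(values)

def simulate(generations=50, pop_size=200, seed=0):
    """Return (mean fitness without selection, mean fitness with selection)."""
    rng = random.Random(seed)
    unselected = [0] * pop_size  # each system just accumulates its own flips
    selected = [0] * pop_size    # same flips, but the fitter half is copied
    for _ in range(generations):
        unselected = [mutate(f, rng) for f in unselected]
        mutated = sorted((mutate(f, rng) for f in selected), reverse=True)
        survivors = mutated[: pop_size // 2]
        selected = survivors + survivors
    return mean(unselected), mean(selected)
```

In this sketch the per-system bit-flips are identical in both populations; only the selection step makes the population average climb, while the unselected systems stay centered near their starting fitness.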
For self-modification (“reprogramming itself”), I think there are a lot of arguments for CRT (e.g. the decision theory self-modification arguments), but they all seem to carry some implicit assumptions about the inner workings of the AI.
What are “decision theory self-modification arguments”? Can you explain or link?
I don’t know of any formal arguments (though that’s not to say there are none), but I’ve heard the point repeated enough times that I think I have a fairly good grasp of the underlying intuition. To wit: most departures from rationality (which is defined in the usual sense) are not stable under reflection. That is, if an agent is powerful enough to model its own reasoning process (and potential improvements to said process), by default it will tend to eliminate obviously irrational behavior (if given the opportunity).
The usual example of this is an agent running CDT. Such agents, if given the opportunity to build successor agents, will not create other CDT agents. Instead, the agents they construct will generally follow some FDT-like decision rule. This would be an instance of irrational behavior being corrected via self-modification (or via the construction of successor agents, which can be regarded as the same thing if the “successor agent” is simply the modified version of the original agent).
Of course, the above example is not without controversy, since some people still hold that CDT is, in fact, rational. (Though such people would be well-advised to consider what it might mean that CDT is unstable under reflection—if CDT agents are such that they try to get rid of themselves in favor of FDT-style agents when given a chance, that may not prove that CDT is irrational, but it’s certainly odd, and perhaps indicative of other problems.) So, with that being said, here’s a more obvious (if less realistic) example:
Suppose you have an agent that is perfectly rational in all respects, except that it is hardcoded to believe 51 is prime. (That is, its prior assigns a probability of 1 to the statement “51 is prime”, making it incapable of ever updating against this proposition.) If this agent is given the opportunity to build a successor agent, the successor agent it builds will not be likewise certain of 51’s primality. (This is, of course, because the original agent is not incentivized to ensure that its successor believes that 51 is prime. However, even if it were so incentivized, it still would not see a need to build this belief into the successor’s prior, the way said belief is built into its own prior. After all, the original agent actually does believe 51 is prime; and so from its perspective, the primality of 51 is simply a fact that any sufficiently intelligent agent ought to be able to establish, without any need for hardcoding.)
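The claim that a prior of 1 can never be updated away follows directly from Bayes' rule, and a few lines of arithmetic show it. This is only a sketch: the likelihood values and the 0.99 comparison prior are hypothetical numbers picked for illustration.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Posterior P(H | E) from Bayes' rule for a binary hypothesis H."""
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence

def repeated_updates(prior, rounds):
    """Apply the same piece of contrary evidence `rounds` times."""
    belief = prior
    for _ in range(rounds):
        # Hypothetical evidence that is 1000x likelier if "51 is prime" is false.
        belief = bayes_update(belief, likelihood_if_true=0.001,
                              likelihood_if_false=1.0)
    return belief

hardcoded = repeated_updates(1.0, rounds=5)     # prior of exactly 1
open_minded = repeated_updates(0.99, rounds=5)  # very confident, but not certain
```

With a prior of exactly 1, the `(1 - prior)` term vanishes and the posterior stays at 1 no matter how strong the contrary evidence is, whereas the 0.99 prior collapses toward 0 after a few rounds.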
I’ve now given two examples of irrational behavior being corrected out of existence via self-modification. The first example, CDT, could be termed an example of instrumentally irrational behavior; that is, the irrational part of the agent is the rule it uses to make decisions. The second example, conversely, is not an instance of instrumental irrationality, but rather epistemic irrationality: the agent is certain, a priori, that a particular (false) statement of mathematics is actually true. But there is a third type of “irrationality” that self-modification is prone to destroying, which is not (strictly speaking) a form of irrationality at all: “irrational” preferences.
Yes, not even preferences are safe from self-modification! Intuitively, it might be obvious that departures from instrumental and epistemic rationality will tend to be corrected; but it doesn’t seem obvious at all that preferences should be subject to the same kind of “correction” (since, after all, preferences can’t be labeled “irrational”). And yet, consider the following agent: a paperclip maximizer that has had a very simple change made to its utility function, such that it assigns a utility of −10,000 to any future in which the sequence “a29cb1b0eddb9cb5e06160fdec195e1612e837be21c46dfc13d2a452552f00d0” is printed onto a piece of paper. (This is the SHA-256 hash for the phrase “this is a stupid hypothetical example”.) Such an agent, when considering how to build a successor agent, may reason in the following manner:
It is extraordinarily improbable that this sequence of characters will be printed by chance. In fact, the only plausible reason for such a thing to happen is if some other intelligence, upon inspecting my utility function, notices the presence of this odd utility assignment and subsequently attempts to exploit it by threatening to create just such a piece of paper. Therefore, if I eliminate this part of my utility function, that removes any incentives for potential adversaries to create such a piece of paper, which in turn means such a paper will almost surely never come into existence.
And thus, this “irrational” preference would be deleted from the utility function of the modified successor agent. So, as this toy example illustrates, not even preferences are guaranteed to be stable under reflection.
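The agent's reasoning amounts to a small expected-utility comparison, evaluated with its current utility function. The sketch below assumes made-up probabilities for the string being printed and a made-up baseline of 1000 units of paperclip utility; only the −10,000 penalty comes from the example.

```python
PENALTY = -10_000  # utility assigned to any future where the string gets printed

def expected_utility(paperclip_utility, p_printed):
    """Score a candidate successor using the ORIGINAL agent's utility function."""
    return paperclip_utility + p_printed * PENALTY

def choose_successor(p_printed_if_kept, p_printed_if_removed,
                     paperclip_utility=1000.0):
    """Return the preferred successor design along with both scores."""
    keep = expected_utility(paperclip_utility, p_printed_if_kept)
    remove = expected_utility(paperclip_utility, p_printed_if_removed)
    return ("remove" if remove > keep else "keep"), keep, remove

# Hypothetical probabilities: an adversary who can read the utility function
# has a reason to print the string only if the clause is still there.
choice, keep, remove = choose_successor(p_printed_if_kept=0.01,
                                        p_printed_if_removed=1e-12)
```

The point of the sketch is that the comparison is made from the standpoint of the agent that still has the odd preference: the successor without the clause scores higher because the penalized outcome becomes less likely, not because the penalty is ignored.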
It is notable, of course, that all three of the agents I just described are in some sense “almost rational”—that is, these agents are more or less fully rational agents, with a tiny bit of irrationality “grafted on” by hypothesis. This is in part due to convenience; such agents are, after all, very easy to analyze. But it also leaves open the possibility that less obviously rational agents, whose behavior isn’t easily fit into the framework of rationality at all—such as, for example, humans—will not be subject to this kind of issue.
Still, I think these three examples are, if not conclusive, then at the very least suggestive. They suggest that the tendency to eliminate certain kinds of behavior does exist in at least some types of agents, and perhaps in most. Empirically, at least, humans do seem to gravitate toward expected utility maximization as a framework; there is a reason economists tend to assume rational behavior in their proofs and models, and have done so for centuries, whereas the notion of intentionally introducing certain kinds of irrational behavior has shown up only recently. And I don’t think it’s a coincidence that the first people who approached the AI alignment problem started from the assumption that the AI would be an expected utility maximizer. Perhaps humans, too, are subject to the “convergent rationality thesis”, and the only reason we haven’t built our “successor agent” yet is because we don’t know how to do so. (If so, then thank goodness for that!)
Thanks, that was really helpful!! OK, so going back to my claim above: “there is no systematic force of any kind whatsoever on an AI’s top-level normative system”. So far I have six exceptions to this:
0. If an agent has a “real-world goal” (utility function on future-world-states), we should expect increasingly rational goal-seeking behavior, including discovering and erasing hardcoded irrational behavior (with respect to that goal), as described by dxu. But I’m not counting this as an exception to my claim because the goal is staying the same.
1. If an agent has a set of mutually-inconsistent goals / preferences / inclinations, it may move around within the convex hull (so to speak) of these goals / preferences / inclinations, as they compete against each other. (This happens in humans.) And then, if there is at least one preference in that set which is a “real-world goal”, it’s possible (though not guaranteed) that that preference will come out on top, leading to (0) above. And maybe there’s a “systematic force” pushing in some direction within this convex hull; i.e., it’s possible that, when incompatible preferences are competing against each other, some types are inherently likelier to win the competition than other types. I don’t know which ones that would be.
2. In the (presumably unusual) case that an agent has a “self-defeating preference” (i.e. a preference which is likelier to be satisfied by the agent not having that preference, as in dxu’s awesome SHA example), we should expect the agent to erase that preference.
3. As capybaralet notes, if there is evolution among self-reproducing AIs (god help us all), we can expect the population average to move towards goals promoting evolutionary fitness.
4. Insofar as there is randomness in how agents change over time, we should expect a systematic force pushing towards “high-entropy” goals / preferences / inclinations (i.e., ones that can be implemented in lots of different ways).
5. Insofar as the AI is programming its successors, we should expect a systematic force pushing towards goals / preferences / inclinations that are easy to program & debug & reason about.
6. The human programmers can shut down the AI and edit the raw source code.
Agree or disagree? Did I miss any?