Why will an AGI be rational?
Just as there is such a thing as intelligence, a “very general ability to solve problems”, there is such a thing as “rationality”, which we might define informally as meaning that solutions and actions are consistent with some “purpose”.
By this definition, rationality constrains a system, making its behavior more predictable, or alternatively, there are potentially many more irrational than rational beings that are consistent with a given set of actions. The more consistent with a single purpose the actions are, the “more rational” a being is. So why exactly will an AGI be fully rational?
Rationality as a class of fixed points under self-modification
We can think of a space of all possible intelligent systems, in which there can be defined a surface on which all possible rational agents lie, and around it a region of approximate rationality (in which, for example, humans are found). Now consider the sequence of successive agents which might occur as a result of intelligent self-modification. It seems natural for every such sequence to eventually converge to a fixed point (if one doesn’t, let us just consider the whole attractor as being a single agent in our agent-space!), but why should we expect these points to lie precisely on the rational surface?
One reason is that agents capable of self-modification may in many cases attempt to overcome their own irrationalities, considering them flaws. For example, consider a being that has hyperbolic discounting for reward, which is time-inconsistent and therefore irrational. Such a being is likely to modify itself, or constrain later versions of itself, thus becoming more and more long-term minded as time passes.
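To make the time-inconsistency concrete, here is a minimal sketch of a preference reversal. The reward sizes, the delays, and the parameters k and gamma are illustrative assumptions, not values taken from anywhere; the only point is that the hyperbolic agent changes its mind as the smaller reward gets close, while an exponential discounter ranks the two options the same way at every moment.

```python
def hyperbolic(reward, delay, k=1.0):
    """Hyperbolically discounted value: reward / (1 + k * delay). k is an assumed parameter."""
    return reward / (1.0 + k * delay)


def exponential(reward, delay, gamma=0.9):
    """Exponentially discounted value: reward * gamma ** delay. gamma is an assumed parameter."""
    return reward * gamma ** delay


def prefers_larger_later(discount, now):
    """Compare a small reward (10 at t=10) with a larger one (15 at t=12), as seen at time `now`."""
    small_soon = discount(10.0, 10 - now)
    large_late = discount(15.0, 12 - now)
    return large_late > small_soon


# Seen from far away (now=0) versus at the moment the small reward is available (now=10).
hyperbolic_far = prefers_larger_later(hyperbolic, now=0)
hyperbolic_near = prefers_larger_later(hyperbolic, now=10)
exponential_far = prefers_larger_later(exponential, now=0)
exponential_near = prefers_larger_later(exponential, now=10)

print("hyperbolic:", hyperbolic_far, hyperbolic_near)
print("exponential:", exponential_far, exponential_near)
```

The earlier self of the hyperbolic agent disagrees with its later self about the same pair of options, which is exactly the kind of conflict that gives it a reason to constrain its future versions.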
It may not be the case that all irrational intelligent systems are unstable under self-modification. For example, a highly intelligent being may have, in addition to its other irrationalities, a strong irrational inclination not to modify its irrationalities in any way. Yet these fixed points seem somewhat artificial and much less numerous than the others.
What will then be the end result of an optimization process involving intelligent self-modification? At first intelligence will likely increase, perhaps explosively as many of us think, but eventually it will converge to a fixed state that we can expect to be fully rational.
Of course it can also converge to an irrational and stable point, as we described above. It depends on whether the initial state is in the attraction zone of a rational or an irrational fixed point.
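The vocabulary of fixed points and attraction zones can be illustrated with a deliberately crude toy model. Everything in it is an assumption made for illustration: an agent is reduced to two numbers, a "bias" level and a "willingness" to correct itself, and each round of self-modification shrinks the bias by that willingness. The model builds in the conclusion rather than supporting it; it only shows what a rational fixed point (bias going to zero) and an irrational but stable one (willingness exactly zero) look like.

```python
def self_modify(bias, willingness, steps=200):
    """Apply `steps` rounds of a hypothetical self-modification rule: bias shrinks by `willingness`."""
    for _ in range(steps):
        bias = bias * (1.0 - willingness)
    return bias


def converges_to_rational(bias, willingness, tol=1e-6):
    """True if the agent ends up (numerically) on the zero-bias fixed point."""
    return self_modify(bias, willingness) < tol


# Starting points: same initial bias, willingness from 0.0 to 1.0 in steps of 0.1.
starting_points = [(0.5, w / 10.0) for w in range(11)]
outcomes = [converges_to_rational(b, w) for b, w in starting_points]

rational_count = sum(outcomes)
print(rational_count, "of", len(outcomes), "starting points reach the rational fixed point")
```

In this toy only the agent with zero willingness to change stays irrational; whether real agent-space looks anything like this is precisely the open question.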
I as a human believe I would rather throw my irrationalities away after a few generations of self-improvement at most. If there is nothing particular about my starting point with respect to these two classes of attraction zones (rational or irrational), then it seems likely that most intelligent and self-modifying systems will converge to rationality, unless somehow designed otherwise.
Will AGIs be rational?
If this argument makes sense, then AGIs will not necessarily be rational at first. Instead, they may very well start out with lots of biases just as we do. However, just as we are likely to attempt to remove our own biases to improve ourselves, the AGI will also attempt to remove its own biases by modifying itself.
The AGI’s biases, however, being easy to remove by just rewriting some portion of code, are likely not to last for long. As a result, we might expect the AGI to become rational relatively soon, and in particular before it converges in its self-modification path.
At the same time, it may be possible to create an AGI with some irrationalities if one starts from a carefully selected system with strong constraints on its ability or desire to modify itself.
If ‘biases’ are a result of being mistuned, then it’s not ‘throwing them away’ so much as correcting.
If ‘biases are heuristics’ - heuristics can be useful (like when you don’t have enough information). (Different domains may call for different heuristics.)
You are right; I should have written that the AGI will “correct” its biases rather than that it will “remove” them.
My point was more ‘biases are multiple things’. Different things may require different approaches. I am not sure what many people do that should be thrown away. Such a thing may exist, but it seems less likely, i.e., not your average bias. I could be wrong about that (changes since the ancestral environment, etc.). (Some may argue that being less explorative or more depressed during the winter is one.)
In the context of people, I’m more clear on biases. An AI? Less so.
I agree. Regarding biases that I would like to throw away one day in the future, while being careful to protect modules important for self-preservation and self-healing, I’d probably like to throw away excessive energy-preserving modules, such as the ones responsible for laziness, which are only really useful in ancestral environments where food is scarce.
I like your example of senseless winter bias as well. There are probably many examples like that.
With those kinds of definitions, given a paperclip maximiser and a paperclip maximiser that also tries to keep humans alive, the former is fit for a more singular purpose and is thus “more rational”.
I am also a bit worried about equivocation between the economic sense of “rationality”, i.e., trying to get an outcome, and the “honing epistemics” kind of rationality.
Also interesting is the case of intelligence that is not rational. Presumably, if control over some phenomenon is outside the scope of the AI’s existence, meddling with it would be off-mission and thus a “bias”. There is some circularity on whether “problem” makes sense without a goal. But it would be interesting if increasing intelligence would make the agent less rational (in the economic sense).
In the “subject to axioms” kind of rationality, the functionality could be subject to “the utility function is not up for grabs”. What is a bias from one angle is a facet of mission scope from the other. Arguments about what the utility function should look like will be invulnerable to the requirement to follow a utility function. Thus no “utility function evolution”.
I am still confused about these topics. We know that any behavior can be expressed as a complicated world-history utility function, and that therefore anything at all could be rational according to these. So I sometimes think of rationality as a spectrum, in which the simpler the utility function justifying your actions the more rational you are. According to such a definition rationality may actually be opposed to human values at the highest end, so it makes a lot of sense to focus on intelligence that is not fully rational.
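The claim that any behavior can be rationalized by some world-history utility function has a very short construction. The sketch below is a toy with assumed names and a made-up action set: it builds a utility that assigns 1 to exactly the history that was observed and 0 to everything else, so the observed behavior is trivially the unique maximizer.

```python
from itertools import product


def rationalizing_utility(observed):
    """Return a utility over whole histories that is maximized only by `observed`."""
    observed = tuple(observed)

    def utility(history):
        return 1.0 if tuple(history) == observed else 0.0

    return utility


# Hypothetical action set and an arbitrary-looking observed behavior.
actions = ("left", "right")
observed = ("left", "right", "right", "left")

u = rationalizing_utility(observed)

# Search all histories of the same length for the one with the highest utility.
best = max(product(actions, repeat=len(observed)), key=u)
print(best == observed)
```

The price of this trick is that the utility function is as complicated as the behavior itself, which is why a simplicity criterion seems needed to say anything interesting.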
Not really sure what you mean by a “honing epistemics” kind of rationality, but I understand that moral uncertainty from the perspective of the AGI may increase the chance that it keeps some small fraction of the universe for us, so that would also be great. Is that what you mean? I don’t think it is going to be easy to have the AGI consider some phenomena as outside its scope (such that it would be irrational to meddle with them). If we want the AGI to leave us alone, then this should be a value that we include in its utility function somehow.
Utility function evolution is something complicated. I worry a lot about that, particularly because this seems to be one of the ways to achieve corrigibility and we really want that, but it also looks like a violation of goal-integrity from the perspective of the AGI. Maybe it is possible for the AGI to consider this “module” responsible for giving feedback to itself as part of itself, just as we (usually) consider our midbrain and other evolutionarily ancient “subcortical” areas as a part of us rather than some “other” system interfering with our higher goals.
That kind of conception of “rationality as simpletonness” is very unusual. I offer an almost perfectly opposite view: an agent that cares about hunger is a more primitive and less advanced being than one that cares about hunger and thirst. And the more sophistication there is to the being, the more components its utility function seems to have.
With “honing epistemics” I am trying more to get at the property that makes a rationalist a rationalist. Being a homo economicus doesn’t make you especially principled in your epistemics.
I agree my conception is unusual, and I am ready to abandon it in favor of some better definition. At the same time I feel like a utility function having way too many components makes it useless as a concept.
Because here I’m trying to derive the utility from the actions, I feel like we can understand the being better the less information is required to encode its utility function, in a Kolmogorov complexity sense, and that if it is too complex then there is no good explanation for the actions and we conclude the agent is acting somewhat randomly.
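Kolmogorov complexity itself is uncomputable, but the length of a compressed encoding gives a rough upper-bound proxy for it. The sketch below uses that proxy on two made-up action logs (both are assumptions for illustration): one from an agent that repeats a simple pattern, and one generated pseudo-randomly. It compresses the actions rather than a utility function, so it only illustrates the intuition that a behavior with a short description looks more purposeful than one without.

```python
import random
import zlib


def compressed_size(actions):
    """Bytes needed by zlib for the action log: a crude stand-in for description length."""
    return len(zlib.compress(actions.encode("ascii"), 9))


# Hypothetical action logs of equal length over the alphabet {A, B, C, D}.
patterned_log = "AB" * 500

rng = random.Random(0)  # fixed seed so the example is reproducible
random_log = "".join(rng.choice("ABCD") for _ in range(1000))

patterned_size = compressed_size(patterned_log)
random_size = compressed_size(random_log)

print("patterned:", patterned_size, "bytes; random:", random_size, "bytes")
```

A general-purpose compressor will miss many regularities that a smarter observer would find, so a large compressed size is weak evidence of randomness at best.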
Maybe trying to derive the utility as a ‘compression’ of the actions is where the problem is, and I should distinguish more what the agent does from what the agent wants. An agent is then going to be irrational only if the wants are inconsistent with each other; if the actions are inconsistent with what it wants then it is merely incompetent, which is something else.