Infra-Bayesianism naturally leads to the monotonicity principle, and I think this is a problem
Introduction
The monotonicity principle is a famously uncomfortable consequence of Infra-Bayesian Physicalism: an IBP agent can only act as if its utility function never assigned negative value to any event. This strongly contradicts the intuition that creating suffering people is actively bad.
In this post, I explain in layman’s terms how IBP leads to this conclusion, and I argue that this feature is not unique to Physicalism: given certain reasonable extra assumptions, the monotonicity principle follows naturally from infra-Bayesianism itself. In my opinion, this points to a significant flaw in the applicability of infra-Bayesianism.
A very simplified overview of Infra-Bayesianism
An infra-Bayesian agent assumes that the world is controlled by a malevolent deity, Murphy, who tries to minimize the agent’s utility.[1] However, Murphy is constrained by some laws. The agent has some hypotheses about what these laws might be. As time goes on, the agent learns that some things don’t go maximally badly for it, which must be because some law constrains Murphy in that regard. The agent slowly learns about the laws in this way, then acts in a way that maximizes its utility under these assumptions. I explain this in more detail and try to give more intuition for the motivations behind it here, and in even more detail here.
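To make the decision rule concrete, here is a minimal toy sketch of the worst-case (maximin) choice described above. It ignores essentially all of the real machinery of infra-Bayesianism (infra-distributions, updates, learning guarantees); representing the learned laws as a plain set of still-allowed environments, and all the function names, are my own illustrative assumptions.

```python
# A toy sketch of the maximin rule described above, not the actual formalism.
# The laws the agent has learned so far are represented naively as the set of
# environments Murphy is still allowed to pick from.

def infra_bayesian_choice(policies, allowed_environments, utility):
    """Pick the policy with the best worst-case utility."""
    def worst_case(policy):
        # Murphy's move: among everything the laws still allow, pick the worst.
        return min(utility(policy, env) for env in allowed_environments)

    # The agent's move: maximize utility under Murphy's worst-case choice.
    return max(policies, key=worst_case)
```

As the agent observes that some things don’t go maximally badly, environments get removed from the allowed set, and the worst case it plans against improves.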
A moral philosophy assumption
Imagine a perfect simulation of a person being tortured. Presumably, running this simulation is bad. Is running the exact same simulation on two computers twice as bad as running it only once? My intuition is that no, there doesn’t seem to be that much of a difference between running the program once or twice. After all, what if we run it on only one computer, but the computer double-checks every computation step? Then we are basically running the program in two parallel instances. Is this twice as bad as not double-checking the steps? And what if we run the program on a computer with wider wires?
I also feel that this being a simulation doesn’t change much. If a person is tortured, and a perfect clone of him with the exact same memories and thoughts is tortured in the exact same way, that’s not twice as bad. And the difference between n clones and n+1 clones being tortured is definitely not the same as the difference between the torture happening once and not happening at all.
In fact, the only assumption we need is sublinearity.
Assumption: The badness of torturing n perfect clones of a person in the exact same way grows sublinearly with n.
This can be a simple indicator function (after the first person is tortured, torturing the clones doesn’t make it any worse, as suggested by the simulation analogy), or it can be a compromise position where the badness grows with, let’s say, the logarithm or square root of n.
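To state the assumption in symbols (my notation, not anything from the formalism): write $B(n)$ for the badness of torturing $n$ identical copies. Sublinearity means $B(n)/n \to 0$, and the options above correspond to, for example,

$$B(n) = B(1) \;\;\text{(indicator)}, \qquad B(n) = B(1)\log_2(n+1), \qquad B(n) = B(1)\sqrt{n}.$$

What matters for the argument below is only that the marginal badness $B(N+1) - B(N)$ becomes negligible once $N$ is already astronomically large.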
I think this is a reasonable assumption in moral philosophy that I personally hold, that most people I asked agreed with, and that Vanessa herself also strongly subscribes to: it’s an integral assumption of infra-Bayesian Physicalism that the exact same experience happening twice is no different from it happening only once.
One can disagree and take an absolute utilitarian position, but I think mine is a common enough intuition that I would want a good decision theory to successfully accommodate utility functions that scale sublinearly with copying an event.
The monotonicity principle
Take an infra-Bayesian agent whose utility function is sublinear in the above-described way. The agent is offered a choice: if it creates a new person and tortures him for a hundred years, then a child gets an ice cream.
The infra-Bayesian agent assumes that everything it has no information about is maximally horrible. This means it assumes that every part of the universe it hasn’t observed yet (let’s say the inside of quarks) is filled with gazillions of copies of every possible suffering and negative-utility event imaginable.
In particular, the new person it is supposed to create and torture already has gazillions of tiny copies inside the quarks, tortured in exactly the way the agent plans to torture him. The sublinearity assumption then means that the marginal badness of torturing one more instance of this person is negligible.
On the other hand, since ice creams are good, the agent assumes that the only ice creams in the universe are the ones it knows about, and that there is no perfect copy of this particular child getting this particular ice cream. There are no gazillions of copies of that inside the quarks, as that would be a good thing, and an infra-Bayesian agent always assumes the worst.
Therefore, the positive utility of the child getting the ice cream outweighs the negligible extra negative utility of creating a person and torturing him for a hundred years. This is the monotonicity principle: the agent acts as if no event had negative value.
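Here is a toy numeric version of that trade under the square-root form of the sublinearity assumption; all of the numbers (the hidden copy count, the badness of a century of torture, the value of an ice cream) are made up purely for illustration.

```python
# Toy numbers for the torture-vs-ice-cream trade under B(n) = b * sqrt(n),
# where the worst-case assumption says N identical tortures already exist.
from math import sqrt

N = 1e30                    # hidden torture copies assumed inside the quarks
b = 1e9                     # badness of a single century of torture
ice_cream_utility = 1e-6    # tiny positive utility of one ice cream

# Marginal badness of the (N+1)-th identical torture:
# B(N+1) - B(N) is approximately b / (2 * sqrt(N)) for large N.
marginal_torture_badness = b / (2 * sqrt(N))

# The ice cream is good, so the worst case assumes no hidden copies of it,
# and its full (tiny) utility counts.
print(marginal_torture_badness)                      # ~5e-07, effectively zero
print(ice_cream_utility > marginal_torture_badness)  # True: the agent takes the deal
```

The same comparison goes through for the logarithmic or indicator versions; only a strictly linear (absolute utilitarian) badness blocks the trade.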
Vanessa also acknowledges that the monotonicity principle is a serious problem, although she sometimes suggests that we should bite this bullet, and that creating an AGI adhering to the monotonicity principle might not actually be horrible: creating suffering has the opportunity cost of not creating happiness in its place, so the AGI still wouldn’t create much suffering. I strongly oppose this kind of reasoning, as I explain here.
Infinite ethics
Okay, but doesn’t every utilitarian theory break down, or at least get really confusing, when we consider a universe that might be infinite in one way or another and might contain lots of (even infinitely many) copies of every conceivable event? Shouldn’t a big universe make us ditch the sublinearity assumption anyway and go back to absolute utilitarianism, which might not lead to that many paradoxes?
I’m still confused about all of this, and the only thing I’m confident in is that I don’t want to build a sovereign AGI that tries to maximize some kind of utility, using all kinds of philosophical assumptions. This is one of my disagreements with Vanessa’s alignment agenda; see here.
I also believe that infra-Bayesianism handles these questions even less gracefully than other decision processes would, because of its built-in asymmetry. Normally, I would assume that there might be many tortures and ice creams in our big universe, but I see no reason why there would be more copies of the torture than of the ice cream, so I still choose to avoid the torture. An infra-Bayesian agent, on the other hand, assumes that the quarks are full of torture but not of ice cream, which leads to the monotonicity principle.
This whole problem can be patched by getting rid of the sublinearity assumption and subscribing to full absolute utilitarianism (although in that case Infra-Bayesian Physicalism needs to be reworked as it heavily relies on a strong version of the sublinearity assumption), but I think that even then the existence of this problem points at a serious weakness of infra-Bayesianism.
[1] Well, it acts so as to maximize its utility in the worst-case scenario allowed by the laws. But “acting as if it assumes the worst” is functionally equivalent to assuming the worst, so I describe it this way.
Some quick comments:
The monotonicity principle is only a consequence of IB if you assume (as IBP does) Knightian uncertainty that allows for anything to be simulated somewhere. IBP assumes this essentially because it leads to natural mathematical properties of the behavior of hypotheses under ontology refinement, which I conjecture to be important for learning, but we still don’t know whether they are truly necessary.
I reiterate that I am not calling to immediately build a sovereign AI based on the 1st set of philosophical assumptions that came to my mind. I am only pointing out directions for investigation, which ultimately might lead us to a state of such confidence in our philosophical assumptions that even a sovereign AI becomes reasonable (and if they don’t, we will probably still learn much). Right now, we are not close to such a level of confidence.
There are additional approaches to either explaining or avoiding the monotonicity principle, for example the idea of transcartesian agents I mentioned here.
The sequence on Infra-Bayesianism motivates the min (a.k.a. Murphy) part of its argmax min by wanting to establish lower bounds on utility — that’s a valid viewpoint. My own interest in Infra-Bayesianism comes from a different motivation: Murphy’s min encodes directly into Infra-Bayesian decision making the generally true, inter-related facts that:
1. for an optimizer, uncertainty on the true world model injects noise into your optimization process, which almost always makes the outcome worse;
2. the optimizer’s curse usually results in you exploring outcomes whose true utility you had overestimated, so your regret is generally higher than you had expected;
3. most everyday environments and situations are already highly optimized, so random perturbations of their state almost invariably make things worse.
All of which justify pessimism and conservatism.
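Point 2 is easy to see in a small simulation (my own illustrative sketch, not something from the sequence): choose the option with the best noisy estimate and compare that estimate to the option’s true utility.

```python
# A small Monte Carlo illustration of the optimizer's curse: choosing the
# option with the highest noisy estimate systematically overestimates the
# true utility of the chosen option. All numbers are illustrative.
import random

random.seed(0)
n_options, noise_sd, trials = 20, 1.0, 10_000
total_gap = 0.0
for _ in range(trials):
    true_utils = [random.gauss(0.0, 1.0) for _ in range(n_options)]
    estimates = [u + random.gauss(0.0, noise_sd) for u in true_utils]
    best = max(range(n_options), key=lambda i: estimates[i])
    total_gap += estimates[best] - true_utils[best]

# Average (estimate - true utility) of the chosen option: reliably positive.
print(total_gap / trials)
```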
The problem with this argument is that it’s only true when the utility of the current state is higher than the utility of the maximum-entropy equilibrium state of the environment (the state that increasingly randomizing it tends to move it towards, due to the law of large numbers). In everyday situations this is almost always true: making random changes to a human body or in a city will almost invariably make things worse, for example. In most physical environments, randomizing them sufficiently (e.g. by raining meteorites on them, or whatever) will tend to reduce their utility to that of a blasted wasteland (the surface of the moon, for example, has pretty much reached equilibrium under randomization-by-meteorites, and has a very low utility).

However, it’s a general feature of human utility functions that there can often be states worse than the maximum-entropy equilibrium. If your environment is a 5-choice multiple-choice test whose utility is the score, the entropic equilibrium is random guessing, which will score 20%, and there are choose-wrong-answer policies that score less than that, all the way down to 0%; partially randomizing away from one of those policies will make its utility increase towards 20%. Similarly, consider a field of anti-personnel mines left over from a war: as an array of death and dismemberment waiting to happen, randomizing it with meteorite impacts will clear some mines and make its utility better, since it starts off actually worse than a blasted wasteland. Or, if a very smart GAI working on alignment research informed you that it had devised an acausal attack that would convert all permanent hellworlds anywhere in the multiverse into blasted wastelands, your initial assumption would probably be that doing so would be a good thing (modulo questions about whether it was pulling your leg, or whether the inhabitants of the hellworld would consider it a hellworld).
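In symbols (my notation, not the commenter’s): if a policy scores $s_\pi$ on its own, and we replace its answer with a uniformly random guess with probability $\varepsilon$, the expected score becomes

$$\mathbb{E}[\text{score}] = (1-\varepsilon)\,s_\pi + \varepsilon\cdot\tfrac{1}{5},$$

which moves toward the entropic-equilibrium score of 20% from either side: randomization hurts policies that do better than chance and helps the choose-wrong-answer policies that do worse.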
In general, hellworlds (or at least local hell-landscapes) like this are rare — they have a negligible probability of arising by chance, so creating one requires work by an unaligned optimizer. So currently, with humans the strongest optimizers on the planet, they usually only arise in adversarial situations such as wars between groups of humans (“War is Hell”, as the saying goes). However, Infra-Bayesianism has exactly the wrong intuitions about any hell-environment whose utility is currently lower than that of the entropic equilibrium. If you have an environment that has been carefully optimized by a powerful very-non-aligned optimizer so as to maximize human suffering, then random sabotage such as throwing monkey wrenches in the works or assassinating the diabolical mastermind is actually very likely to improve things (from a human point of view), at least somewhat. Infra-Bayesianism would predict otherwise. I think having your GAI’s decision theory based on a system that gives exactly the wrong intuitions about hellworlds is likely to be extremely dangerous.
The solution to this would be what one might call Meso-Bayesianism: renormalize your utility scores so that the utility of the maximal-entropy state of the environment is by definition zero, and then assume that Murphy minimizes the absolute value of the utility towards the equilibrium utility of zero, not towards a hellworld. (I’m not enough of a pure mathematician to have any idea what this modification does to the network of proofs, other than making the utility renormalization part of a Meso-Bayesian update more complicated.) So then your decision theory understands that any unaligned optimizer trying to create a hellworld is also fighting Murphy, and when fighting them on their home turf Murphy is your ally, since “it’s easier to destroy than create” is also true of hellworlds. [Despite the usual formulation of Murphy’s law, I actually think the name ‘Murphy’ suits this particular metaphysical force better: Infra-Bayesianism’s original ‘Murphy’ might have been better named ‘Satan’, since it is willing to go to any length to create a hellworld, hobbled only by some initially-unknown physical laws.]
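Here is a toy sketch of how I read that proposal (a naive rendering of my own, with no claim that it matches any workable formalism): renormalize utilities so the maximum-entropy equilibrium sits at zero, then have Murphy pull the outcome toward zero instead of toward the minimum.

```python
# A naive toy rendering of the "Meso-Bayesian" idea described above.
# The laws/hypotheses are represented naively as a set of allowed environments.

def meso_bayesian_choice(policies, allowed_environments, utility, equilibrium_utility):
    """The agent maximizes, but Murphy drags outcomes toward the renormalized
    equilibrium utility of zero rather than toward the global minimum."""
    def murphys_pick(policy):
        # Renormalize so the maximum-entropy equilibrium state sits at zero...
        outcomes = [utility(policy, env) - equilibrium_utility
                    for env in allowed_environments]
        # ...then Murphy minimizes |u|: he erodes both heavens and hellworlds
        # toward the blasted-wasteland baseline.
        return min(outcomes, key=abs)

    return max(policies, key=murphys_pick)
```

On this reading, a policy whose allowed outcomes all lie far below zero (a hellworld) gets evaluated by its outcome closest to the wasteland baseline, matching the intuition that Murphy is your ally against an unaligned optimizer.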
Reading your thoughts on sublinearity, my instinctive reaction is that of a CEV utilitarian, rather than feeling that copies matter less and less terminally. It seems to me that caring less about copies amounts to hardcoding curiosity into the utility function as a terminal goal, while I expect curiosity to be CEV-instrumental (curiosity as a property of the superintelligence, not in the sense of keeping around intrinsically curious entities, if that’s what we want the superintelligence to value).
I think that in a universe large enough relative to your utility’s rate of discount over similarity of computations, and under certain conditions on how you value the size and complexity of the computations, devaluing copies would lead to the garden of God in Unsong: you start tiling the universe with maximally viable satisfactory computations; when you have filled the space of such computations, given your “lattice spacing” of what counts as different, you go on creating less satisfactory computations, and so on; if you are not a utilitarian, you stop when you reach your arbitrary level of what counts as “neutral utility”; and, depending on the parameters, either you calculated this to happen at the end of the universe, or you go on repeating your whole construction until the universe is filled with copies of your hierarchy of entities. A utilitarian (who values complexity low enough not to fill the whole universe with a single entity) would instead tile the universe with the optimal entities, with differences only insofar as its utility says to make these entities value differences and be satisfied.
What do you think about this? Do you see it as a problem? Do you think it’s too unlikely to matter, for purely combinatorial reasons? Or something else?
Personally I like Unsong’s God, and I think His approach is better than tiling the Universe with copies of the same optimal entity (or with copies of an optimal neighborhood in which each being can encounter enough diversity to satisfy them).
The Unsong approach might still lead to uncomfortable outcomes, with some people tortured so that other people can have positive experiences different from the ones already tried (hence the solution to the Problem of Evil in Unsong), but I think that if we give big enough negative utilities to suffering, the system probably wouldn’t create people with overall very net-negative lives (and it could maybe put suffering p-zombie robots in the world if that’s really necessary for other people to have novel positive experiences). These are just my guesses, and I’m not confident that we can actually get this right; as I mentioned, I wouldn’t want to create any kind of utilitarian sovereign superintelligence. But I think that the weird asymmetry baked into infra-Bayesianism, that it can’t give negative utility to any event, makes the whole problem significantly harder and points at a weakness of IB.
Generally, an agent that always assumes the worst unless proven wrong would have no problem doing all kinds of horrible things, because by assumption they are still an improvement over what would happen otherwise.
I’m pretty sure that’s not how it works. By looking around, the agent very soon learns that some things are not maximally horrible: for example, the chair in the room is not broken, so presumably there is some kind of law constraining Murphy to keep the chair intact, at least for now. Why would the agent break the chair, then? Why would that be better than what would happen otherwise?