Superintelligence is related to three categories of suffering risk: suffering subroutines (Tomasik 2017), mind crime (Bostrom 2014) and flawed realization (Bostrom 2013).
5.1 Suffering subroutines
Humans have evolved to be capable of suffering, and while the question of which other animals are conscious or capable of suffering is controversial, pain analogues are present in a wide variety of animals. The U.S. National Research Council’s Committee on Recognition and Alleviation of Pain in Laboratory Animals (2004) argues that, based on the state of existing evidence, at least all vertebrates should be considered capable of experiencing pain.
Pain seems to have evolved because it has a functional purpose in guiding behavior: evolution having found it suggests that pain might be the simplest solution for achieving its purpose. A superintelligence which was building subagents, such as worker robots or disembodied cognitive agents, might then also construct them in such a way that they were capable of feeling pain—and thus possibly suffering (Metzinger 2015)—if that was the most efficient way of making them behave in a way that achieved the superintelligence’s goals.
Humans have also evolved to experience empathy towards each other, but the evolutionary reasons which cause humans to have empathy (Singer 1981) may not be relevant for a superintelligent singleton which had no game-theoretical reason to empathize with others. In such a case, a superintelligence which had no disincentive to create suffering but did have an incentive to create whatever furthered its goals, could create vast populations of agents which sometimes suffered while carrying out the superintelligence’s goals. Because of the ruling superintelligence’s indifference towards suffering, the amount of suffering experienced by this population could be vastly higher than it would be in e.g. an advanced human civilization, where humans had an interest in helping out their fellow humans.
Depending on the functional purpose of positive mental states such as happiness, the subagents might or might not be built to experience them. For example, Fredrickson (1998) suggests that positive and negative emotions have differing functions. Negative emotions bias an individual’s thoughts and actions towards some relatively specific response that has been evolutionarily adaptive: fear causes an urge to escape, anger causes an urge to attack, disgust an urge to be rid of the disgusting thing, and so on. In contrast, positive emotions bias thought-action tendencies in a much less specific direction. For example, joy creates an urge to play and be playful, but “play” includes a very wide range of behaviors, including physical, social, intellectual, and artistic play. All of these behaviors have the effect of developing the individual’s skills in whatever the domain. The overall effect of experiencing positive emotions is to build an individual’s resources—be those resources physical, intellectual, or social.
To the extent that this hypothesis were true, a superintelligence might design its subagents in such a way that they had pre-determined response patterns for undesirable situations, so exhibited negative emotions. However, if it was constructing a kind of a command economy in which it desired to remain in control, it might not put a high value on any subagent accumulating individual resources. Intellectual resources would be valued to the extent that they contributed to the subagent doing its job, but physical and social resources could be irrelevant, if the subagents were provided with whatever resources necessary for doing their tasks. In such a case, the end result could be a world whose inhabitants experienced very little if any in the way of positive emotions, but did experience negative emotions. [...]
5.2 Mind crime
A superintelligence might run simulations of sentient beings for a variety of purposes. Bostrom (2014, p. 152) discusses the specific possibility of an AI creating simulations of human beings which were detailed enough to be conscious. These simulations could then be placed in a variety of situations in order to study things such as human psychology and sociology, and be destroyed afterwards.
The AI could also run simulations that modeled the evolutionary history of life on Earth in order to obtain various kinds of scientific information, or to help estimate the likely location of the “Great Filter” (Hanson 1998) and whether it should expect to encounter other intelligent civilizations. This could repeat the wild-animal suffering (Tomasik 2015, Dorado 2015) experienced in Earth’s evolutionary history. The AI could also create and mistreat, or threaten to mistreat, various minds as a way to blackmail other agents. [...]
5.3 Flawed realization
A superintelligence with human-aligned values might aim to convert the resources in its reach into clusters of utopia, and seek to colonize the universe in order to maximize the value of the world (Bostrom 2003a), filling the universe with new minds and valuable experiences and resources. At the same time, if the superintelligence had the wrong goals, this could result in a universe filled by vast amounts of disvalue.
While some mistakes in value loading may result in a superintelligence whose goal is completely unlike what people value, certain mistakes could result in flawed realization (Bostrom 2013). In this outcome, the superintelligence’s goal gets human values mostly right, in the sense of sharing many similarities with what we value, but also contains a flaw that drastically changes the intended outcome.
For example, value-extrapolation (Yudkowsky 2004) and value-learning (Soares 2016, Sotala 2016) approaches attempt to learn human values in order to create a world that is in accordance with those values.
There have been occasions in history when circumstances that cause suffering have been defended by appealing to values which seem pointless to modern sensibilities, but which were nonetheless a part of the prevailing values at the time. In Victorian London, the use of anesthesia in childbirth was opposed on the grounds that being under the partial influence of anesthetics may cause “improper” and “lascivious” sexual dreams (Farr 1980), with this being considered more important to avoid than the pain of childbirth.
A flawed value-loading process might give disproportionate weight to historical, existing, or incorrectly extrapolated future values whose realization then becomes more important than the avoidance of suffering. Besides merely considering the avoidance of suffering less important than the enabling of other values, a flawed process might also tap into various human tendencies for endorsing or celebrating cruelty (see the discussion in section 4), or outright glorifying suffering. Small changes to a recipe for utopia may lead to a future with much more suffering than one shaped by a superintelligence whose goals were completely different from ours.
thanks for sharing. here are my thoughts on the possibilities in the quote.
Suffering subroutines—maybe 10-20% likely. i don’t think suffering reduces to “pre-determined response patterns for undesirable situations,” because i can think of simple algorithmic examples of that which don’t seem like suffering.
suffering feels like it’s about the sense of aversion/badness (often in response to a situation), and not about the policy “in <situation>, steer towards <new situation>”. (maybe humans were instilled with a policy of steering away from ‘suffering’ states generally, and that’s why evolution made us enter those states in some types of situation?) (though i’m confused about what suffering really is)
i would also give the example of positive-feeling emotions sometimes being narrowly directed. for example, someone can feel ‘excitement/joy’ about a gift or event and want to <go to/participate in> it. sexual and romantic subroutines can also be both narrowly-directed and positive-feeling. though these examples lack the element of a situation being steered away from, vs steering (from e.g. any neutral situation) towards other ones.
Suffering simulations—seems likely (75%?) for the estimation of universal attributes, such as the distribution of values. my main uncertainty is about whether there’s some other way for the ASIs to compute that information which is simple enough to be suffering-free. this also seems lower in magnitude than the other classes, because (unless it’s being calculated indefinitely for ever-greater precision) this computation terminates at some point, rather than lasting until heat death (or forever if it turns out that’s avoidable).
Blackmail—i don’t feel knowledgeable enough about decision theory to put a probability on this one, but in the case where it works (or is precommitted to under uncertainty in hopes that it works), it’s unfortunately a case where building aligned ASI would incentivize unaligned entities to do it.
Flawed realization—again i’m too uncertain about what real-world paths lead to this, but intuitively, it’s worryingly possible if the future contains LLM-based LTPAs (long term planning agents) intelligent enough to solve alignment and implement their own (possibly simulated) ‘values’.
Suffering subroutines—maybe 10-20% likely. i don’t think suffering reduces to “pre-determined response patterns for undesirable situations,” because i can think of simple algorithmic examples of that which don’t seem like suffering.
Yeah, I agree with this to be clear. Our intended claim wasn’t that just “pre-determined response patterns for undesirable situations” would be enough for suffering. Actually, there were meant to be two separate claims, which I guess we should have distinguished more clearly:
1) If evolution stumbled on pain and suffering, those might be relatively easy and natural ways to get a mind to do something. So an AGI that built other AGIs might also build them to experience pain and suffering (that it was entirely indifferent to), if that happened to be an effective motivational system.
2) If this did happen, then there’s also some speculation suggesting that an AI that wanted to stay in charge might not want to give its worker AGIs much in the way of things that looked like positive emotions, but did have a reason to give them things that looked like negative emotions. Which would then tilt the balance of pleasure vs. pain in the post-AGI world much more heavily in favor of (emotional) pain.
Now the second claim is much more speculative and I don’t even know if I’d consider it a particularly likely scenario (probably not); we just put it in since much of the paper was just generally listing various possibilities of what might happen. But the first claim—that since all the biological minds we know of seem to run on something like pain and pleasure, we should put a substantial probability on AGI architectures also ending up with something like that—seems much stronger to me.
i currently believe that working on superintelligence-alignment is likely the correct choice from a fully-negative-utilitarian perspective.[1]
for others, this may be an intuitive statement or unquestioned premise. for me it is not, and i’d like to state my reasons for believing it, partially as a response to this post concerned about negative utilitarians trying to accelerate progress towards an unaligned-ai-takeover.
there was a period during which i was more uncertain about this question, and avoided openly sharing minimally-dual-use alignment research (but did not try to accelerate progress towards a nonaligned-takeover) while resolving that uncertainty.
a few relevant updates since then:
a decrease in the probability that the values an aligned AI would have would endorse human-caused moral catastrophes such as human-caused animal suffering.
i did not automatically believe humans to be good-by-default, and wanted to take time to seriously consider what i think should be a default hypothesis-for-consideration upon existing in a society that generally accepts an ongoing mass torture event.
awareness of vastly worse possible s-risks.
factory farming is a form of physical torture, by which i mean torture of a mind done through the indirect route of affecting its input channels (body/senses). it is also a form of psychological torture. it is very bad, but situations which are orders of magnitude worse seem possible, where a mind is modulated directly (on the neuronal level) and fully.
compared to ‘in-distribution suffering’ (e.g. animal suffering, human-social conflicts), i find it even less probable that an AI aligned to some human-specified values[2] would create a future with this.
i think it’s plausible that it exists rarely in other parts of the world, though, and if so would be important to prevent through acausal trade if we can.
(also see Kaj Sotala’s reply about some plausible incidental s-risks)
i am not free of uncertainty about the topic, though.
in particular, if disvalue of suffering is common across the world, such that the suffering which can be reduced through acausal trade will be reduced through acausal trade regardless of whether we create an AI which disvalues suffering, then it would no longer be the case that working on alignment is the best decision for a purely negative utilitarian.
despite this uncertainty, my current belief is that the possibility of reducing suffering via acausal trade (including possibly such really-extreme forms of suffering) outweighs the probability and magnitude of human-aligned-AI-caused suffering.[3]
also, to be clear, if it ever seems that an actualized s-risk takeover event is significantly more probable than it seems now[4] as a result of unknown future developments, i would fully endorse causing a sooner unaligned-but-not-suffering takeover to prevent it.
i find it easier to write this post as explaining my position as “even for a pure negative utilitarian, i think it’s the correct choice”, because it lets us ignore individual differences in how much moral weight is assigned to suffering relative to everything else.
i think it’s pretty improbable that i would, on ‘idealized reflection’/CEV, endorse total-negative-utilitarianism (which has been classically pointed out as implying, e.g., preferring a universe with nothing to a universe containing a robust utopia plus an instance of light suffering).
i self-describe as a “suffering-focused altruist” or “negative-leaning-utilitarian.” i.e., suffering seems much worse to me than happiness seems good.
(though certainly there are some individual current humans who would do this, for example to digital minds, if given the ability to do so. rather, i’m expressing a belief that it’s very probable that an aligned AI which practically results from this situation would not allow that to happen.)
(by ‘human-aligned AI’, I mean one pointed to an actual CEV of one or a group of humans (which could indirectly imply the ‘CEV of everyone’, but without it actually failing to be that, failing in the way described below, or allowing the cruel values of some individuals to enter into it).
I don’t mean an AI aligned to some sort of ‘current institutional process’, like voting, involving all living humans; I think that should be avoided due to politicization risk and the potential for lock-in of present/unreflective (by which I mean cruel) values.)
there’s some way to formalize, with bayesian equations, how likely an s-risk needs to be (relative to a good outcome) for terminating a timeline to be the right choice from a negative-utilitarian perspective.
it would take as input probability distributions related to ‘the frequency of suffering-disvalue across the universal distribution of ASIs’ and ‘the frequency of various forms of s-risks that are preventable with acausal trade’. i might create this formalization later.
if we think there’s pretty certainly more preventable-through-trade-type suffering-events than there are altruistic ASIs to prevent them, a local preventable-type s-risk might actually need to be ‘more likely than the good/suffering-disvaluing outcome’ for termination to be preferred.
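as a rough illustration of the shape such a formalization might take (not the formalization itself), here is a minimal sketch in python. the beta priors, the variable names, and the choice to measure trade-reducible suffering in units of ‘one local s-risk worth of suffering’ are all placeholder assumptions for illustration, not considered estimates.

```python
# sketch of the decision rule gestured at above. everything here is a
# placeholder assumption: the priors, the names, and the normalization.

import random

def terminate_preferred(p_srisk, p_aligned, frac_altruistic, frac_preventable):
    """returns True if terminating the timeline (a sooner unaligned-but-not-suffering
    takeover) is preferred under a purely suffering-focused criterion."""
    s_local = 1.0  # expected suffering of the local preventable-type s-risk (normalized)
    # fraction of preventable suffering-events not already covered by other
    # suffering-disvaluing ASIs; roughly what one additional aligned ASI's
    # acausal trade could still address.
    uncovered = max(0.0, frac_preventable - frac_altruistic) / max(frac_preventable, 1e-9)
    s_via_trade = uncovered * s_local  # placeholder scaling assumption
    return p_srisk * s_local > p_aligned * s_via_trade

# monte-carlo over placeholder priors on the two 'frequency' distributions
# mentioned above, plus priors on the local outcome probabilities.
samples = [
    terminate_preferred(
        p_srisk=random.betavariate(1, 20),         # local preventable-type s-risk
        p_aligned=random.betavariate(5, 5),        # local aligned, suffering-disvaluing ASI
        frac_altruistic=random.betavariate(2, 5),  # suffering-disvalue across the universal distribution of ASIs
        frac_preventable=random.betavariate(2, 2), # trade-preventable s-risks across that distribution
    )
    for _ in range(100_000)
]
print(sum(samples) / len(samples))  # fraction of samples where termination is preferred
```

note that in the regime where frac_preventable is much larger than frac_altruistic, ‘uncovered’ is close to 1 and the rule reduces to roughly p_srisk > p_aligned, matching the point above about the local s-risk needing to be ‘more likely than the good/suffering-disvaluing outcome’.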
Considering how long it took me to get that by this you mean “not dual-use”, I expect some others just won’t get it.
You may find Superintelligence as a Cause or Cure for Risks of Astronomical Suffering of interest; among other things, it discusses s-risks that might come about from having unaligned AGI.