[Question] What are the best arguments for/​against AIs being “slightly ‘nice’”?

A while ago, Nate Soares wrote the posts “Decision theory does not imply that we get to have nice things”, “Cosmopolitan values don’t come free”, and “But why would the AI kill us?”

Paul Christiano put forth some arguments that “it seems pretty plausible that AI will be at least somewhat ‘nice’”, similar to how humans are somewhat nice to animals. There was some back-and-forth.

More recently we had Eliezer’s post ASIs will not leave just a little sunlight for Earth.

I have a sense that something feels “unresolved” here. The current comments on Eliezer’s post look likely to be rehashing the basics and I’d like to actually make some progress on distilling the best arguments. I’d like it if we got more explicit debate about this.

I also have some sense that the people previously involved (i.e. Nate, Paul, Eliezer) are sort of tired of arguing with each other. But I am hoping someones-or-other end up picking up the arguments here, hashing them out more, and/​or writing more distilled summaries of the arguments/​counterarguments.

To start with, I figured I would just literally repeat most of the previous comments in a top-level post, to give everyone another chance to read through them.

Without further ado, here they are:

Paul and Nate

Paul Christiano re: “Cosmopolitan Values Don’t Come for Free.”

I want to keep picking a fight about “will the AI care so little about humans that it just kills them all?” This is different from a broader sense of cosmopolitanism, and moreover I’m not objecting to the narrow claim “doesn’t come for free.” But it’s directly related to the actual emotional content of your parables and paragraphs, and it keeps coming up recently with you and Eliezer, and I think it’s an important way that this particular post looks wrong even if the literal claim is trivially true.

(Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that’s likely to be a mistake even if it doesn’t lead to billions of deaths.)

Humans care about the preferences of other agents they interact with (not much, just a little bit!), even when those agents are weak enough to be powerless. It’s not just that we have some preferences about the aesthetics of cows, which could be better optimized by having some highly optimized cow-shaped objects. It’s that we actually care (a little bit!) about the actual cows getting what they actually want, trying our best to understand their preferences and act on them and not to do something that they would regard as crazy and perverse if they understood it.

If we kill the cows, it’s because killing them meaningfully helped us achieve some other goals. We won’t kill them for arbitrarily insignificant reasons. In fact I think it’s safe to say that we’d collectively allocate much more than 1/​millionth of our resources towards protecting the preferences of whatever weak agents happen to exist in the world (obviously the cows get only a small fraction of that).

Before really getting into it, some caveats about what I want to talk about:

  • I don’t want to focus on whatever form of altruism you and Eliezer in particular have (which might or might not be more dependent on some potentially-idiosyncratic notion of “sentience.”) I want to talk about caring about whatever weak agents happen to actually exist, which I think is reasonably common amongst humans. Let’s call that “kindness” for the purpose of this comment. I don’t think it’s a great term but it’s the best short handle I have.

  • I’ll talk informally about how quantitatively kind an agent is, by which I mean something like: how much of its resources it would allocate to helping weak agents get what they want? How highly does it weigh that part of its preferences against other parts? To the extent it can be modeled as an economy of subagents, what fraction of them are kind (or were kind pre-bargain)?

  • I don’t want to talk about whether the aliens would be very kind. I specifically want to talk about tiny levels of kindness, sufficient to make a trivial effort to make life good for a weak species you encounter but not sufficient to make big sacrifices on its behalf.

  • I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival; I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer to just use the humans for atoms.

You and Eliezer seem to think there’s a 90% chance that AI will be <1/trillion kind (perhaps even a 90% chance that they have exactly 0 kindness?). But we have one example of a smart mind, and in fact: (i) it has tons of diverse shards of preference-on-reflection, varying across and within individuals (ii) it has >1/million kindness. So it’s superficially striking to be confident AI systems will have a million times less kindness.

I have no idea under what conditions evolved or selected life would be kind. The more preferences are messy with lots of moving pieces, the more probable it is that at least 1/​trillion of those preferences are kind (since the less correlated the trillion different shards of preference are with one another and so the more chances you get). And the selection pressure against small levels of kindness is ~trivial, so this is mostly a question about idiosyncrasies and inductive biases of minds rather than anything that can be settled by an appeal to selection dynamics.
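One way to make the “more chances” point concrete is a toy calculation. The sketch below assumes, purely for illustration, that a mind has some number of independent preference shards and that each has the same small chance of being kind; the function name and both parameters are hypothetical and do not come from the original comment. Correlation between shards would act like a smaller effective number of shards.

```python
import math


def prob_at_least_one_kind(n_shards: int, p_kind: float) -> float:
    """Chance that at least one of n independent shards is kind.

    Toy assumption: every shard is kind independently with probability
    p_kind, so the answer is 1 - (1 - p_kind) ** n_shards.
    """
    # log1p and expm1 keep precision when p_kind is very small.
    return -math.expm1(n_shards * math.log1p(-p_kind))


if __name__ == "__main__":
    # Same per-shard chance, increasingly messy minds.
    for n in (1_000, 1_000_000, 1_000_000_000):
        print(n, prob_at_least_one_kind(n, 1e-6))
```

Under these assumptions the probability rises with the number of independent shards, which is the direction of the argument; the model says nothing about what the per-shard chance actually is.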

I can’t tell if you think kindness is rare amongst aliens, or if you think it’s common amongst aliens but rare amongst AIs.[1] Either way, I would like to understand why you think that. What is it that makes humans so weird in this way?

(And maybe I’m being unfair here by lumping you and Eliezer together—maybe in the previous post you were just talking about how the hypothetical AI that had 0 kindness would kill us, and in this post how kindness isn’t guaranteed. But you give really strong vibes in your writing, including this post. And in other places I think you do say things that don’t actually add up unless you think that AI is very likely to be <1/​trillion kind. But at any rate, if this post is unfair to you, then you can just sympathize and consider it directed at Eliezer instead who lays out this position much more explicitly though not in a convenient place to engage with.)

Here are some arguments you could make that kindness is unlikely, and my objections:

  • “We can’t solve alignment at all.” But evolution is making no deliberate effort to make humans kind, so this is a non-sequitur.

  • “This is like a Texas sharpshooter hitting the side of a barn then drawing a target around the point they hit; every evolved creature might decide that their own idiosyncrasies are common but in reality none of them are.” But all the evolved creatures wonder if a powerful AI they built would kill them or if it would be kind. So we’re all asking the same question, we’re not changing the question based on our own idiosyncratic properties. This would have been a bias if we’d said: humans like art, so probably our AI will like art too. In that case the fact that we were interested in “art” was downstream of the fact that humans had this property. But for kindness I think we just have n=1 sample of observing a kind mind, without any analogous selection effect undermining the inference.

  • “Kindness is just a consequence of misfiring [kindness for kin / attachment to babies / whatever other simple story].” AI will be selected in its own ways that could give rise to kindness (e.g. being selected to do things that humans like, or to appear kind). The a priori argument for why that selection would lead to kindness seems about as good as the a priori argument for humans. And on the other side, the incentives for humans to be not kind seem if anything stronger than the incentives for ML systems to not be kind. This mostly seems like ungrounded evolutionary psychology, though maybe there are some persuasive arguments or evidence I’ve just never seen.

  • “Kindness is a result of the suboptimality inherent in compressing a brain down into a genome.” ML systems are suboptimal in their own random set of ways, and I’ve never seen any persuasive argument that one kind of suboptimality would lead to kindness and the other wouldn’t (I think the reverse direction is equally plausible). Note also that humans absolutely can distinguish powerful agents from weak agents, and they can distinguish kin from unrelated weak agents, and yet we care a little bit about all of them. So the super naive arguments for suboptimality (that might have appealed to information bottlenecks in a more straightforward way) just don’t work. We are really playing a kind of complicated guessing game about what is easy for SGD vs easy for a genome shaping human development.

  • “Kindness seems like it should be rare a priori, we can’t update that much from n=1.” But the a priori argument is a poorly grounded guess about the inductive biases of spaces of possible minds (and genomes), since the levels of kindness we are talking about are too small to be under meaningful direct selection pressure. So I don’t think the a priori arguments are even as strong as the n=1 observation. On top of that, the more that preferences are diverse and incoherent the more chances you have to get some kindness in the mix, so you’d have to be even more confident in your a priori reasoning.

  • “Kindness is a totally random thing, just like maximizing squiggles, so it should represent a vanishingly small fraction of generic preferences, much less than 1/trillion.” Setting aside my a priori objections to this argument, we have an actual observation of an evolved mind having >1/million kindness. So evidently it’s just not that rare, and the other points on this list respond to various objections you might have used to try to salvage the claim that kindness is super rare despite occurring in humans (this isn’t analogous to a Texas sharpshooter, there aren’t great debunking explanations for why humans but not ML would be kind, etc.). See this twitter thread where I think Eliezer is really off base, both on this point and on the relevance of diverse and incoherent goals to the discussion.

Note that in this comment I’m not touching on acausal trade (with successful humans) or ECL. I think those are very relevant to whether AI systems kill everyone, but are less related to this implicit claim about kindness which comes across in your parables (since acausally trading AIs are basically analogous to the ants who don’t kill us because we have power).

A final note, more explicitly lumping you with Eliezer: if we can’t get on the same page about our predictions I’m at least aiming to get folks to stop arguing so confidently for death given takeover. It’s easy to argue that AI takeover is very scary for humans, has a significant probability of killing billions of humans from rapid industrialization and conflict, and is a really weighty decision even if we don’t all die and it’s “just” handing over control over the universe. Arguing that P(death|takeover) is 100% rather than 50% doesn’t improve your case very much, but it means that doomers are often getting into fights where I think they look unreasonable.

I think OP’s broader point seems more important and defensible: “cosmopolitanism isn’t free” is a load-bearing step in explaining why handing over the universe to AI is a weighty decision. I’d just like to decouple it from “complete lack of kindness.”

His followup comment continues:

Eliezer has a longer explanation of his view here.

My understanding of his argument is: there are a lot of contingencies that reflect how and whether humans are kind. Because there are so many contingencies, it is somewhat unlikely that aliens would go down a similar route, and essentially impossible for ML. So maybe aliens have a 5% probability of being nice and ML systems have ~0% probability of being nice. I think this argument is just talking about why we shouldn’t update too much from humans, and there is an important background assumption that kindness is super weird and so won’t be produced very often by other processes, i.e. the only reason to think it might happen is that it happened in the single case we observed.

I find this pretty unconvincing. He lists like 10 things (humans need to trade favors, we’re not smart enough to track favors and kinship explicitly, and we tend to be allied with nearby humans so want to be nice to those around us, we use empathy to model other humans, and we had religion and moral realism for contingent reasons, we weren’t optimized too much once we were smart enough that our instrumental reasoning screens off kindness heuristics).

But no argument is given for why these are unusually kindness-inducing settings of the variables. And the outcome isn’t like a special combination of all of them, they each seem like factors that contribute randomly. It’s just a lot of stuff mixing together.

Presumably there is no process that ensures humans have lots of kindness-inducing features (and we didn’t select kindness as a property for which humans were notable, we’re just asking the civilization-independent question “does our AI kill us”). So if you list 10 random things that make humans more kind, it strongly suggests that other aliens will also have a bunch of random things that make them more kind. It might not be 10, and the net effect might be larger or smaller. But:

  • I have no idea whatsoever how you are anchoring this distribution, and giving it a narrow enough spread to have confident predictions.

  • Statements like “kindness is super weird” are wildly implausible if you’ve just listed 5 independent plausible mechanisms for generating kindness. You are making detailed quantitative guesses here, not ruling something out for any plausible a priori reason.

As a matter of formal reasoning, listing more and more contingencies that combine apparently-additively tends to decrease rather than increase the variance of kindness across the population. If there was just a single random thing about humans that drove kindness it would be more plausible that we’re extreme. If you are listing 10 things then things are going to start averaging out (and you expect that your 10 things are cherry-picked to be the ones most relevant to humans, but you can easily list 10 more candidates).
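The averaging claim can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the original comment: each mind’s kindness is modeled as the sum of k independent contributions drawn uniformly from [0, 1], and “variance across the population” is read as spread relative to the typical level (standard deviation divided by mean). All names and parameters are hypothetical.

```python
import math
import random
import statistics


def analytic_relative_spread(k: int) -> float:
    """Std/mean for a sum of k independent Uniform(0, 1) factors.

    Each factor has mean 1/2 and standard deviation sqrt(1/12), so the
    sum has mean k/2 and standard deviation sqrt(k/12).
    """
    return math.sqrt(k / 12) / (k / 2)


def simulated_relative_spread(k: int, n_minds: int = 5_000, seed: int = 0) -> float:
    """Estimate the same ratio by sampling a population of toy minds."""
    rng = random.Random(seed)
    totals = [sum(rng.random() for _ in range(k)) for _ in range(n_minds)]
    return statistics.pstdev(totals) / statistics.fmean(totals)


if __name__ == "__main__":
    for k in (1, 10, 100):
        print(k, analytic_relative_spread(k), simulated_relative_spread(k))
```

In this toy model the relative spread shrinks like 1/sqrt(k), so a mind built from many additive factors is less likely to be an extreme outlier than one driven by a single factor. Whether real kindness-inducing factors are independent or additive is exactly the kind of assumption the argument depends on.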

In fact it’s easy to list analogous things that could apply to ML (and I can imagine the identical conversation where hypothetical systems trained by ML are talking about how stupid it is to think that evolved life could end up being kind). Most obviously, they are trained in an environment where being kind to humans is a very good instrumental strategy. But they are also trained to closely imitate humans who are known to be kind, they’ve been operating in a social environment where they are very strongly expected to appear to be kind, etc. Eliezer seems to believe this kind of thing gets you “ice cream and condoms” instead of kindness OOD, but just one sentence ago he explained why similar (indeed, superficially much weaker!) factors led to humans retaining niceness out of distribution. I just don’t think we have the kind of a priori asymmetry or argument here that would make you think humans are way kinder than models. Yeah it can get you to ~50% or even somewhat lower, but ~0% seems like a joke.

There was one argument that I found compelling, which I would summarize as: humans were optimized while they were dumb. If evolution had kept optimizing us while we got smart, eventually we would have stopped being so kind. In ML we just keep on optimizing as the system gets smart. I think this doesn’t really work unless being kind is a competitive disadvantage for ML systems on the training distribution. But I do agree that if you train your AI long enough on cases where being kind is a significant liability, it will eventually stop being kind.

Nate Soares’s reply

Short version: I don’t buy that humans are “micro-pseudokind” in your sense; if you say “for just $5 you could have all the fish have their preferences satisfied” I might do it, but not if I could instead spend $5 on having the fish have their preferences satisfied in a way that ultimately leads to them ascending and learning the meaning of friendship, as is entangled with the rest of my values.


Meta:

Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that’s likely to be a mistake even if it doesn’t lead to billions of deaths.

So for starters, thanks for making acknowledgements about places we apparently agree, or otherwise attempting to demonstrate that you’ve heard my point before bringing up other points you want to argue about. (I think this makes arguments go better.) (I’ll attempt some of that myself below.)

Secondly, note that it sounds to me like you took a diametric-opposite reading of some of my intended emotional content (which I acknowledge demonstrates flaws in my writing). For instance, I intended the sentence “At that very moment they hear the dinging sound of an egg-timer, as the next-token-predictor ascends to superintelligence and bursts out of its confines” to be a caricature so blatant as to underscore the point that I wasn’t making arguments about takeoff speeds, but was instead focusing on the point about “complexity” not being a saving grace (and “monomaniacalism” not being the issue here). (Alternatively, perhaps I misunderstand what things you call the “emotional content” and how you’re reading it.)

Thirdly, I note that for whatever it’s worth, when I go to new communities and argue this stuff, I don’t try to argue people into >95% chance we’re all going to die in <20 years. I just try to present the arguments as I see them (without hiding the extremity of my own beliefs, nor while particularly expecting to get people to a similarly-extreme place with, say, a 30min talk). My 30min talk targets are usually something more like “>5% probability of existential catastrophe in <20y”. So insofar as you’re like “I’m aiming to get you to stop arguing so confidently for death given takeover”, you might already have met your aims in my case.

(Or perhaps not! Perhaps there’s plenty of emotional-content leaking through given the extremity of my own beliefs, that you find particularly detrimental. To which the solution is of course discussion on the object-level, which I’ll turn to momentarily.)


Object:

First, I acknowledge that if an AI cares enough to spend one trillionth of its resources on the satisfaction of fulfilling the preferences of existing “weak agents” in precisely the right way, then there’s a decent chance that current humans experience an enjoyable future.

With regards to your arguments about what you term “kindness” and I shall term “pseudokindness” (on account of thinking that “kindness” brings too much baggage), here’s a variety of places that it sounds like we might disagree:

  • Pseudokindness seems underdefined, to me, and I expect that many ways of defining it don’t lead to anything like good outcomes for existing humans.

    • Suppose the AI is like “I am pico-pseudokind; I will dedicate a trillionth of my resources to satisfying the preferences of existing weak agents by granting those existing weak agents their wishes”, and then only the most careful and conscientious humans manage to use those wishes in ways that leave them alive and well.

    • There are lots and lots of ways to “satisfy the preferences” of the “weak agents” that are humans. Getting precisely the CEV (or whatever it should be repaired into) is a subtle business. Most humans probably don’t yet recognize that they could or should prefer taking their CEV over various more haphazard preference-fulfilments that ultimately leave them unrecognizable and broken. (Or, consider what happens when a pseudokind AI encounters a baby, and seeks to satisfy its preferences. Does it have the baby age?)

    • You’ve got to do some philosophy to satisfy the preferences of humans correctly. And the issue isn’t that the AI couldn’t solve those philosophy problems correctly-according-to-us, it’s that once we see how wide the space of “possible ways to be pseudokind” is, then “pseudokind in the manner that gives us our CEVs” starts to feel pretty narrow against “pseudokind in the manner that fulfills our revealed preferences, or our stated preferences, or the poorly-considered preferences of philosophically-immature people, or whatever”.

  • I doubt that humans are micro-pseudokind, as defined. And so in particular, all your arguments of the form “but we’ve seen it arise once” seem suspect to me.

    • Like, suppose we met fledgeling aliens, and had the opportunity to either fulfil their desires, or leave them alone to mature, or affect their development by teaching them the meaning of friendship. My guess is that we’d teach them the meaning of friendship. I doubt we’d hop in and fulfil their desires.

    • (Perhaps you’d counter with something like: well if it was super cheap, we might make two copies of the alien civilization, and fulfil one’s desires and teach the other the meaning of friendship. I’m skeptical, for various reasons.)

    • More generally, even though “one (mill|trill)ionth” feels like a small fraction, the obvious ways to avoid dedicating even a (mill|trill)ionth of your resources to X is if X is right near something even better that you might as well spend the resources on instead.

    • There’s all sorts of ways to thumb the scales in how a weak agent develops, and there’s many degrees of freedom about what counts as a “pseudo-agent” or what counts as “doing justice to its preferences”, and my read is that humans take one particular contingent set of parameters here and AIs are likely to take another (and that the AI’s other-settings are likely to lead to behavior not-relevantly-distinct from killing everyone).

    • My read is that insofar as humans do have preferences about doing right by other weak agents, they have all sorts of desire-to-thumb-the-scales mixed in (such that humans are not actually pseudokind, for all that they might be kind).

  • I have a more-difficult-to-articulate sense that “maybe the AI ends up pseudokind in just the right way such that it gives us a (small, limited, ultimately-childless) glorious transhumanist future” is the sort of thing that reality gets to say “lol no” to, once you learn more details about how the thing works internally.

Most of my argument here is that the space of ways things can end up “caring” about the “preferences” of “weak agents” is wide, and most points within it don’t end up being our point in it, and optimizing towards most points in it doesn’t end up keeping us around at the extremes. My guess is mostly that the space is so wide that you don’t even end up with AIs warping existing humans into unrecognizable states, but do in fact just end up with the people dead (modulo distant aliens buying copies, etc).

I haven’t really tried to quantify how confident I am of this; I’m not sure whether I’d go above 90%, \shrug.


It occurs to me that one possible source of disagreement here is, perhaps you’re trying to say something like:

Nate, you shouldn’t go around saying “if we don’t competently intervene, literally everybody will die” with such a confident tone, when you in fact think there’s a decent chance of scenarios where the AIs keep people around in some form, and make some sort of effort towards fulfilling their desires; most people don’t care about the cosmic endowment like you do; the bluntly-honest and non-manipulative thing to say is that there’s a decent chance they’ll die and a better chance that humanity will lose the cosmic endowment (as you care about more than they do),

whereas my stance has been more like

most people I meet are skeptical that uploads count as them; most people would consider scenarios where their bodies are destroyed by rapid industrialization of Earth but a backup of their brain is stored and then later run in simulation (where perhaps it’s massaged into an unrecognizable form, or kept in an alien zoo, or granted a lovely future on account of distant benefactors, or …) to count as “death”; and also those exotic scenarios don’t seem all that likely to me, so it hasn’t seemed worth caveating.

I’m somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.

I’m considering adding footnotes like “note that when I say “I expect everyone to die”, I don’t necessarily mean “without ever some simulation of that human being run again”, although I mostly don’t think this is a particularly comforting caveat”, in the relevant places. I’m curious to what degree that would satisfy your aims (and I welcome workshopped wording on the footnotes, as might both help me make better footnotes and help me understand better where you’re coming from).

Paul’s reply:

I disagree with this but am happy your position is laid out. I’ll just try to give my overall understanding and reply to two points.

Like Oliver, it seems like you are implying:

Humans may be nice to other creatures in some sense, but if the fish were to look at the future that we’d achieve for them using the 1/billionth of resources we spent on helping them, it would be as objectionable to them as “murder everyone” is to us.

I think that normal people being pseudokind in a common-sensical way would instead say:

If we are trying to help some creatures, but those creatures really dislike the proposed way we are “helping” them, then we should try a different tactic for helping them.

I think that some utilitarians (without reflection) plausibly would “help the humans” in a way that most humans consider as bad as being murdered. But I think this is an unusual feature of utilitarians, and most people would consult the beneficiaries, observe they don’t want to be murdered, and so not murder them.

I think that saying “Helping someone in a way they like, sufficiently precisely to avoid things like murdering them, requires precisely the right form of caring—and that’s super rare” is a really misleading sense of how values work and what targets are narrow. I think this is more obvious if you are talking about how humans would treat a weaker species. If that’s the state of the disagreement I’m happy to leave it there.

I’m somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.

This is an important distinction at 1/​trillion levels of kindness, but at 1/​billion levels of kindness I don’t even think the humans have to die.

Nate:

If we are trying to help some creatures, but those creatures really dislike the proposed way we are “helping” them, then we should do something else.

My picture is less like “the creatures really dislike the proposed help”, and more like “the creatures don’t have terribly consistent preferences, and endorse each step of the chain, and wind up somewhere that they wouldn’t have endorsed if you first extrapolated their volition (but nobody’s extrapolating their volition or checking against that)”.

It sounds to me like your stance is something like “there’s a decent chance that most practically-buildable minds pico-care about correctly extrapolating the volition of various weak agents and fulfilling that extrapolated volition”, which I am much more skeptical of than the weaker “most practically-buildable minds pico-care about satisfying the preferences of weak agents in some sense”.

Paul:

We’re not talking about practically building minds right now, we are talking about humans.

We’re not talking about “extrapolating volition” in general. We are talking about whether—in attempting to help a creature with preferences about as coherent as human preferences—you end up implementing an outcome that creature considers as bad as death.

For example, we are talking about what would happen if humans were trying to be kind to a weaker species that they had no reason to kill, that could nevertheless communicate clearly and had preferences about as coherent as human preferences (while being very alien).

And those creatures are having a conversation amongst themselves before the humans arrive wondering “Are the humans going to murder us all?” And one of them is saying “I don’t know, they don’t actually benefit from murdering us and they seem to care a tiny bit about being nice, maybe they’ll just let us do our thing with 1/​trillionth of the universe’s resources?” while another is saying “They will definitely have strong opinions about what our society should look like and the kind of transformation they implement is about as bad by our lights as being murdered.”

In practice attempts to respect someone’s preferences often involve ideas like autonomy and self-determination and respect for their local preferences. I really don’t think you have to go all the way to extrapolated volition in order to avoid killing everyone.

Nate:

Is this a reasonable paraphrase of your argument?

Humans wound up caring at least a little about satisfying the preferences of other creatures, not in a “grant their local wishes even if that ruins them” sort of way but in some other intuitively-reasonable manner.

Humans are the only minds we’ve seen so far, and so having seen this once, maybe we start with a 50%-or-so chance that it will happen again.

You can then maybe drive this down a fair bit by arguing about how the content looks contingent on the particulars of how humans developed or whatever, and maybe that can drive you down to 10%, but it shouldn’t be able to drive you down to 0.1%, especially not if we’re talking only about incredibly weak preferences.

If so, one guess is that a bunch of disagreement lurks in this “intuitively-reasonable manner” business.

A possible locus of disagreement: it looks to me like, if you give humans power before you give them wisdom, it’s pretty easy to wreck them while simply fulfilling their preferences. (Ex: lots of teens have dumbass philosophies, and might be dumb enough to permanently commit to them if given that power.)

More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfil certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.

(I separately expect that if we were doing something more like the volition-extrapolation thing, we’d be tempted to bend the process towards “and they learn the meaning of friendship”.)

That said, this conversation is updating me somewhat towards “a random UFAI would keep existing humans around and warp them in some direction it prefers, rather than killing them”, on the grounds that the argument “maybe preferences-about-existing-agents is just a common way for rando drives to shake out” plausibly supports it to a threshold of at least 1 in 1000. I’m not sure where I’ll end up on that front.

Another attempt at naming a crux: It looks to me like you see this human-style caring about others’ preferences as particularly “simple” or “natural”, in a way that undermines “drawing a target around the bullseye”-type arguments, whereas I could see that argument working for “grant all their wishes (within a budget)” but am much more skeptical when it comes to “do right by them in an intuitively-reasonable way”.

(But that still leaves room for an update towards “the AI doesn’t necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or suchlike, as might be the sort of whims that rando drives shake out into”, which I’ll chew on.)

Nate and Paul had an additional thread, which initially was mostly some meta on the conversation about what exactly Nate was trying to argue and what exactly Paul was annoyed at.

I’m skipping most of it here for brevity (you can read it here).

But eventually Nate says:

Thanks! I’m curious for your paraphrase of the opposing view that you think I’m failing to understand.

Paul says:

I think a closer summary is:

Humans and AI systems probably want different things. From the human perspective, it would be better if the universe was determined by what the humans wanted. But we shouldn’t be willing to pay huge costs, and shouldn’t attempt to create a slave society where AI systems do humans’ bidding forever, just to ensure that human values win out. After all, we really wouldn’t want that outcome if our situations had been reversed. And indeed we are the beneficiary of similar values-turnover in the past, as our ancestors have been open (perhaps by necessity rather than choice) to values changes that they would sometimes prefer hadn’t happened.

We can imagine really sterile outcomes, like replicators colonizing space with an identical pattern repeated endlessly, or AI systems that want to maximize the number of paperclips. And considering those outcomes can help undermine the cosmopolitan intuition that we should respect the AI we build. But in fact that intuition pump relies crucially on its wildly unrealistic premises, that the kind of thing brought about by AI systems will be sterile and uninteresting. If we instead treat “paperclip” as an analog for some crazy weird shit that is alien and valence-less to humans, drawn from the same barrel of arbitrary and diverse desires that can be produced by selection processes, then the intuition pump loses all force. I’m back to feeling like our situations could have been reversed, and we shouldn’t be total assholes to the AI.

I don’t think that requires anything at all about AI systems converging to cosmopolitan values in the sense you are discussing here. I do think it is much more compelling if you accept some kind of analogy between the sorts of processes shaping human values and the processes shaping AI values, but this post (and the references you cite and other discussions you’ve had) don’t actually engage with the substance of that analogy and the kinds of issues raised in my comment are much closer to getting at the meat of the issue.

I also think the “not for free” part doesn’t contradict the views of Rich Sutton. I asked him this question and he agrees that all else equal it would be better if we handed off to human uploads instead of powerful AI. I think his view is that the proposed course of action from the alignment community is morally horrifying (since in practice he thinks the alternative is “attempt to have a slave society,” not “slow down AI progress for decades”—I think he might also believe that stagnation is much worse than a handoff but haven’t heard his view on this specifically) and that even if you are losing something in expectation by handing the universe off to AI systems it’s not as bad as the alternative.

Nate says:

Thanks! Seems like a fine summary to me, and likely better than I would have done, and it includes a piece or two that I didn’t have (such as an argument from symmetry if the situations were reversed). I do think I knew a bunch of it, though. And e.g., my second parable was intended to be a pretty direct response to something like

If we instead treat “paperclip” as an analog for some crazy weird shit that is alien and valence-less to humans, drawn from the same barrel of arbitrary and diverse desires that can be produced by selection processes, then the intuition pump loses all force.

where it’s essentially trying to argue that this intuition pump still has force in precisely this case.

Paul says:

To the extent the second parable has this kind of intuitive force I think it comes from: (i) the fact that the resulting values still sound really silly and simple (which I think is mostly deliberate hyperbole), (ii) the fact that the AI kills everyone along the way.

Eliezer Briefly Chimes in

He doesn’t engage much but says:

I sometimes mention the possibility of being stored and sold to aliens a billion years later, which seems to me to validly incorporate most all the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don’t expect Earthlings to think about validly.

Paul and Oliver

Oliver Habryka also replies to Paul, saying:

Might write a longer reply at some point, but the reason why I don’t expect “kindness” in AIs (as you define it here) is that I don’t expect “kindness” to be the kind of concept that is robust to cosmic levels of optimization pressure applied to it, and I expect will instead come apart when you apply various reflective principles and eliminate any status-quo bias, even if it exists in an AI mind (and I also think it is quite plausible that it is completely absent).

Like, different versions of kindness might or might not put almost all of their considerateness on all the different types of minds that could hypothetically exist, instead of the minds that currently exist right now. Indeed, I expect it’s more likely than not that I myself will end up in that moral equilibrium, and won’t be interested in extending any special consideration to systems that happened to have been alive in 2022, instead of the systems that could have been alive and seem cooler to me to extend consideration towards.

Another way to say the same thing is that if AI extends consideration towards something human-like, I expect that it will use some superstimuli-human-ideal as a reference point, which will be a much more ideal thing to be kind towards than current humans by its own lights (for an LLM this might be cognitive processes much more optimized for producing internet text than current humans, though that is really very speculative, and is more trying to illustrate the core idea of a superstimuli-human). I currently think few superstimuli-humans like this would still qualify by my lights to count as “human” (though it might by the lights of the AI).

I do find the game-theoretic and acausal trade case against AI killing literally everyone stronger, though it does depend on the chance of us solving alignment in the first place, and so feels a bit recursive in these conversations (like, in order for us to be able to negotiate with the AIs, there needs to be some chance we end up in control of the cosmic endowment in the first place, otherwise we don’t have anything to bargain with).

Paul’s first response to Habryka

Is this a fair summary?

Humans might respect the preferences of weak agents right now, but if they thought about it for longer they’d pretty robustly just want to completely destroy the existing agents (including a hypothetical alien creator) and replace them with something better. No reason to honor that kind of arbitrary path dependence.

If so, it seems like you wouldn’t be making an argument about AI or aliens at all, but rather an empirical claim about what would happen if humans were to think for a long time (and become more the people we wished to be and so on).

That seems like an important angle that my comment didn’t address at all. I personally don’t believe that humans would collectively stamp out 99% of their kindness to existing agents (in favor of utilitarian optimization) if you gave them enough time to reflect. That sounds like a longer discussion. I also think that if you expressed the argument in this form to a normal person they would be skeptical about the strong claims about human nature (and would be skeptical of doomer expertise on that topic), and so if this ends up being the crux it’s worth being aware of where the conversation goes and my bottom line recommendation of more epistemic humility may still be justified.

It’s hard to distinguish human kindness from arguably decision-theoretic reasoning like “our positions could have been reversed, would I want them to do the same to me?” but I don’t think the distinction between kindness and common-sense morality and decision theory is particularly important here except insofar as we want to avoid double-counting.

(This does call to mind another important argument that I didn’t discuss in my original comment: “kindness is primarily a product of moral norms produced by cultural accumulation and domestication, and there will be no analogous process amongst AI systems.” I have the same reaction as to the evolutionary psychology explanations. Evidently the resulting kindness extends beyond the actual participants in that cultural process, so I think you need to be making more detailed guesses about minds and culture and so on to have a strong a priori view between AI and humans.)

Habryka’s next reply:

Humans might respect the preferences of weak agents right now, but if they thought about it for longer they’d pretty robustly just want to completely destroy the existing agents (including a hypothetical alien creator) and replace them with something better. No reason to honor that kind of arbitrary path dependence.

No, this doesn’t feel accurate. What I am saying is more something like:

The way humans think about the question of “preferences for weak agents” and “kindness” feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of “having a continuous stream of consciousness with a good past and good future is important” to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.

The way this comes apart seems very chaotic to me, and dependent enough on the exact metaethical and cultural and environmental starting conditions that I wouldn’t be that surprised if I disagree even with other humans on their resulting conceptualization of “kindness” (and e.g. one endpoint might be that I end up not having a special preference for currently-alive beings, but there are thousands, maybe millions of ways for this concept to fray apart under optimization pressure).

In other words, I think it’s plausible that at something like human level of capabilities and within a roughly human ontology (which AIs might at least partially share, though how much is quite uncertain to me), the concept of kindness as assigning value to the extrapolated preferences of beings that currently exist might be a thing that an AI could share. But I expect it to not hold up under reflection, and much greater power, and predictable ontological changes (that I expect any AI to go through as it reaches superintelligence), so that the resulting reflectively stable and optimized idea of kindness will not meaningfully result in current humans’ genuine preferences being fulfilled (by my own lights of what it means to extrapolate and fulfill someone’s preferences). The space of possibilities in which this concept could fray apart seems quite great, and many of the endpoints are unlikely to align with my endpoints of this concept.


Edit (some more thoughts): The thing you said feels related to that in that I think my own pretty huge uncertainty about how I will relate to kindness on reflection is evidence that I think iterating on that concept will be quite chaotic and different for different minds.

I do want to push back on “in favor of utilitarian optimization”. That is not what I am saying, or at least it feels somewhat misleading.

I am saying that I think it’s pretty likely that upon reflection I no longer think that my “kindness” goals are meaningfully achieved by caring about the beings alive in 2022, and that it would be more kind, by my own lights, to not give special consideration to beings who happened to be alive right now. This isn’t about “trading off kindness in favor of utilitarian optimization”, it’s saying that when you point towards the thing in me that generates an instinct towards kindness, I can imagine that as I more fully realize what that instinct cashes out to in terms of preferences, that it will not result in actually giving consideration to e.g. rats that are currently alive, or would give consideration to some archetype of a rat that is actually not really that much like a rat, because I don’t even really know what it means for a rat to want something, and similarly the way the AI relates to the question of “do humans want things” will feel similarly underdetermined (and again, these are just concrete examples of how the concept could come apart, not trying to be an exhaustive list of ways the concept could fall apart).

Paul’s Second Response to Oliver:

I think some of the confusion here comes from my using “kind” to refer to “respecting the preferences of existing weak agents,” I don’t have a better handle but could have just used a made up word.

I don’t quite understand your objection to my summary—it seems like you are saying that notions like “kindness” (that might currently lead you to respect the preferences of existing agents) will come apart and change in unpredictable ways as agents deliberate. The result is that smart minds will predictably stop respecting the preferences of existing agents, up to and including killing them all to replace them with something that more efficiently satisfies other values (including whatever kind of form “kindness” may end up taking, e.g. kindness towards all the possible minds who otherwise won’t get to exist).

I called this utilitarian optimization but it might have been more charitable to call it “impartial” optimization. Impartiality between the existing creatures and the not-yet-created creatures seems like one of the key characteristics of utilitarianism while being very rare in the broader world. It’s also “utilitarian” in the sense that it’s willing to spare nothing (or at least not 1/​trillion) for the existing creatures, and this kind of maximizing stance is also one of the big defining features of utilitarianism. So I do still feel like “utilitarian” is an OK way of pointing at the basic difference between where you expect intelligent minds will end up vs how normal people think about concepts like being nice.

Habryka’s third reply:

I think some of the confusion here comes from my using “kind” to refer to “respecting the preferences of existing weak agents,” I don’t have a better handle but could have just used a made up word.

Yeah, sorry, I noticed the same thing a few minutes ago, that I was probably at least somewhat misled by the more standard meaning of kindness.

Tabooing “kindness” I am saying something like:

Yes, I don’t think extrapolated current humans assign approximately any value to the exact preference of “respecting the preferences of existing weak agents” and I don’t really believe that you would on-reflection endorse that preference either.

Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions, like ‘agent’ being a meaningful concept in the first place, or ‘existing’ or ‘weak’ or ‘preferences’, all of which I expect I would think are probably terribly confused concepts to use after I had understood the real concepts that carve reality more at its joints, and this means this sentence sounds deceptively simple or robust, but really doesn’t feel like the kind of thing whose meaning will stay simple as an AI does more conceptual refinement.

I called this utilitarian optimization but it might have been more charitable to call it “impartial” optimization. Impartiality between the existing creatures and the not-yet-created creatures seems like one of the key characteristics of utilitarianism while being very rare in the broader world. It’s also “utilitarian” in the sense that it’s willing to spare nothing (or at least not 1/​trillion) for the existing creatures, and this kind of maximizing stance is also one of the big defining features of utilitarianism. So I do still feel like “utilitarian” is an OK way of pointing at the basic difference between where you expect intelligent minds will end up vs how normal people think about concepts like being nice.

The reason why I objected to this characterization is that I was trying to point at a more general thing than the “impartialness”. Like, to paraphrase what this sentence sounds like to me, it’s more as if someone from a pre-modern era was arguing about future civilizations and said “It’s weird that your conception of future humans are willing to do nothing for the gods that live in the sky, and the spirits that make our plants grow”.

Like, after a bunch of ontological reflection and empirical data gathering, “gods” is just really not a good abstraction for things I care about anymore. I don’t think “impartiality” is what is causing me to not care about gods, it’s just that the concept of “gods” seems fake and doesn’t carve reality at its joints anymore. It’s also not the case that I don’t care at all about ancient gods anymore (they are pretty cool and I like the aesthetic), but the way I care about them is very different from how I care about other humans.

Not caring about gods doesn’t feel “harsh” or “utilitarian” or in some sense like I have decided to abandon any part of my values. This is what I expect it to feel like for a future human to look back at our meta-preferences for many types of other beings, and also what it feels like for AIs that maybe have some initial version of ‘caring about others’ when they are at similar capability levels to humans.

This again isn’t capturing my objection perfectly, but maybe helps point to it better.

Ryan Greenblatt then replies:

When I try to interpret your points here, I come to the conclusion that you think humans, upon reflection, would cause human extinction (in favor of resources being used for something else).

Or at least that many/​most humans would, upon reflection, prefer resources to be used for purposes other than preserving human life (including not preserving human life in simulation). And this holds even if (some of) the existing humans ‘want’ to be preserved (at least according to a conventional notion of preferences).

I think this empirical view seems pretty implausible.

That said, I think it’s quite plausible that upon reflection, I’d want to ‘wink out’ any existing copies of myself in favor of using resources for better things. But this is partially because I personally (in my current state) would endorse such a thing: if my extrapolated volition thought it would be better to not exist (in favor of other resource usage), my current self would accept that. And, I think it currently seems unlikely that upon reflection, I’d want to end all human lives (in particular, I think I probably would want to keep humans alive who had preferences against non-existence). This applies regardless of trade; it’s important to note this to avoid a ‘perpetual motion machine’ type argument.

Beyond this, I think that most or many humans or aliens would, upon reflection, want to preserve currently existing humans or aliens who had a preference against non-existence. (Again, regardless of trade.)

Additionally, I think it’s quite plausible that most or many humans or aliens will enact various trades or precommitments prior to reflecting (which is probably ill-advised, but it will happen regardless). So current preferences which aren’t stable under reflection might have a significant influence overall.

Vladimir Nesov says:

Zeroth approximation of pseudokindness is strict nonintervention, reifying the patient-in-environment as a closed computation and letting it run indefinitely, with some allocation of compute. Interaction with the outside world creates vulnerability to external influence, but then again so does incautious closed computation, as we currently observe with AI x-risk, which is not something beamed in from outer space.

Formulation of the kinds of external influences that are appropriate for a particular patient-in-environment is exactly the topic of membranes/​boundaries, this task can be taken as the defining desideratum for the topic. Specifically, the question of which environments can be put in contact with a particular membrane without corrupting it, hence why I think membranes are relevant to pseudokindness. Naturality of the membranes/​boundaries abstraction is linked to naturality of the pseudokindness abstraction.

In contrast, the language of preferences/​optimization seems to be the wrong frame for formulating pseudokindness, it wants to discuss ways of intervening and influencing, of not leaving value on the table, rather than ways of offering acceptable options that avoid manipulation. It might be possible to translate pseudokindness back into the language of preferences, but this translation would induce a kind of deontological prior on preferences that makes the more probable preferences look rather surprising/​unnatural from a more preferences-first point of view.

There were a bunch more comments, but this feels like a reasonable stopping place for priming the “previous discussion” pump.

  1. I believe Eliezer later wrote a Twitter thread where he said he expects [something like kindness] to be somewhat common among evolved creatures, but ~0 for AIs trained the way we currently do. I don’t have the link offhand but if someone finds it I’ll edit it in.