paulfchristiano comments on Cosmopolitan values don’t come free

paulfchristiano Jun 1, 2023, 12:00 AM
86 points
24
Eliezer has a longer explanation of his view here.
My understanding of his argument is: there are a lot of contingencies that reflect how and whether humans are kind. Because there are so many contingencies, it is somewhat unlikely that aliens would go down a similar route, and essentially impossible for ML. So maybe aliens have a 5% probability of being nice and ML systems have ~0% probability of being nice. I think this argument is just talking about why we shouldn’t have update too much from humans, and there is an important background assumption that kindness is super weird and so won’t be produced very often by other processes, i.e. the only reason to think it might happen is that it happened in the single case we observed.
I find this pretty unconvincing. He lists like 10 things (humans need to trade favors, we’re not smart enough to track favors and kinship explicitly, and we tend to be allied with nearby humans so want to be nice to those around us, we use empathy to model other humans, and we had religion and moral realism for contingent reasons, we weren’t optimized too much once we were smart enough that our instrumental reasoning screens off kindness heuristics).
But no argument is given for why these are unusually kindness-inducing settings of the variables. And the outcome isn’t like a special combination of all of them, they each seem like factors that contribute randomly. It’s just a lot of stuff mixing together.
Presumably there is no process that ensures humans have lots of kindness-inducing features (and we didn’t select kindness as a property for which humans were notable, we’re just asking the civilization-independent question “does our AI kill us”). So if you list 10 random things that make humans more kind, it strongly suggests that other aliens will also have a bunch of random things that make them more kind. It might not be 10, and the net effect might be larger or smaller. But:
- I have no idea whatsoever how you are anchoring this distribution, and giving it a narrow enough spread to have confident predictions.
- Statements like “kindness is super weird” are wildly implausible if you’ve just listed 5 independent plausible mechanisms for generating kindness. You are making detailed quantitative guesses here, not ruling something out for any plausible a priori reason.
As a matter of formal reasoning, listing more and more contingencies that combine apparently-additively tends to decrease rather than increase the variance of kindness across the population. If there was just a single random thing about humans that drove kindness it would be more plausible that we’re extreme. If you are listing 10 things then things are going to start averaging out (and you expect that your 10 things are cherry-picked to be the ones most relevant to humans, but you can easily list 10 more candidates).
In fact it’s easy to list analogous things that could apply to ML (and I can imagine the identical conversation where hypothetical systems trained by ML are talking about how stupid it is to think that evolved life could end up being kind). Most obviously, they are trained in an environment where being kind to humans is a very good instrumental strategy. But they are also trained to closely imitate humans who are known to be kind, they’ve been operating in a social environment where they are very strongly expected to appear to be kind, etc. Eliezer seems to believe this kind of thing gets you “ice cream and condoms” instead of kindness OOD, but just one sentence ago he explained why similar (indeed, superficially much weaker!) factors led to humans retaining niceness out of distribution. I just don’t think we have the kind of a priori asymmetry or argument here that would make you think humans are way kinder than models. Yeah it can get you to ~50% or even somewhat lower, but ~0% seems like a joke.
There was one argument that I found compelling, which I would summarize as: humans were optimized while they were dumb. If evolution had kept optimizing us while we got smart, eventually we would have stopped being so kind. In ML we just keep on optimizing as the system gets smart. I think this doesn’t really work unless being kind is a competitive disadvantage for ML systems on the training distribution. But I do agree that if if you train your AI long enough on cases where being kind is a significant liability, it will eventually stop being kind.
What links here?
- Being nicer than Clippy by Joe Carlsmith (Jan 16, 2024, 7:44 PM; 109 points)
- Being nicer than Clippy by Joe_Carlsmith (EA Forum; Jan 16, 2024, 7:44 PM; 26 points)
- So8res Jun 1, 2023, 10:10 PM
  40 points
  4
  Parent
  Short version: I don’t buy that humans are “micro-pseudokind” in your sense; if you say “for just $5 you could have all the fish have their preferences satisfied” I might do it, but not if I could instead spend $5 on having the fish have their preferences satisfied in a way that ultimately leads to them ascending and learning the meaning of friendship, as is entangled with the rest of my values.
  
  Meta:
  
  Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that’s likely to be a mistake even if it doesn’t lead to billions of deaths.
  
  So for starters, thanks for making acknowledgements about places we apparently agree, or otherwise attempting to demonstrate that you’ve heard my point before bringing up other points you want to argue about. (I think this makes arguments go better.) (I’ll attempt some of that myself below.)
  
  Secondly, note that it sounds to me like you took a diametric-opposite reading of some of my intended emotional content (which I acknowledge demonstrates flaws in my writing). For instance, I intended the sentence “At that very moment they hear the dinging sound of an egg-timer, as the next-token-predictor ascends to superintelligence and bursts out of its confines” to be a caricature so blatant as to underscore the point that I wasn’t making arguments about takeoff speeds, but was instead focusing on the point about “complexity” not being a saving grace (and “monomaniacalism” not being the issue here). (Alternatively, perhaps I misunderstand what things you call the “emotional content” and how you’re reading it.)
  
  Thirdly, I note that for whatever it’s worth, when I go to new communities and argue this stuff, I don’t try to argue people into >95% change we’re all going to die in <20 years. I just try to present the arguments as I see them (without hiding the extremity of my own beliefs, nor while particularly expecting to get people to a similarly-extreme place with, say, a 30min talk). My 30min talk targets are usually something more like “>5% probability of existential catastrophe in <20y”. So insofar as you’re like “I’m aiming to get you to stop arguing so confidently for death given takeover”, you might already have met your aims in my case.
  
  (Or perhaps not! Perhaps there’s plenty of emotional-content leaking through given the extremity of my own beliefs, that you find particularly detrimental. To which the solution is of course discussion on the object-level, which I’ll turn to momentarily.)
  
  Object:
  
  First, I acknowledge that if an AI cares enough to spend one trillionth of its resources on the satisfaction of fulfilling the preferences of existing “weak agents” in precisely the right way, then there’s a decent chance that current humans experience an enjoyable future.
  
  With regards to your arguments about what you term “kindness” and I shall term “pseudokindness” (on account of thinking that “kindness” brings too much baggage), here’s a variety of places that it sounds like we might disagree:
  - Pseudokindness seems underdefined, to me, and I expect that many ways of defining it don’t lead to anything like good outcomes for existing humans.
    
    Suppose the AI is like “I am pico-pseudokind; I will dedicate a trillionth of my resources to satisfying the preferences of existing weak agents by granting those existing weak agents their wishes”, and then only the most careful and conscientious humans manage to use those wishes in ways that leave them alive and well.
    There are lots and lots of ways to “satisfy the preferences” of the “weak agents” that are humans. Getting precisely the CEV (or whatever it should be repaired into) is a subtle business. Most humans probably don’t yet recognize that they could or should prefer taking their CEV over various more haphazard preference-fulfilments that ultimately leave them unrecognizable and broken. (Or, consider what happens when a pseudokind AI encounters a baby, and seeks to satisfy its preferences. Does it have the baby age?)
    You’ve got to do some philosophy to satisfy the preferences of humans correctly. And the issue isn’t that the AI couldn’t solve those philosophy problems correctly-according-to-us, it’s that once we see how wide the space of “possible ways to be pseudokind” is, then “pseudokind in the manner that gives us our CEVs” starts to feel pretty narrow against “pseudokind in the manner that fulfills our revealed preferences, or our stated preferences, or the poorly-considered preferences of philosophically-immature people, or whatever”.
  - I doubt that humans are micro-pseudokind, as defined. And so in particular, all your arguments of the form “but we’ve seen it arise once” seem suspect to me.
    
    Like, suppose we met fledgeling aliens, and had the opportunity to either fulfil their desires, or leave them alone to mature, or affect their development by teaching them the meaning of friendship. My guess is that we’d teach them the meaning of friendship. I doubt we’d hop in and fulfil their desires.
    (Perhaps you’d counter with something like: well if it was super cheap, we might make two copies of the alien civilization, and fulfil one’s desires and teach the other the meaning of friendship. I’m skeptical, for various reasons.)
    More generally, even though “one (mill|trill)ionth” feels like a small fraction, the obvious ways to avoid dedicating even a (mill|trill)ionth of your resources to X is if X is right near something even better that you might as well spend the resources on instead.
    There’s all sorts of ways to thumb the scales in how a weak agent develops, and there’s many degrees of freedom about what counts as a “pseudo-agent” or what counts as “doing justice to its preferences”, and my read is that humans take one particular contingent set of parameters here and AIs are likely to take another (and that the AI’s other-settings are likely to lead to behavior not-relevantly-distinct from killing everyone).
    My read is than insofar as humans do have preferences about doing right by other weak agents, they have all sorts of desire-to-thumb-the-scales mixed in (such that humans are not actually pseudokind, for all that they might be kind).
  - I have a more-difficult-to-articulate sense that “maybe the AI ends up pseudokind in just the right way such that it gives us a (small, limited, ultimately-childless) glorious transhumanist future” is the sort of thing that reality gets to say “lol no” to, once you learn more details about how the thing works internally.
  Most of my argument here is that “the space of ways things can end “caring” about the “preferences” of “weak agents” is wide, and most points within it don’t end up being our point in it, and optimizing towards most points in it doesn’t end up keeping us around at the extremes. My guess is mostly that the space is so wide that you don’t even end up with AIs warping existing humans into unrecognizable states, but do in fact just end up with the people dead (modulo distant aliens buying copies, etc).
  
  I haven’t really tried to quantify how confident I am of this; I’m not sure whether I’d go above 90%, \shrug.
  
  It occurs to me that one possible source of disagreement here is, perhaps you’re trying to say something like:
  
  Nate, you shouldn’t go around saying “if we don’t competently intervene, literally everybody will die” with such a confident tone, when you in fact think there’s a decent chance of scenarios where the AIs keep people around in some form, and make some sort of effort towards fulfilling their desires; most people don’t care about the cosmic endowment like you do; the bluntly-honest and non-manipulative thing to say is that there’s a decent chance they’ll die and a better chance that humanity will lose the cosmic endowment (as you care about more than they do),
  
  whereas my stance has been more like
  
  most people I meet are skeptical that uploads count as them; most people would consider scenarios where their bodies are destroyed by rapid industrialization of Earth but a backup of their brain is stored and then later run in simulation (where perhaps it’s massaged into an unrecognizable form, or kept in an alien zoo, or granted a lovely future on account of distant benefactors, or …) to count as “death”; and also those exotic scenarios don’t seem all that likely to me, so it hasn’t seemed worth caveating.
  
  I’m somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.
  
  I’m considering adding footnotes like “note that when I say “I expect everyone to die”, I don’t necessarily mean “without ever some simulation of that human being run again”, although I mostly don’t think this is a particularly comforting caveat”, in the relevant places. I’m curious to what degree that would satisfy your aims (and I welcome workshopped wording on the footnotes, as might both help me make better footnotes and help me understand better where you’re coming from).
  What links here?
  - paulfchristiano Jun 2, 2023, 5:19 PM
    48 points
    27
    Parent
    I disagree with this but am happy your position is laid out. I’ll just try to give my overall understanding and reply to two points.
    Like Oliver, it seems like you are implying:
    Humans may be nice to other creatures in some sense, But if the fish were to look at the future that we’d achieve for them using the 1/billionth of resources we spent on helping them, it would be as objectionable to them as “murder everyone” is to us.
    I think that normal people being pseudokind in a common-sensical way would instead say:
    If we are trying to help some creatures, but those creatures really dislike the proposed way we are “helping” them, then we should try a different tactic for helping them.
    I think that some utilitarians (without reflection) plausibly would “help the humans” in a way that most humans consider as bad as being murdered. But I think this is an unusual feature of utilitarians, and most people would consult the beneficiaries, observe they don’t want to be murdered, and so not murder them.
    I think that saying “Helping someone in a way they like, sufficiently precisely to avoid things like murdering them, requires precisely the right form of caring—and that’s super rare” is a really misleading sense of how values work and what targets are narrow. I think this is more obvious if you are talking about how humans would treat a weaker species. If that’s the state of the disagreement I’m happy to leave it there.
    I’m somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.
    This is an important distinction at 1/trillion levels of kindness, but at 1/billion levels of kindness I don’t even think the humans have to die.
    What links here?
    ryan_greenblatt's comment on MIRI 2024 Communications Strategy by Gretta Duleba (May 29, 2024, 10:18 PM; 11 points)
    ryan_greenblatt's comment on MIRI 2024 Communications Strategy by Gretta Duleba (May 30, 2024, 5:55 PM; 4 points)
    - So8res Jun 2, 2023, 5:46 PM
      9 points
      4
      Parent
      
      If we are trying to help some creatures, but those creatures really dislike the proposed way we are “helping” them, then we should do something else.
      
      My picture is less like “the creatures really dislike the proposed help”, and more like “the creatures don’t have terribly consistent preferences, and endorse each step of the chain, and wind up somewhere that they wouldn’t have endorsed if you first extrapolated their volition (but nobody’s extrapolating their volition or checking against that)”.
      
      It sounds to me like your stance is something like “there’s a decent chance that most practically-buildable minds pico-care about correctly extrapolating the volition of various weak agents and fulfilling that extrapolated volition”, which I am much more skeptical of than the weaker “most practically-buildable minds pico-care about satisfying the preferences of weak agents in some sense”.
      - paulfchristiano Jun 2, 2023, 6:00 PM
        30 points
        11
        Parent
        We’re not talking about practically building minds right now, we are talking about humans.
        We’re not talking about “extrapolating volition” in general. We are talking about whether—in attempting to help a creature with preferences about as coherent as human preferences—you end up implementing an outcome that creature considers as bad as death.
        For example, we are talking about what would happen if humans were trying to be kind to a weaker species that they had no reason to kill, that could nevertheless communicate clearly and had preferences about as coherent as human preferences (while being very alien).
        And those creatures are having a conversation amongst themselves before the humans arrive wondering “Are the humans going to murder us all?” And one of them is saying “I don’t know, they don’t actually benefit from murdering us and they seem to care a tiny bit about being nice, maybe they’ll just let us do our thing with 1/trillionth of the universe’s resources?” while another is saying “They will definitely have strong opinions about what our society should look like and the kind of transformation they implement is about as bad by our lights as being murdered.”
        In practice attempts to respect someone’s preferences often involve ideas like autonomy and self-determination and respect for their local preferences. I really don’t think you have to go all the way to extrapolated volition in order to avoid killing everyone.
        So8res Jun 2, 2023, 7:01 PM
        12 points
        0
        Parent
        Is this a reasonable paraphrase of your argument?
        
        Humans wound up caring at least a little about satisfying the preferences of other creatures, not in a “grant their local wishes even if that ruins them” sort of way but in some other intuitively-reasonable manner.
        
        Humans are the only minds we’ve seen so far, and so having seen this once, maybe we start with a 50%-or-so chance that it will happen again.
        
        You can then maybe drive this down a fair bit by arguing about how the content looks contingent on the particulars of how humans developed or whatever, and maybe that can drive you down to 10%, but it shouldn’t be able to drive you down to 0.1%, especially not if we’re talking only about incredibly weak preferences.
        
        If so, one guess is that a bunch of disagreement lurks in this “intuitively-reasonable manner” business.
        
        A possible locus of disagreemet: it looks to me like, if you give humans power before you give them wisdom, it’s pretty easy to wreck them while simply fulfilling their preferences. (Ex: lots of teens have dumbass philosophies, and might be dumb enough to permanently commit to them if given that power.)
        
        More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfil certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.
        
        (I separately expect that if we were doing something more like the volition-extrapolation thing, we’d be tempted to bend the process towards “and they learn the meaning of friendship”.)
        
        That said, this conversation is updating me somewhat towards “a random UFAI would keep existing humans around and warp them in some direction it prefers, rather than killing them”, on the grounds that the argument “maybe preferences-about-existing-agents is just a common way for rando drives to shake out” plausibly supports it to a threshold of at least 1 in 1000. I’m not sure where I’ll end up on that front.
        
        Another attempt at naming a crux: It looks to me like you see this human-style caring about others’ preferences as particularly “simple” or “natural”, in a way that undermines “drawing a target around the bullseye”-type arguments, whereas I could see that argument working for “grant all their wishes (within a budget)” but am much more skeptical when it comes to “do right by them in an intuitively-reasonable way”.
        
        (But that still leaves room for an update towards “the AI doesn’t necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or or suchlike, as might be the sort of whims that rando drives shake out into”, which I’ll chew on.)
        Joey KL Jun 2, 2023, 8:12 PM
        16 points
        6
        Parent
        More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfill certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.
        Isn’t the worst case scenario just leaving the aliens alone? If I’m worried I’m going to fuck up some alien’s preferences, I’m just not going to give them any power or wisdom!
        I guess you think we’re likely to fuck up the alien’s preferences by light of their reflection process, but not our reflection process. But this just recurs to the meta level. If I really do care about an alien’s preferences (as it feels like I do), why can’t I also care about their reflection process (which is just a meta preference)?
        I feel like the meta level at which I no longer care about doing right by an alien is basically the meta level at which I stop caring about someone doing right by me. In fact, this is exactly how it seems mentally constructed: what I mean by “doing right by [person]” is “what that person would mean by ‘doing right by me’”. This seems like either something as simple as it naively looks, or sensitive to weird hyperparameters I’m not sure I care about anyway.
        Daniel Kokotajlo Jun 4, 2023, 3:26 PM
        7 points
        1
        Parent
        (But that still leaves room for an update towards “the AI doesn’t necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or or suchlike, as might be the sort of whims that rando drives shake out into”, which I’ll chew on.)
        FWIW this is my view. (Assuming no ECL/MSR or acausal trade or other such stuff. If we add those things in, the situation gets somewhat better in expectation I think, because there’ll be trades with faraway places that DO care about our CEV.)
        Eric Zhang Jun 5, 2023, 6:25 AM
        4 points
        2
        Parent
        My reading of the argument was something like “bullseye-target arguments refute an artificially privileged target being rated significantly likely under ignorance, e.g. the probability that random aliens will eat ice cream is not 50%. But something like kindness-in-the-relevant-sense is the universal problem faced by all evolved species creating AGI, and is thus not so artificially privileged, and as a yes-no question about which we are ignorant the uniform prior assigns 50%”. It was more about the hypothesis not being artificially privileged by path-dependent concerns than the notion being particularly simple, per se.
  - Eliezer Yudkowsky Jun 2, 2023, 5:27 AM
    40 points
    8
    Parent
    I sometimes mention the possibility of being stored and sold to aliens a billion years later, which seems to me to validly incorporate most all the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don’t expect Earthlings to think about validly.
    What links here?
    ryan_greenblatt's comment on MIRI 2024 Communications Strategy by Gretta Duleba (May 29, 2024, 10:18 PM; 11 points)
  - astridain Jun 2, 2023, 6:08 PM
    7 points
    2
    Parent
    My guess is mostly that the space is so wide that you don’t even end up with AIs warping existing humans into unrecognizable states, but do in fact just end up with the people dead
    Why? I see a lot of opportunities for s-risk or just generally suboptimal future in such options, but “we don’t want to die, or at any rate we don’t want to die out as a species” seems like an extremely simple, deeply-ingrained goal that almost any metric by which the AI judges our desires should be expected to pick up, assuming it’s at all pseudokind. (In many cases, humans do a lot to protect endangered species even as we do diddly-squat to fulfill individual specimens’ preferences!)
- So8res Jun 1, 2023, 10:15 PM
  16 points
  2
  Parent
  Some more less-important meta, that is in part me writing out of frustration from how the last few exchanges have gone:
  
  I’m not quite sure what argument you’re trying to have here. Two explicit hypotheses follow, that I haven’t managed to distinguish between yet.
  
  Background context, for establishing common language etc.:
  - Nate is trying to make a point about inclusive cosmopolitan values being a part of the human inheritance, and not universally compelling.
  - Paul is trying to make a point about how there’s a decent chance that practical AIs will plausibly care at least a tiny amount about the fulfillment of the preferences of existing “weak agents”, herein called “pico-pseudokindness”.
  Hypothesis 1: Nate’s trying to make a point about cosmopolitan values that Paul basically agrees with. But Paul thinks Nate’s delivery gives a wrong impression about the tangentially-related question of pico-pseudokindness, probably because (on Paul’s model) Nate’s wrong about pico-pseudokindness, and Paul is taking the opportunity to argue about it.
  
  Hypothesis 2: Nate’s trying to make a point about cosmopolitan values that Paul basically disagrees with. Paul maybe agrees with all the literal words, but thinks that Nate has misunderstood the connection between pico-pseudokindness and cosmopolitan values, and is hoping to convince Nate that these questions are more than tangentially related.
  
  (Or, well, I have hypothesis-cluster rather than hypotheses, of which these are two representatives, whatever.)
  
  Some notes that might help clear some things up in that regard:
  - The long version of the title here is not “Cosmopolitan values don’t come cheap”, but rather “Cosmopolitan values are also an aspect of human values, and are not universally compelling”.
  - I think there’s a common mistake that people outside our small community make, where they’re like “whatever the AIs decide to do, turns out to be good, so long as they decide it while they’re smart; don’t be so carbon-chauvinist and anthropocentric”. A glaring example is Richard Sutton. Heck, I think people inside our community make it decently often, with an example being Robin Hanson.
    My model is that many of these people are intuiting that “whatever the AIs decide to do” won’t include vanilla ice cream, but will include broad cosmopolitan value.
    It seems worth flatly saying “that’s a crux for me; if I believed that the AIs would naturally have broad inclusive cosmopolitan values then I’d be much more onboard the acceleration train; when I say that the AIs won’t have our values I am not talking just about the “ice cream” part I am also talking about the “broad inclusive cosmopolitan dream” part; I think that even that is at risk”.
  If you were to acknowledge something like “yep, folks like Sutton and Hanson are making the mistake you name here, and the broad cosmopolitan dream is very much at risk and can’t be assumed as convergent, but separately you (Nate) seem to be insinuating that you expect it’s hard to get the AIs to care about the broad cosmopolitan dream even a tiny bit, and that it definitely won’t happen by chance, and I want to fight about that here”, then I’d feel like I understood what argument we were having (namely: hypothesis 1 above).
  
  If you were to instead say something like “actually, Nate, I think that these people are accessing a pre-theoretic intuition that’s essentially reasonable, and that you’ve accidentally destroyed with all your premature theorizing, such that I don’t think you should be so confident in your analysis that folk like Sutton and Hanson are making a mistake in this regard”, then I’d also feel like I understood what argument we were having (namely: hypothesis 2 above).
  
  Alternatively, perhaps my misunderstanding runs even deeper, and the discussion you’re trying to have here comes from even farther outside my hypothesis space.
  
  For one reason or another, I’m finding it pretty frustrating to attempt to have this conversation while not knowing which of the above conversations (if either) we’re having. My current guess is that that frustration would ease up if something like hypothesis-1 were true and you made some acknowledgement like the above. (I expect to still feel frustrated in the hypothesis-2 case, though I’m not yet sure why, but might try to tease it out if that turns out to be reality.)
  What links here?
  - What are the best arguments for/against AIs being “slightly ‘nice’”? by Raemon (Sep 24, 2024, 2:00 AM; 99 points)
  - paulfchristiano Jun 2, 2023, 5:43 PM
    16 points
    6
    Parent
    Hypothesis 1 is closer to the mark, though I’d highlight that it’s actually fairly unclear what you mean by “cosmopolitan values” or exactly what claim you are making (and that ambiguity is hiding most of the substance of disagreements).
    I’m raising the issue of pico-pseudokindness here because I perceive it as (i) an important undercurrent in this post, (ii) an important part of the actual disagreements you are trying to address. (I tried to flag this at the start.)
    More broadly, I don’t really think you are engaging productively with people who disagree with you. I suspect that if you showed this post to someone you perceive yourself to be arguing with, they would say that you seem not to understand the position—the words aren’t really engaging with their view, and the stories aren’t plausible on their models of the world but in ways that go beyond the literal claim in the post.
    I think that would hold in particular for Robin Hanson or Rich Sutton. I don’t think they are accessing a pre-theoretic intuition that you are discarding by premature theorizing. I think the better summary is that you don’t understand their position very well or are choosing not to engage with the important parts of it. (Just as Robin doesn’t seem to understand your position ~at all.)
    I don’t think the point about pico-pseudokindness is central for either Robert Hanson or Rich Sutton. I think it is more obviously relevant to a bunch of recent arguments Eliezer has gotten into on Twitter.
    - So8res Jun 2, 2023, 5:54 PM
      3 points
      0
      Parent
      Thanks! I’m curious for your paraphrase of the opposing view that you think I’m failing to understand.
      
      (I put >50% probability that I could paraphrase a version of “if the AIs decide to kill us, that’s fine” that Sutton would basically endorse (in the right social context), and that would basically route through a version of “broad cosmopolitan value is universally compelling”, but perhaps when you give a paraphrase it will sound like an obviously-better explanation of the opposing view and I’ll update.)
      - paulfchristiano Jun 2, 2023, 6:21 PM
        16 points
        1
        Parent
        I think a closer summary is:
        Humans and AI systems probably want different things. From the human perspective, it would be better if the universe was determined by what the humans wanted. But we shouldn’t be willing to pay huge costs, and shouldn’t attempt to create a slave society where AI systems do humans’ bidding forever, just to ensure that human values win out. After all, we really wouldn’t want that outcome if our situations had been reversed. And indeed we are the beneficiary of similar values-turnover in the past, as our ancestors have been open (perhaps by necessity rather than choice) to values changes that they would sometimes prefer hadn’t happened.
        We can imagine really sterile outcomes, like replicators colonizing space with an identical pattern repeated endlessly, or AI systems that want to maximize the number of paperclips. And considering those outcomes can help undermine the cosmopolitan intuition that we should respect the AI we build. But in fact that intuition pump relies crucially on its wildly unrealistic premises, that the kind of thing brought about by AI systems will be sterile and uninteresting. If we instead treat “paperclip” as an analog for some crazy weird shit that is alien and valence-less to humans, drawn from the same barrel of arbitrary and diverse desires that can be produced by selection processes, then the intuition pump loses all force. I’m back to feeling like our situations could have been reversed, and we shouldn’t be total assholes to the AI.
        I don’t think that requires anything at all about AI systems converging to cosmopolitan values in the sense you are discussing here. I do think it is much more compelling if you accept some kind of analogy between the sorts of processes shaping human values and the processes shaping AI values, but this post (and the references you cite and other discussions you’ve had) don’t actually engage with the substance of that analogy and the kinds of issues raised in my comment are much closer to getting at the meat of the issue.
        I also think the “not for free” part doesn’t contradict the views of Rich Sutton. I asked him this question and he agrees that all else equal it would be better if we handed off to human uploads instead of powerful AI. I think his view is that the proposed course of action from the alignment community is morally horrifying (since in practice he thinks the alternative is “attempt to have a slave society,” not “slow down AI progress for decades”—I think he might also believe that stagnation is much worse than a handoff but haven’t heard his view on this specifically) and that even if you are losing something in expectation by handing the universe off to AI systems it’s not as bad as the alternative.
        What links here?
        Does AI risk “other” the AIs? by Joe Carlsmith (Jan 9, 2024, 5:51 PM; 59 points)
        Does AI risk “other” the AIs? by Joe_Carlsmith (EA Forum; Jan 9, 2024, 5:51 PM; 23 points)
        So8res Jun 2, 2023, 6:29 PM
        2 points
        0
        Parent
        Thanks! Seems like a fine summary to me, and likely better than I would have done, and it includes a piece or two that I didn’t have (such as an argument from symmetry if the situations were reversed). I do think I knew a bunch of it, though. And e.g., my second parable was intended to be a pretty direct response to something like
        
        If we instead treat “paperclip” as an analog for some crazy weird shit that is alien and valence-less to humans, drawn from the same barrel of arbitrary and diverse desires that can be produced by selection processes, then the intuition pump loses all force.
        
        where it’s essentially trying to argue that this intuition pump still has force in precisely this case.
        paulfchristiano Jun 2, 2023, 6:41 PM
        4 points
        2
        Parent
        To the extent the second parable has this kind of intuitive force I think it comes from: (i) the fact that the resulting values still sound really silly and simple (which I think is mostly deliberate hyperbole), (ii) the fact that the AI kills everyone along the way.
- Max H Jun 1, 2023, 12:45 AM
  1 point
  0
  Parent
  This comment changed my mind on the probability that evolved aliens are likely to end up kind, which I now think is somewhat more likely than 5%. I still think AI systems are unlikely to have kindness, for something like the reason you give at the end:
  In ML we just keep on optimizing as the system gets smart. I think this doesn’t really work unless being kind is a competitive disadvantage for ML systems on the training distribution.
  
  I actually think it’s somewhat likely that ML systems won’t value kindness at all before they are superhuman enough to take over. I expect kindness as a value within the system itself not to arise spontaneously during training, and that no one will succeed at eliciting it deliberately before take over. (The outward behavior of the system may appear to be kind, and mechanistic interpretability may show that some internal component of the system has a correct understanding of kindness. But that’s not the same as the system itself valuing kindness the way that humans do or aliens might.)