Communications @ MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer’s. (Though we agree about an awful lot.)
Rob Bensinger
Note that “everyone will be killed (or worse)” is a different claim from “everyone will be killed”! (And see Oliver’s point that Ryan isn’t talking about mistreated brain scans.)
Some of the other things you suggest, like future systems keeping humans physically alive, do not seem plausible to me.
I agree with Gretta here, and I think this is a crux. If MIRI folks thought it were likely that AI will leave a few humans biologically alive (as opposed to information-theoretically revivable), I don’t think we’d be comfortable saying “AI is going to kill everyone”. (I encourage other MIRI folks to chime in if they disagree with me about the counterfactual.)
I also personally have maybe half my probability mass on “the AI just doesn’t store any human brain-states long-term”, and I have less than 1% probability on “conditional on the AI storing human brain-states for future trade, the AI does in fact encounter aliens that want to trade and this trade results in a flourishing human civilization”.
FWIW I do think “don’t trust this guy” is warranted; I don’t know that he’s malicious, but I think he’s just exceptionally incompetent relative to the average tech reporter you’re likely to see stories from.
Like, in 2018 Metz wrote a full-length article on smarter-than-human AI that included the following frankly incredible sentence:
During a recent Tesla earnings call, Mr. Musk, who has struggled with questions about his company’s financial losses and concerns about the quality of its vehicles, chastised the news media for not focusing on the deaths that autonomous technology could prevent — a remarkable stance from someone who has repeatedly warned the world that A.I. is a danger to humanity.
FWIW, Cade Metz was reaching out to MIRI and some other folks in the x-risk space back in January 2020, and I went to read some of his articles and came to the conclusion in January that he’s one of the least competent journalists—like, most likely to misunderstand his beat and emit obvious howlers—that I’d ever encountered. I told folks as much at the time, and advised against talking to him just on the basis that a lot of his journalism is comically bad and you’ll risk looking foolish if you tap him.
This was six months before Metz caused SSC to shut down and more than a year before his hit piece on Scott came out, so it wasn’t in any way based on ‘Metz has been mean to my friends’ or anything like that. (At the time he wasn’t even asking around about SSC or Scott, AFAIK.)
(I don’t think this is an idiosyncratic opinion of mine, either; I’ve seen other non-rationalists I take seriously flag Metz as someone unusually out of his depth and error-prone for a NYT reporter, for reporting unrelated to SSC stuff.)
Sounds like a lot of political alliances! (And “these two political actors are aligned” is arguably an even weaker condition than “these two political actors are allies”.)
At the end of the day, of course, all of these analogies are going to be flawed. AI is genuinely a different beast.
It’s pretty sad to call all of these end states you describe alignment as alignment is an extremely natural word for “actually terminally has good intentions”.
Aren’t there a lot of clearer words for this? “Well-intentioned”, “nice”, “benevolent”, etc.
(And a lot of terms, like “value loading” and “value learning”, that are pointing at the research project of getting good intentions into the AI.)
To my ear, “aligned person” sounds less like “this person wishes the best for me”, and more like “this person will behave in the right ways”.
If I hear that Russia and China are “aligned”, I do assume that their intentions play a big role in that, but I also assume that their circumstances, capabilities, etc. matter too. Alignment in geopolitics can be temporary or situational, and it almost never means that Russia cares about China as much as China cares about itself, or vice versa.
And if we step back from the human realm, an engineered system can be “aligned” in contexts that have nothing to do with goal-oriented behavior, but are just about ensuring components are in the right place relative to each other.
Cf. the history of the term “AI alignment”. From my perspective, a big part of why MIRI coordinated with Stuart Russell to introduce the term “AI alignment” was that we wanted to switch away from “Friendly AI” to a term that sounded more neutral. “Friendly AI research” had always been intended to subsume the full technical problem of making powerful AI systems safe and aimable; but emphasizing “Friendliness” made it sound like the problem was purely about value loading, so a more generic and low-content word seemed desirable.
But Stuart Russell (and later Paul Christiano) had a different vision in mind for what they wanted “alignment” to be, and MIRI apparently failed to communicate and coordinate with Russell to avoid a namespace collision. So we ended up with a messy patchwork of different definitions.
I’ve basically given up on trying to achieve uniformity on what “AI alignment” is; the best we can do, I think, is clarify whether we’re talking about “intent alignment” vs. “outcome alignment” when the distinction matters.
But I do want to push back against those who think outcome alignment is just an unhelpful concept — on the contrary, if we didn’t have a word for this idea I think it would be very important to invent one.
IMO it matters more that we keep our eye on the ball (i.e., think about the actual outcomes we want and keep researchers’ focus on how to achieve those outcomes) than that we define an extremely crisp, easily-packaged technical concept (that is at best a loose proxy for what we actually want). Especially now that ASI seems nearer at hand (so the need for this “keep our eye on the ball” skill is becoming less and less theoretical), and especially now that ASI disaster concerns have hit the mainstream (so the need to “sell” AI risk has diminished somewhat, and the need to direct research talent at the most important problems has increased).
And I also want to push back against the idea that a priori, before we got stuck with the current terminology mess, it should have been obvious that “alignment” is about AI systems’ goals and/or intentions, rather than about their behavior or overall designs. I think intent alignment took off because Stuart Russell and Paul Christiano advocated for that usage and encouraged others to use the term that way, not because this was the only option available to us.
“Should” in order to achieve a certain end? To meet some criterion? To boost a term in your utility function?
In the OP: “Should” in order to have more accurate beliefs/expectations. E.g., I should anticipate (with high probability) that the Sun will rise tomorrow in my part of the world, rather than it remaining night.
Why would the laws of physics conspire to vindicate a random human intuition that arose for unrelated reasons?
We do agree that the intuition arose for unrelated reasons, right? There’s nothing in our evolutionary history, and no empirical observation, that causally connects the mechanism you’re positing and the widespread human hunch “you can’t copy me”.
If the intuition is right, we agree that it’s only right by coincidence. So why are we desperately searching for ways to try to make the intuition right?
It also doesn’t force us to believe that a bunch of water pipes or gears functioning as a classical computer can ever have our own first person experience.
Why is this an advantage of a theory? Are you under the misapprehension that “hypothesis H allows humans to hold on to assumption A” is a Bayesian update in favor of H even when we already know that humans had no reason to believe A? This is another case where your theory seems to require that we only be coincidentally correct about A (“sufficiently complex arrangements of water pipes can’t ever be conscious”), if we’re correct about A at all.
One way to rescue this argument is by adding in an anthropic claim, like: “If water pipes could be conscious, then nearly all conscious minds would be instantiated in random dust clouds and the like, not in biological brains. So given that we’re not Boltzmann brains briefly coalescing from space dust, we should update that giant clouds of space dust can’t be conscious.”
But is this argument actually correct? There’s an awful lot of complex machinery in a human brain. (And the same anthropic argument seems to suggest that some of the human-specific machinery is essential, else we’d expect to be some far-more-numerous observer, like an insect.) Is it actually that common for a random brew of space dust to coalesce into exactly the right shape, even briefly?
Yeah, at some point we’ll need a proper theory of consciousness regardless, since many humans will want to radically self-improve and it’s important to know which cognitive enhancements preserve consciousness.
You can easily clear this confusion if you rephrase it as “You should anticipate having any of these experiences”. Then it’s immediately clear that we are talking about two separate screens.
This introduces some other ambiguities. E.g., “you should anticipate having any of these experiences” may make it sound like you have a choice as to which experience to rationally expect.
And it’s also clear that our curiosity isn’t actually satisfied. That the question “which one of these two will actually be the case” is still very much on the table.
… And the answer is “both of these will actually be the case (but not in a split-screen sort of way)”.
Your rephrase hasn’t shown that there was a question left unanswered in the original post; it’s just shown that there isn’t a super short way to crisply express what happens in English, so you do actually have to add the clarification.
Still, as soon as we got Rob-y and Rob-z they are not “metaphysically the same person”. When Rob-y says “I” he is referring to Rob-y, not Rob-z, and vice versa. More specifically, Rob-y is referring to some causal curve through time and Rob-z is referring to another causal curve through time. These two curves are the same to some point, but then they are not.
Yep, I think this is a perfectly fine way to think about the thing.
My first issue with your post is that this initial ontological assumption is neither mentioned explicitly nor motivated. Nothing in your post can be used as proof of this initial assumption.
There are always going to be many different ways someone could object to a view. If you were a Christian, you’d perhaps be objecting that the existence of incorporeal God-given Souls is the real crux of the matter, and if I were intellectually honest I’d be devoting the first half of the post to arguing against the Christian Soul.
Rather than trying to anticipate these objections, I’d rather just hear them stated out loud by their proponents and then hash them out in the comments. This also makes the post less boring for the sorts of people who are most likely to be on LW: physicalists and their ilk.
Now, what would be the experience of getting copied, seen from a first-person, “internal”, perspective? I am pretty sure it would be something like: you walk into the room, you sit there, you hear, say, the scanner working for some time, it stops, you walk out. From my agnostic perspective, if I were the one to be scanned it seems like nothing special would have happened to me in this procedure. I didn’t feel anything weird, I didn’t feel my “consciousness split into two” or something.
Why do you assume that you wouldn’t experience the copy’s version of events?
The un-copied version of you experiences walking into the room, sitting there, hearing the scanner working, and hearing it stop; then that version of you experiences walking out. It seems like nothing special happened in this procedure; this version of you doesn’t feel anything weird, and doesn’t feel like their “consciousness split into two” or anything.
The copied version of you experiences walking into the room, sitting there, hearing the scanner working, and then an instantaneous experience of (let’s say) feeling like you’ve been teleported into another room: you’re now inside the simulation. Assuming the simulation feels like a normal room, it could well seem like nothing special happened in this procedure; it may feel like blinking and seeing the room suddenly change during the blink, while you yourself remain unchanged. This version of you doesn’t necessarily feel anything weird either, and they don’t feel like their “consciousness split into two” or anything.
It’s a bit weird that there are two futures, here, but only one past—that the first part of the story is the same for both versions of you. But so it goes; that just comes with the territory of copying people.
If you disagree with anything I’ve said above, what do you disagree with? And, again, what do you mean by saying you’re “pretty sure” that you would experience the future of the non-copied version?
Namely, if I consider this procedure as an empirical experiment, from my first person perspective I don’t get any new / unexpected observation compared to, say, just sitting in an ordinary room. Even if I were to go and find my copy, my experience would again be like meeting a different person which just happens to look like me and which claims to have similar memories up to the point when I entered the copying room. There would be no way to verify or to view things from their first person perspective.
Sure. But is any of this Bayesian evidence against the view I’ve outlined above? What would it feel like, if the copy were another version of yourself? Would you expect that you could telepathically communicate with your copy and see things from both perspectives at once, if your copies were equally “you”? If so, why?
On the contrary, I would be wary to, say, kill myself or to be destroyed after the copying procedure, since no change will have occurred to my first person perspective, and it would thus seem less likely that my “experience” would somehow survive because of my copy.
Shall we make a million copies and then take a vote? :)
I agree that “I made a non-destructive software copy of myself and then experienced the future of my physical self rather than the future of my digital copy” is nonzero Bayesian evidence that physical brains have a Cartesian Soul that is responsible for the brain’s phenomenal consciousness; the Cartesian Soul hypothesis does predict that data. But the prior probability of Cartesian Souls is low enough that I don’t think it should matter.
You need some prior reason to believe in this Soul in the first place; the same as if you flipped a coin, it came up heads, and you said “aha, this is perfectly predicted by the existence of an invisible leprechaun who wanted that coin to come up heads!”. Losing a coinflip isn’t a surprising enough outcome to overcome the prior against invisible leprechauns.
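To make the arithmetic behind the leprechaun example concrete, here is a minimal sketch of the update. The numbers are made up purely for illustration: a prior of one in a billion on the leprechaun, a leprechaun hypothesis that predicts heads with certainty, and a fair coin otherwise. None of these figures come from the discussion above, and the `posterior` helper is just a hypothetical name for Bayes’ rule on a binary hypothesis.

```python
def posterior(prior, p_data_given_h, p_data_given_not_h):
    """Bayes' rule for a binary hypothesis H after observing some data."""
    joint_h = prior * p_data_given_h
    joint_not_h = (1 - prior) * p_data_given_not_h
    return joint_h / (joint_h + joint_not_h)


# Illustrative assumptions only (not figures from the discussion):
prior_leprechaun = 1e-9        # assumed prior on "invisible leprechaun wants heads"
p_heads_if_leprechaun = 1.0    # the leprechaun hypothesis predicts heads for sure
p_heads_if_fair_coin = 0.5     # the ordinary hypothesis predicts heads half the time

post = posterior(prior_leprechaun, p_heads_if_leprechaun, p_heads_if_fair_coin)

# A likelihood ratio of 2 roughly doubles a tiny probability; it stays tiny.
print(f"prior:     {prior_leprechaun:.3e}")
print(f"posterior: {post:.3e}")
```

Under these assumed numbers the observation does favor the leprechaun, but only by a factor of about two, which is nowhere near enough to overcome a prior that small. The same structure applies to the Cartesian Soul case: the data is predicted by the hypothesis, but it is also predicted fairly well by the alternative.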
and it would also force me to accept that even a copy where the “circuit” is made of water pipes and pumps, or gears and levers also have an actual, first person experience as “me”, as long as the appropriate computations are being carried out.
Why wouldn’t it? What do you have against water pipes?
Wouldn’t it follow that in the same way you anticipate the future experiences of the brain that you “find yourself in” (i.e. the person reading this) you should anticipate all experiences, i.e. that all brain states occur with the same kind of me-ness/vivid immediacy?
What’s the empirical or physical content of this belief?
I worry that this may be another case of the Cartesian Ghost rearing its ugly head. We notice that there’s no physical thingie that makes the Ghost more connected to one experience or the other; so rather than exorcising the Ghost entirely, we imagine that the Ghost is connected to every experience simultaneously.
But in fact there is no Ghost. There’s just a bunch of experience-moments implemented in brain-moments.
Some of those brain-moments resemble other brain-moments, either by coincidence or because of some (direct or indirect) causal link between the brain-moments. When we talk about Brain-1 “anticipating” or “becoming” a future brain-state Brain-2, we normally mean things like:
There’s a lawful physical connection between Brain-1 and Brain-2, such that the choices and experiences of Brain-1 influence the state of Brain-2 in a bunch of specific ways.
Brain-2 retains ~all of the memories, personality traits, goals, etc. of Brain-1.
If Brain-2 is a direct successor to Brain-1, then typically Brain-2 can remember a bunch of things about the experience Brain-1 was undergoing.
These are all fuzzy, high-level properties, which admit of edge cases. But I’m not seeing what’s gained by therefore concluding “I should anticipate every experience, even ones that have no causal connection to mine and no shared memories and no shared personality traits”. Tables are a fuzzy and high-level concept, but that doesn’t mean that every object in existence is a table. It doesn’t even mean that every object is slightly table-ish. A photon isn’t “slightly table-ish”, it’s just plain not a table.
Which just means, all brain states exist in the same vivid, for-me way, since there is nothing further to distinguish between them that makes them this vivid, i.e. they all exist HERE-NOW.
But they don’t have the anticipation-related properties I listed above; so what hypotheses are we distinguishing by updating from “these experiences aren’t mine” to “these experiences are mine”?
Maybe the update that’s happening is something like: “Previously it felt to me like other people’s experiences weren’t fully real. I was unduly selfish and self-centered, because my experiences seemed to me like they were the center of the universe; I abstractly and theoretically knew that other people have their own point of view, but that fact didn’t really hit home for me. Then something happened, and I had a sudden realization that no, it’s all real.”
If so, then that seems totally fine to me. But I worry that the view in question might instead be something tacitly Cartesian, insofar as it’s trying to say “all experiences are for me”—something that doesn’t make a lot of sense to say if there are two brain states on opposite sides of the universe with nothing in common and nothing connecting them, but that does make sense if there’s a Ghost the experiences are all “for”.
As a test, I asked a non-philosopher friend of mine what their view is. Here’s a transcript of our short conversation: https://docs.google.com/document/d/1s1HOhrWrcYQ5S187vmpfzZcBfolYFIbeTYgqeebNIA0/edit
I was a bit annoyingly repetitive with trying to confirm and re-confirm what their view is, but I think it’s clear from the exchange that my interpretation is correct at least for this person.
Is there even anybody claiming there is an experiential difference?
Yep! Ask someone with this view whether the current stream of consciousness continues from their pre-uploaded self to their post-uploaded self, like it continues when they pass through a doorway. The typical claim is some version of “this stream of consciousness will end, what comes next is only oblivion”, not “oh sure, the stream of consciousness is going to continue in the same way it always does, but I prefer not to use the English word ‘me’ to refer to the later parts of that stream of consciousness”.
This is why the disagreement here has policy implications: people with different views of personal identity have different beliefs about the desirability of mind uploading. They aren’t just disagreeing about how to use words, and if they were, you’d be forced into the equally “uncharitable” perspective that someone here is very confused about how relevant word choice is to the desirability of uploading.
The alternative to this is that there is a disagreement about the appropriate semantic interpretation/analysis of the question. E.g. about what we mean when we say “I will (not) experience such and such”. That seems more charitable than hypothesizing beliefs in “ghosts” or “magic”.
I didn’t say that the relevant people endorse a belief in ghosts or magic. (Some may do so, but many explicitly don’t!)
It’s a bit darkly funny that you’ve reached for a clearly false and super-uncharitable interpretation of what I said, in the same sentence you’re chastising me for being uncharitable! But also, “charity” is a bad approach to trying to understand other people, and bad epistemology can get in the way of a lot of stuff.
The problem was that you first seemed to belittle questions about word meanings (“self”) as being “just” about “definitions” that are “purely verbal”.
I did no such thing!
Luckily now you concede that the question about the meaning of “I” isn’t just about (arbitrary) “definitions”
Read the blog post at the top of this page! It’s my attempt to answer the question of when a mind is “me”, and you’ll notice it’s not talking about definitions.
But we already know all the empirical facts: Someone goes into the teleporter, a bit later someone comes out at the other end and sees something. So the issue can only be about the semantic interpretation of that question, about what we mean with expressions like “I will see x”.
Nope!
There are two perspectives here:
“I don’t want to upload myself, because I wouldn’t get to experience that upload’s experiences. When I die, this stream of consciousness will end, rather than continuing in another body. Physically dying and then being copied elsewhere is not phenomenologically indistinguishable from stepping through a doorway.”
“I do want to upload myself, because I would get to experience that upload’s experiences. Physically dying and then being copied elsewhere is phenomenologically indistinguishable from stepping through a doorway.”
The disagreement between these two perspectives isn’t about word definitions at all; a fear that “when my body dies, there will be nothing but oblivion” is a very real fear about anticipated experiences (and anticipated absences of experience), not a verbal quibble about how we ought to define a specific word.
But it’s also a bit confusing to call the disagreement between these two perspectives “empirical”, because “empirical” here is conflating “third-person empirical” with “first-person empirical”.
The disagreement here is about whether a stream of consciousness can “continue” across temporal and spatial gaps, in the same way that it continues when there are no obvious gaps. It’s about whether there’s a subjective, experiential difference between stepping through a doorway and using a teleporter.
The thing I’m arguing in the OP is that there can’t be an experiential difference here, because there’s no physical difference that could be underlying the supposed experiential difference. So the disagreement about the first-person facts, I claim, stems from a cognitive error, which I characterize as “making predictions as though you believed yourself to be a Cartesian Ghost (even if you don’t on-reflection endorse the claim that Cartesian Ghosts exist)”. This is, again, a very different error from “defining a word in a nonstandard way”.
You’re also free to define “I” however you want in your values.
Sort of!
It’s true that no law of nature will stop you from using “I” in a nonstandard way; your head will not explode if you redefine “table” to mean “penguin”.
And it’s true that there are possible minds in abstract mindspace that have all sorts of values, including strict preferences about whether they want their brain to be made of silicon vs. carbon.
But it’s not true that humans alive today have full and complete control over their own preferences.
And it’s not true that humans can never be mistaken in their beliefs about their own preferences.
In the case of teleportation, I think teleportation-phobic people are mostly making an implicit error of the form “mistakenly modeling situations as though you are a Cartesian Ghost who is observing experiences from outside the universe”, not making a mistake about what their preferences are per se. (Though once you realize that you’re not a Cartesian Ghost, that will have some implications for what experiences you expect to see next in some cases, and implications for what physical world-states you prefer relative to other world-states.)
FWIW, I typically use “alignment research” to mean “AI research aimed at making it possible to safely do ambitious things with sufficiently-capable AI” (with an emphasis on “safely”). So I’d include things like Chris Olah’s interpretability research, even if the proximate impact of this is just “we understand what’s going on better, so we may be more able to predict and finely control future systems” and the proximate impact is not “the AI is now less inclined to kill you”.
Some examples: I wouldn’t necessarily think of “figure out how we want to airgap the AI” as “alignment research”, since it’s less about designing the AI, shaping its mind, predicting and controlling it, etc., and more about designing the environment around the AI.
But I would think of things like “figure out how to make this AI system too socially-dumb to come up with ideas like ‘maybe I should deceive my operators’, while keeping it superhumanly smart at nanotech research” as central examples of “alignment research”, even though it’s about controlling capabilities (‘make the AI dumb in this particular way’) rather than about instilling a particular goal into the AI.
And I’d also think of “we know this AI is trying to kill us; let’s figure out how to constrain its capabilities so that it keeps wanting that, but is too dumb to find a way to succeed in killing us, thereby forcing it to work with us rather than against us in order to achieve more of what it wants” as a pretty central example of alignment research, albeit not the sort of alignment research I feel optimistic about. The way I think about the field, you don’t have to specifically attack the “is it trying to kill you?” part of the system in order to be doing alignment research; there are other paths, and alignment researchers should consider all of them and focus on results rather than marrying a specific methodology.
But that isn’t an experience. It’s two experiences. You will not have an experience of having two experiences. Two experiences will experience having been one person.
Sure; from my perspective, you’re saying the same thing as me.
Are you going to care about 1000 different copies equally?
How am I supposed to choose between them?
Why? If “I” is arbitrary definition, then “When I step through this doorway, will I have another experience?” depends on this arbitrary definition and so is also arbitrary.
Which things count as “I” isn’t an arbitrary definition; it’s just a fuzzy natural-language concept.
(I guess you can call that “arbitrary” if you want, but then all the other words in the sentence, like “doorway” and “step”, are also “arbitrary”.)
Analogy: When you’re writing in your personal diary, you’re free to define “table” however you want. But in ordinary English-language discourse, if you call all penguins “tables” you’ll just be wrong. And this fact isn’t changed at all by the fact that “table” lacks a perfectly formal physics-level definition.
The same holds for “Will Rob Bensinger’s next experience be of sitting in his bedroom writing a LessWrong comment, or will it be of him grabbing some tomatoes in a supermarket in Beijing?”
Terms like ‘Rob Bensinger’ and ‘I’ aren’t perfectly physically crisp — there may be cases where the answer is “ehh, maybe?” rather than a clear yes or no. And if we live in a Big Universe and we allow that there can be many Beijings out there in space, then we’ll have to give a more nuanced quantitative answer, like “a lot more of Rob’s immediate futures are in his bedroom than in Beijing”.
But if we restrict our attention to this Beijing, then all that complexity goes away and we can pretty much rule out that anyone in Beijing will happen to momentarily exhibit exactly the right brain state to look like “Rob Bensinger plus one time step”.
The nuances and wrinkles don’t bleed over and make it a totally meaningless or arbitrary question; and indeed, if I thought I were likely to spontaneously teleport to Beijing in the next minute, I’d rightly be making very different life-choices! “Will I experience myself spontaneously teleporting to Beijing in the next second?” is a substantive (and easy) question, not a deep philosophical riddle.
So you always anticipate all possible experiences, because of multiverse?
Not all possible experiences; just all experiences of brains that have the same kinds of structural similarities to your current brain as, e.g., “me after I step through a doorway” has to “me before I stepped through the doorway”.
Two things:
For myself, I would not feel comfortable using language as confident-sounding as “on the default trajectory, AI is going to kill everyone” if I assigned (e.g.) 10% probability to “humanity [gets] a small future on a spare asteroid-turned-computer or an alien zoo or maybe even star”. I just think that scenario’s way, way less likely than that.
I’d be surprised if Nate assigns 10+% probability to scenarios like that, but he can speak for himself. 🤷
I think some people at MIRI have significantly lower p(doom)? And I don’t expect those people to use language like “on the default trajectory, AI is going to kill everyone”.
I agree with you that there’s something weird about making lots of human-extinction-focused arguments when the thing we care more about is “does the cosmic endowment get turned into paperclips”? I do care about both of those things, an enormous amount; and I plan to talk about both of those things to some degree in public communications, rather than treating it as some kind of poorly-kept secret that MIRI folks care about whether flourishing interstellar civilizations get a chance to exist down the line. But I have this whole topic mentally flagged as a thing to be thoughtful and careful about, because it at least seems like an area that contains risk factors for future deceptive comms. E.g., if we update later to expecting the cosmic endowment to be wasted but all humans not dying, I would want us to adjust our messaging even if that means sacrificing some punchiness in our policy outreach.
Currently, however, I think the particular scenario “AI keeps a few flourishing humans around forever” is incredibly unlikely, and I don’t think Eliezer, Nate, etc. would say things like “this has a double-digit probability of happening in real life”? And, to be honest, the idea of myself and my family and friends and every other human being all dying in the near future really fucks me up and does not seem in any sense OK, even if (with my philosopher-hat on) I think this isn’t as big of a deal as “the cosmic endowment gets wasted”.
So I don’t currently feel bad about emphasizing a true prediction (“extremely likely that literally all humans literally nonconsensually die by violent means”), even though the philosophy-hat version of me thinks that the separate true prediction “extremely likely 99+% of the potential value of the long-term future is lost” is more morally important than that. Though I do feel obliged to semi-regularly mention the whole “cosmic endowment” thing in my public communication too, even if it doesn’t make it into various versions of my general-audience 60-second AI risk elevator pitch.