The more complex messages sound like a great way to make the public communication more complex and off-putting.
The difference between killing everyone and killing almost everyone while keeping a few alive for arcane purposes does not matter to most people, nor should it.
I agree that the arguments for misaligned AGI killing absolutely everyone aren’t solid, but the arguments against that seem at least as shaky. So rounding it to “might quite possibly kill everyone” seems fair and succinct.
From the other thread where this comment originated: the argument that AGI won’t kill everyone because people wouldn’t kill everyone seems very bad, even when applied to human-imitating LLM-based AGI. People are nice because evolution meticulously made us nice. And even humans have killed an awful lot of people, with no sign they’d stop before killing everyone if it seemed useful for their goals.
Why not “AIs might violently take over the world”?
Seems accurate to the concern while also avoiding any issues here.
That phrase sounds like the Terminator movies to me; it sounds like plucky humans could still band together to overthrow their robot overlords. I want to convey a total loss of control.
In documents where we have more room to unpack concepts I can imagine getting into some of the more exotic scenarios like aliens buying brain scans, but mostly I don’t expect our audiences to find that scenario reassuring in any way, and going into any detail about it doesn’t feel like a useful way to spend weirdness points.
Some of the other things you suggest, like future systems keeping humans physically alive, do not seem plausible to me. Whatever they’re trying to do, there’s almost certainly a better way to do it than by keeping Matrix-like human body farms running.
That may be a reasonable consequentialist decision given your goals, but it’s in tension with your claim in the post to be disregarding the advice of people telling you to “hoard status and credibility points, and [not] spend any on being weird.”
You’ve completely ignored the arguments from Paul Christiano that Ryan linked to at the top of the thread. (In case you missed it: 1 2.)
The claim under consideration is not that “keeping Matrix-like human body farms running” arises as an instrumental subgoal of “[w]hatever [AIs are] trying to do.” (If you didn’t have time to read the linked arguments, you could have just said that instead of inventing an obvious strawman.)
Rather, the claim is that it’s plausible that the AI we build (or some agency that has decision-theoretic bargaining power with it) cares about humans enough to spend some tiny fraction of the cosmic endowment on our welfare. (Compare to how humans care enough about nature preservation and animal welfare to spend some resources on it, even though it’s a tiny fraction of what our civilization is doing.)
Maybe you think that’s implausible, but if so, there should be a counterargument explaining why Christiano is wrong. As Ryan notes, Yudkowsky seems to believe that some scenarios in which an agency with bargaining power cares about humans are plausible, describing one such scenario as “validly incorporat[ing] most all the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don’t expect Earthlings to think about validly.” I regard this statement as undermining your claim in the post that MIRI’s “reputation as straight shooters [...] remains intact.” Withholding information because you don’t trust your audience to reason validly (!!) is not at all the behavior of a “straight shooter”.
I think it makes sense to state the more direct threat-model of literal extinction; though I am also a little confused by the citing of weirdness points… I would’ve said that it makes the whole conversation more complex in a way that (I believe) everyone would reliably end up thinking was not a productive use of time.
(Expanding on this a little: I think that literal extinction is a likely default outcome, and most people who are newly coming to this topic would want to know that this is even in the hypothesis-space and find that to be key information. I think if I said “also maybe they later simulate us in weird configurations like pets for a day every billion years while experiencing insane things” they would not respond “ah, never mind then, this subject is no longer a very big issue”, they would be more like “I would’ve preferred that you had factored this element out of our discussion so far, we spent a lot of time on it yet it still seems to me like the extinction event being on the table is the primary thing that I want to debate”.)
Hmm, I’m not sure I exactly buy this. I think you should probably follow something like onion honesty, which can involve intentionally simplifying your message to something you expect will give the audience more true views. I think you should err on the side of stating things, but still, sometimes stating a true thing can be clearly distracting and confusing, and thus you shouldn’t.
Passing the onion test is better than not passing it, but I think the relevant standard is having intent to inform. There’s a difference between trying to share relevant information in the hopes that the audience will integrate it with their own knowledge and use it to make better decisions, and selectively sharing information in the hopes of persuading the audience to make the decision you want them to make.
An evidence-filtering clever arguer can pass the onion test (by not omitting information that the audience would be surprised to learn was omitted) and pass the test of not technically lying (by not making false statements) while failing to make a rational argument in which the stated reasons are the real reasons.
Man I just want to say I appreciate you following up on each subthread and noting where you agree/disagree, it feels earnestly truthseeky to me.
I agree with Gretta here, and I think this is a crux. If MIRI folks thought it were likely that AI will leave a few humans biologically alive (as opposed to information-theoretically revivable), I don’t think we’d be comfortable saying “AI is going to kill everyone”. (I encourage other MIRI folks to chime in if they disagree with me about the counterfactual.)
I also personally have maybe half my probability mass on “the AI just doesn’t store any human brain-states long-term”, and I have less than 1% probability on “conditional on the AI storing human brain-states for future trade, the AI does in fact encounter aliens that want to trade and this trade results in a flourishing human civilization”.
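(Spelling out the rough arithmetic those two numbers imply, reading “half my probability mass” loosely as P ≈ 0.5; this is a sketch rather than a precise restatement of the credences above:

```latex
% Rough bound implied by the stated credences (illustrative sketch only):
%   P(stored) \approx 0.5         (brain-states are kept long-term)
%   P(flourish | stored) < 0.01   (aliens trade, and a flourishing human
%                                  civilization results)
\[
P(\text{flourishing via stored brain-states})
  = P(\text{stored}) \cdot P(\text{flourish} \mid \text{stored})
  \lesssim 0.5 \times 0.01 = 0.005 .
\]
```

i.e., on these numbers, well under a 1% chance of this particular route to a good outcome.)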
Yeah, seems like a reasonable concern.
FWIW, I also do think that it is reasonably likely that we’ll see conflict between human factions and AI factions (likely with human allies) in which the human factions could very plausibly win. So, personally, I don’t think that “immediate total loss of control” is what people should typically be imagining.
Insofar as AIs are doing things because they are what existing humans want (within some tiny cost budget), I expect you should imagine that what actually happens is what humans want (rather than, e.g., what the AI thinks they “should want”), insofar as what humans want is cheap.
See also here, which makes a similar argument in response to a similar point.
So, if humans don’t end up physically alive but do end up as uploads/body farms/etc., one of a few things must be true:
Humans didn’t actually want to be physically alive and instead wanted to be uploads. In this case, it is very misleading to say “the AI will kill everyone (and sure there might be uploads, but you don’t want to be an upload right?)” because we’re conditioning on people deciding to become uploads!
It was too expensive to keep people physically alive rather than as uploads. I think this is possible but somewhat implausible: the main sources of cost here (in particular, death due to conflict or mass slaughter in cases where conflict was the AI’s best option for increasing the probability of long-run control) apply to uploads as much as to keeping humans physically alive.
I don’t think slaughtering billions of people would be very useful. As a reference point, wars between countries almost never result in slaughtering that large a fraction of people.
Unfortunately, if the AI really barely cares (e.g. <1/billion caring), it might only need to be barely useful.
I agree it is unlikely to be very useful.
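(One toy formalization of the “barely useful” point; w here is a hypothetical weight for illustration, not a number anyone above committed to:

```latex
% Suppose the AI puts weight w < 10^{-9} on human welfare and weight
% (1 - w) on its other goals. If slaughter yields gain G for those other
% goals and loss L in human-welfare terms, the AI prefers slaughter when
\[
(1 - w)\, G > w\, L
\quad\Longleftrightarrow\quad
G > \frac{w}{1 - w}\, L \;\approx\; 10^{-9}\, L .
\]
% So with such a tiny w, the gain G only needs to be "barely useful"
% relative to L for slaughter to win out.
```

On this toy framing, even a minuscule usefulness is enough to outweigh sub-one-in-a-billion caring.)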
I would like to +1 the “I don’t expect our audiences to find that scenario reassuring in any way”—I would also add that the average policymaker I’ve ever met wouldn’t find leaving out the exotic scenarios to be in any way inaccurate or deceitful, unless you were way in the weeds for a multi-hour convo and/or they asked you in detail, “well, are there any weird edge cases where we make it through?”
Sure! I like it for brevity and accuracy of both the threat and its seriousness. I’ll try to use it instead of “kill everyone.”
I basically agree with this as stated, but think these arguments also imply that it is reasonably likely that the vast majority of people will survive misaligned AI takeover (perhaps 50% likely).
I also don’t think this is very well described as arcane purposes:
Kindness is pretty normal.
Decision-theoretic motivations are actually also pretty normal from some perspective: they’re just the generalization of the relatively normal “if you wouldn’t have screwed me over and it’s cheap for me, I won’t screw you over”. (Of course, people typically don’t motivate this sort of thing in terms of decision theory, so there is a bit of a midwit meme here.)
You’re right. I didn’t mean to say that kindness is arcane. I was referring to acausal trade or other strange reasons to keep some humans around for possible future use.
Kindness is normal in our world, but I wouldn’t assume it will exist in every or even most situations with intelligent beings. Humans are instinctively kind (except for sociopathic and sadistic people), because that is good game theory for our situation: interactions with peers, in which collaboration/teamwork is useful.
A being capable of real recursive self-improvement, let alone duplication and the creation of subordinate minds, is not in that situation. They may temporarily be dealing with peers, but they might reasonably expect to have no need of collaborators in the near future. Thus, kindness isn’t rational for that type of being.
The exception would be if they could make a firm commitment to kindness while they do have peers and need collaborators. They might have kindness merely as an instrumental goal, in which case it would be abandoned as soon as it was no longer useful.
Or they might display kindness more instinctively, as a tendency in their thought or behavior. They might even have it engineered in as an innate goal, as Steve hopes to do. In those last two cases, I think it’s possible that reflective stability would keep that kindness in place as the AGI continued to grow, but I wouldn’t bet on it unless kindness was their central goal. If it were merely a tendency and not an explicit and therefore self-endorsed goal, I’d expect it to be dropped like the bad habit it effectively is. If it were an innate goal but not the strongest one, I don’t know, but I wouldn’t bet on it being long-term reflectively stable under deliberate self-modification.
(As far as I know, nobody has tried hard to work through the logic of reflective stability of multiple goals. I tried, and gave it up as too vague and less urgent than other alignment questions. My tentative answer was that multiple goals might be reflectively stable; it depends on the exact structure of the decision-making process in that AGI/mind.)