I appreciate the straightforward and honest nature of this communication strategy, in the sense of “telling it like it is” and not hiding behind obscure or vague language. In that same spirit, I’ll provide my brief, yet similarly straightforward reaction to this announcement:
I think MIRI is incorrect in their assessment of the likelihood of human extinction from AI. As per their messaging, several people at MIRI seem to believe that doom is >80% likely in the 21st century (conditional on no global pause) whereas I think it’s more like <20%.
MIRI’s arguments for doom are often difficult to pin down, given the informal nature of their arguments, and in part due to their heavy reliance on analogies, metaphors, and vague supporting claims instead of concrete empirically verifiable models. Consequently, I find it challenging to respond to MIRI’s arguments precisely. The fact that they want to essentially shut down the field of AI based on these largely informal arguments seems premature to me.
MIRI researchers rarely provide any novel predictions about what will happen before AI doom, making their theories of doom appear unfalsifiable. This frustrates me. Given a low prior probability of doom as apparent from the empirical track record of technological progress, I think we should generally be skeptical of purely theoretical arguments for doom, especially if they are vague and make no novel, verifiable predictions prior to doom.
Separately from the previous two points, MIRI’s current most prominent arguments for doom seem very weak to me. Their broad model of doom appears to be something like the following (although they would almost certainly object to the minutiae of how I have written it here):
(1) At some point in the future, a powerful AGI will be created. This AGI will be qualitatively distinct from previous, more narrow AIs. Unlike concepts such as “the economy”, “GPT-4″, or “Microsoft”, this AGI is not a mere collection of entities or tools integrated into broader society that can automate labor, share knowledge, and collaborate on a wide scale. This AGI is instead conceived of as a unified and coherent decision agent, with its own long-term values that it acquired during training. As a result, it can do things like lie about all of its fundamental values and conjure up plans of world domination, by itself, without any risk of this information being exposed to the wider world.
(2) This AGI, via some process such as recursive self-improvement, will rapidly “foom” until it becomes essentially an immortal god, at which point it will be able to do almost anything physically attainable, including taking over the world at almost no cost or risk to itself. While recursive self-improvement is the easiest mechanism to imagine here, it is not the only way this could happen.
(3) The long-term values of this AGI will bear almost no relation to the values that we tried to instill through explicit training, because of difficulties in inner alignment (i.e., a specific version of the general phenomenon of models failing to generalize correctly from training data). This implies that the AGI will care almost literally 0% about the welfare of humans (despite potentially being initially trained from the ground up on human data, and carefully inspected and tested by humans for signs of misalignment, in diverse situations and environments). Instead, this AGI will pursue a completely meaningless goal until the heat death of the universe.
(4) Therefore, the AGI will kill literally everyone after fooming and taking over the world.
It is difficult to explain in a brief comment why I think the argument just given is very weak. Instead of going into the various subclaims here in detail, for now I want to simply say, “If your model of reality has the power to make these sweeping claims with high confidence, then you should almost certainly be able to use your model of reality to make novel predictions about the state of the world prior to AI doom that would help others determine if your model is correct.”
The fact that MIRI has yet to produce (to my knowledge) any major empirically validated predictions or important practical insights into the nature of AI, or AI progress, in the last 20 years, undermines the idea that they have the type of special insight into AI that would allow them to express high confidence in a doom model like the one outlined in (4).
Eliezer’s response to claims about unfalsifiability, namely that “predicting endpoints is easier than predicting intermediate points”, seems like a cop-out to me, since this would seem to reverse the usual pattern in forecasting and prediction, without good reason.
Since I think AI will most likely be a very good thing for currently existing people, I am much more hesitant to “shut everything down” compared to MIRI. I perceive MIRI researchers as broadly well-intentioned, thoughtful, yet ultimately fundamentally wrong in their worldview on the central questions that they research, and therefore likely to do harm to the world. This admittedly makes me sad to think about.
Eliezer’s response to claims about unfalsifiability, namely that “predicting endpoints is easier than predicting intermediate points”, seems like a cop-out to me, since this would seem to reverse the usual pattern in forecasting and prediction, without good reason
It’s pretty standard? Like, we can make reasonable predictions of the climate in 2100, even if we can’t predict the weather two months ahead.
To be blunt, it’s not just that Eliezer lacks a positive track record in predicting the nature of AI progress, which might be forgivable if we thought he had really good intuitions about this domain. Empiricism isn’t everything; theoretical arguments are important too and shouldn’t be dismissed. But-
Eliezer thought AGI would be developed from a recursively self-improving seed AI coded up by a small group, “brain in a box in a basement” style. He dismissed and mocked connectionist approaches to building AI. His writings repeatedly downplayed the importance of compute, and he has straw-manned writers like Moravec who did a better job at predicting when AGI would be developed than he did.
Old MIRI intuition pumps about why alignment should be difficult, like the “Outcome Pump” and “Sorcerer’s Apprentice”, are now forgotten; it was a surprise that it would be easy to create helpful genies like LLMs that basically just do what we want. Remaining arguments for the difficulty of alignment are esoteric considerations about inductive biases, counting arguments, etc. So yes, let’s actually look at these arguments and not just dismiss them, but let’s not pretend that MIRI has a good track record.
I think the core concerns remain, and more importantly, there are other rather doom-y scenarios that have opened up involving AI systems more similar to the ones we have now, which aren’t the straight-up singleton-ASI foom. The problem here is IMO not “this specific doom scenario will become a thing” but “we don’t have anything resembling a GOOD vision of the future with this tech that we are nevertheless developing at breakneck pace”. Yet the number of possible dystopian or apocalyptic scenarios is enormous. Part of this is “what if we lose control of the AIs” (singleton or multipolar), part of it is “what if we fail to structure our society around having AIs” (loss of control, mass wireheading, and a lot of other scenarios I’m not sure how to name). The only positive vision the “optimists” on this have to offer is “don’t worry, it’ll be fine; this clearly revolutionary, never-before-seen technology that puts in question our very role in the world will play out the same way every invention ever did”. And that’s not terribly convincing.
I’m not saying anything on the object level about MIRI models; my point is that “outcomes are more predictable than trajectories” is a pretty standard, epistemically non-suspicious statement about a wide range of phenomena. Moreover, in these particular circumstances (and many others) you can reduce it to an object-level claim, like “do observations on current AIs generalize to future AIs?”
How does the question of whether AI outcomes are more predictable than AI trajectories reduce to the (vague) question of whether observations on current AIs generalize to future AIs?
ChatGPT falsifies predictions about future superintelligent recursive self-improving AI only if ChatGPT is a generalizable predictor of the design of future superintelligent AIs.
There will be future superintelligent AIs that improve themselves. But they will be neural networks; they will at the very least start out as a compute-intensive project; and in the infant stages of their self-improvement cycles they will understand and be motivated by human concepts, rather than being dumb specialized systems that are only good for bootstrapping themselves to superintelligence.
True knowledge about later times doesn’t generally let you make arbitrary predictions about intermediate times. But true knowledge does usually imply that you can make some theory-specific predictions about intermediate times.
Thus, vis-a-vis your examples: Predictions about the climate in 2100 don’t involve predicting tomorrow’s weather. But they do almost always involve predictions about the climate in 2040 and 2070, and they’d be really sus if they didn’t.
Similarly:
If an astronomer thought that an asteroid was going to hit the earth, the astronomer could generally predict the points at which it would be observed in the future before hitting the earth. This is true even if they couldn’t, for instance, predict the color of the asteroid.
People who predicted that C19 would infect millions by T + 5 months also had predictions about how many people would be infected at T + 2 months. This is true even if they couldn’t predict how hard it would be to make a vaccine.
(Extending the analogy to scale rather than time) The ability to predict that nuclear war would kill billions involves a pretty good explanation for how a single nuke would kill millions.
So I think that—entirely apart from specific claims about whether MIRI does this—it’s pretty reasonable to expect them to be able to make some theory-specific predictions about the before-end-times, although it’s unreasonable to expect them to make arbitrary theory-specific predictions.
I agree this is usually the case, but I think it’s not always true, and I don’t think it’s necessarily true here. E.g., people as early as Da Vinci guessed that we’d be able to fly long before we had planes (or even any flying apparatus which worked). Because birds can fly, and so we should be able to as well (at least, this was Da Vinci’s and the Wright brothers’ reasoning). That end point was not dependent on details (early flying designs had wings like a bird, a design which we did not keep :p), but was closer to a laws-of-physics claim (if birds can do it, there isn’t anything fundamentally holding us back from doing it either).
Superintelligence holds a similar place in my mind: intelligence is physically possible, because we exhibit it, and it seems quite arbitrary to assume that we’ve maxed it out. But also, intelligence is obviously powerful, and reality is obviously more manipulable than we currently have the means to manipulate it. E.g., we know that we should be capable of developing advanced nanotech, since cells can, and that space travel/terraforming/etc. is possible.
These two things together—“we can likely create something much smarter than ourselves” and “reality can be radically transformed”—are enough to make me feel nervous. At some point I expect most of the universe to be transformed by agents; whether this is us, or aligned AIs, or misaligned AIs or what, I don’t know. But looking ahead and noticing that I don’t know how to select the “aligned AI” option from the set “things which will likely be able to radically transform matter” seems enough cause, in my mind, for exercising caution.
There’s a pretty big difference between statements like “superintelligence is physically possible”, “superintelligence could be dangerous” and statements like “doom is >80% likely in the 21st century unless we globally pause”. I agree with (and am not objecting to) the former claims, but I don’t agree with the latter claim.
I also agree that it’s sometimes true that endpoints are easier to predict than intermediate points. I haven’t seen Eliezer give a reasonable defense of this thesis as it applies to his doom model. If all he means here is that superintelligence is possible, it will one day be developed, and we should be cautious when developing it, then I don’t disagree. But I think he’s saying a lot more than that.
Your general point is true, but it’s not necessarily true (1) that a correct model can predict the timing of AGI, or (2) that the predictable precursors to disaster will occur before the practical c-risk (catastrophic-risk) point of no return. While I’m not as pessimistic as Eliezer, my mental model has these two limitations. My model does predict that, prior to disaster, a fairly safe, non-ASI AGI or pseudo-AGI (e.g. GPT6, a chatbot that can do a lot of office jobs and menial jobs pretty well) is likely to be invented before the really deadly one (if any[1]). But even if I predicted right, it probably won’t make people take my c-risk concerns more seriously?
I think it’s more similar to saying that the climate in 2040 is less predictable than the climate in 2100, or saying that the weather 3 days from now is less predictable than the weather 10 days from now, neither of which is true. By contrast, the weather vs. climate distinction is more of a difference between predicting point estimates vs. predicting averages.
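To illustrate the point-estimates-vs.-averages distinction, here is a minimal toy simulation (my own sketch; the 15 C mean and 5 C daily noise are made-up, purely illustrative numbers): individual days vary wildly, while long-run averages of the same process are pinned down tightly.

```python
import random

random.seed(0)

def daily_temperature() -> float:
    """Toy model: a fixed long-run mean of 15 C plus independent daily noise."""
    return 15.0 + random.gauss(0, 5)

# "Weather": point estimates for single days.
single_days = [daily_temperature() for _ in range(1_000)]

# "Climate": 30-year averages of the very same process.
thirty_year_averages = [
    sum(daily_temperature() for _ in range(30 * 365)) / (30 * 365)
    for _ in range(50)
]

spread = lambda xs: max(xs) - min(xs)
print(f"spread of single days:      {spread(single_days):.1f} C")        # tens of degrees
print(f"spread of 30-year averages: {spread(thirty_year_averages):.2f} C")  # a fraction of a degree
```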
the climate in 2040 is less predictable than the climate in 2100
It’s certainly not a simple question. Say, the Gulf Stream is projected to collapse somewhere between now and 2095, with a median date of 2050. So, slightly abusing the meaning of confidence intervals, we can say that in 2100 we won’t have the Gulf Stream with probability >95%, while in 2040 the Gulf Stream will still be here with probability ~60%, which is literally less predictable.
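One way to make “literally less predictable” concrete (my own illustration, reusing the rough numbers above): treat each date’s forecast as a yes/no question about whether the Gulf Stream still exists, and compare the entropy of the two answers.

```python
from math import log2

def binary_entropy(p: float) -> float:
    """Uncertainty, in bits, of a yes/no forecast assigned probability p."""
    return -(p * log2(p) + (1 - p) * log2(1 - p))

# Rough numbers from the Gulf Stream example above.
print(f"2100 forecast (p = 0.95): {binary_entropy(0.95):.2f} bits")  # ~0.29 bits
print(f"2040 forecast (p = 0.60): {binary_entropy(0.60):.2f} bits")  # ~0.97 bits
```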
Chemists would give an example of chemical reactions, where final thermodynamically stable states are easy to predict, while unstable intermediate states are very hard to even observe.
Very dumb example: if you are observing a radioactive atom with a half-life of one minute, you can’t predict when the atom is going to decay, but you can be very certain that it will have decayed after an hour.
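To spell out the arithmetic behind this example (my own worked step, not part of the original comment):

```latex
P(\text{decayed by time } t) = 1 - 2^{-t/T_{1/2}},
\qquad
P(\text{decayed within 60 min}) = 1 - 2^{-60} \approx 1 - 8.7 \times 10^{-19}.
```

That is, the atom is effectively certain to have decayed within the hour, even though no particular minute of decay could have been predicted.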
And why don’t you accept the classic MIRI example that even if it’s impossible for a human to predict the moves of Stockfish 16, you can be certain that Stockfish will win?
Chemists would give an example of chemical reactions, where final thermodynamically stable states are easy to predict, while unstable intermediate states are very hard to even observe.
I agree there are examples where the end state is easier to predict than the intermediate states. Here, it’s because we have strong empirical and theoretical reasons to think that chemicals will settle into some equilibrium after a reaction. With AGI, I have yet to see a compelling argument for why we should expect a specific easy-to-predict equilibrium state after it’s developed, which somehow depends very little on how the technology is developed.
It’s also important to note that, even if we know that there will be an equilibrium state after AGI, more evidence is generally needed to establish that the end equilibrium state will specifically be one in which all humans die.
And why don’t you accept the classic MIRI example that even if it’s impossible for a human to predict the moves of Stockfish 16, you can be certain that Stockfish will win?
I don’t accept this argument as a good reason to think doom is highly predictable partly because I think the argument is dramatically underspecified without shoehorning in assumptions about what AGI will look like to make the argument more comprehensible. I generally classify arguments like this under the category of “analogies that are hard to interpret because the assumptions are so unclear”.
To help explain my frustration at the argument’s ambiguity, I’ll just give a small yet certainly non-exhaustive set of questions I have about this argument:
Are we imagining that creating an AGI implies that we play a zero-sum game against it? Why?
Why is it a simple human vs. AGI game anyway? Does that mean we’re lumping together all the humans into a single agent, and all the AGIs into another agent, and then they play off against each other like a chess match? What is the justification for believing the battle will be binary like this?
Are we assuming the AGI wants to win? Maybe it’s not an agent at all. Or maybe it’s an agent but not the type of agent that wants this particular type of outcome.
What does “win” mean in the general case here? Does it mean the AGI merely gets more resources than us, or does it mean the AGI kills everyone? These seem like different yet legitimate ways that one can “win” in life, with dramatically different implications for the losing parties.
There’s a lot more I can say here, but the basic point I want to make is that once you start fleshing this argument out, and giving it details, I think it starts to look a lot weaker than the general heuristic that Stockfish 16 will reliably beat humans in chess, even if we can’t predict its exact moves.
>Like, we can make reasonable predictions of the climate in 2100, even if we can’t predict the weather two months ahead.
This is a strange claim to make in a thread about AGI destroying the world. Obviously, if AGI destroys the world, we cannot predict the weather in 2100.
Predicting the weather in 2100 requires you to make a number of detailed claims about the years between now and 2100 (for example, the carbon emissions per year), and it is precisely the lack of these claims that @Matthew Barnett is talking about.
I strongly doubt we can predict the climate in 2100. An actual prediction would require a model that also incorporates the possibility of nuclear fusion, geoengineering, AGIs altering the atmosphere, etc.
I think you are abusing/misusing the concept of falsifiability here. Ditto for empiricism. You aren’t the only one to do this, I’ve seen it happen a lot over the years and it’s very frustrating. I unfortunately am busy right now but would love to give a fuller response someday, especially if you are genuinely interested to hear what I have to say (which I doubt, given your attitude towards MIRI).
I unfortunately am busy right now but would love to give a fuller response someday, especially if you are genuinely interested to hear what I have to say (which I doubt, given your attitude towards MIRI).
I’m a bit surprised you suspect I wouldn’t be interested in hearing what you have to say?
I think the amount of time I’ve spent engaging with MIRI perspectives over the years provides strong evidence that I’m interested in hearing opposing perspectives on this issue. I’d guess I’ve engaged with MIRI perspectives vastly more than almost everyone on Earth who explicitly disagrees with them as strongly as I do (although obviously some people like Paul Christiano and other AI safety researchers have engaged with them even more than me).
(I might not reply to you, but that’s definitely not because I wouldn’t be interested in what you have to say. I read virtually every comment-reply to me carefully, even if I don’t end up replying.)
Here’s a new approach: Your list of points 1-7. Would you also make those claims about me? (i.e. replace references to MIRI with references to Daniel Kokotajlo.)
You’ve made detailed predictions about what you expect in the next several years, on numerous occasions, and made several good-faith attempts to elucidate your models of AI concretely. There are many ways we disagree, and many ways I could characterize your views, but “unfalsifiable” is not a label I would tend to use for your opinions on AI. I do not mentally lump you together with MIRI in any strong sense.
OK, glad to hear. And thank you. :) Well, you’ll be interested to know that I think of my views on AGI as being similar to MIRI’s, just less extreme in various dimensions. For example I don’t think literally killing everyone is the most likely outcome, but I think it’s a very plausible outcome. I also don’t expect the ‘sharp left turn’ to be particularly sharp, such that I don’t think it’s a particularly useful concept. I also think I’ve learned a lot from engaging with MIRI and while I have plenty of criticisms of them (e.g. I think some of them are arrogant and perhaps even dogmatic) I think they have been more epistemically virtuous than the average participant in the AGI risk conversation, even the average ‘serious’ or ‘elite’ participant.
I don’t think [AGI/ASI] literally killing everyone is the most likely outcome
Huh, I was surprised to read this. I’ve imbibed a non-trivial fraction of your posts and comments here on LessWrong, and, before reading the above, my shoulder Daniel definitely saw extinction as the most likely existential catastrophe.
If you have the time, I’d be very interested to hear what you do think is the most likely outcome. (It’s very possible that you have written about this before and I missed it—my bad, if so.)
(My model of Daniel thinks the AI will likely take over, but probably will give humanity some very small fraction of the universe, for a mixture of “caring a tiny bit” and game-theoretic reasons)
(Fwiw, I don’t find the ‘caring a tiny bit’ story very reassuring, for the same reasons as Wei Dai, although I do find the acausal trade story for why humans might be left with Earth somewhat heartening. (I’m assuming that by ‘game-theoretic reasons’ you mean acausal trade.))
Yep, Habryka is right. Also, I agree with Wei Dai re: reassuringness. I think literal extinction is <50% likely, but this is cold comfort given the badness of some of the plausible alternatives, and overall I think the probability of something comparably bad happening is >50%.
I want to publicly endorse and express appreciation for Matthew’s apparent good faith.
Every time I’ve ever seen him disagreeing about AI stuff on the internet (a clear majority of the times I’ve encountered anything he’s written), he’s always been polite, reasonable, thoughtful, and extremely patient. Obviously conversations sometimes entail people talking past each other, but I’ve seen him carefully try to avoid miscommunication, and (to my ability to judge) strawmanning.
Followup: Matthew and I ended up talking about it in person. tl;dr of my position is that
Falsifiability is a symmetric two-place relation; one cannot say “X is unfalsifiable,” except as shorthand for saying “X and Y make the same predictions,” and thus Y is equally unfalsifiable. When someone is going around saying “X is unfalsifiable, therefore not-X,” that’s often a misuse of the concept—what they should say instead is “On priors / for other reasons (e.g. deference) I prefer not-X to X; and since both theories make the same predictions, I expect to continue thinking this instead of updating, since there won’t be anything to update on.”
What is the point of falsifiability-talk then? Well, first of all, it’s quite important to track when two theories make the same predictions, or the same-predictions-till-time-T. It’s an important part of the bigger project of extracting predictions from theories so they can be tested. It’s exciting progress when you discover that two theories make different predictions, and nail it down well enough to bet on. Secondly, it’s quite important to track when people are making this harder rather than easier—e.g. fortunetellers and pundits will often go out of their way to avoid making any predictions that diverge from what their interlocutors already would predict. Whereas the best scientists/thinkers/forecasters, the ones you should defer to, should be actively trying to find alpha and then exploit it by making bets with people around them. So falsifiability-talk is useful for evaluating people as epistemically virtuous or vicious. But note that if this is what you are doing, it’s all a relative thing in a different way—in the case of MIRI, for example, the question should be “Should I defer to them more, or less, than various alternative thinkers A, B, and C? --> Are they generally more virtuous about making specific predictions, seeking to make bets with their interlocutors, etc. than A, B, or C?”
So with that as context, I’d say that (a) It’s just wrong to say ‘MIRI’s theories of doom are unfalsifiable.’ Instead say ‘unfortunately for us (not for the plausibility of the theories), both MIRI’s doom theories and (insert your favorite non-doom theories here) make the same predictions until it’s basically too late.’ (b) One should then look at MIRI and be suspicious and think ‘are they systematically avoiding making bets, making specific predictions, etc. relative to the other people we could defer to? Are they playing the sneaky fortuneteller or pundit’s game?’ to which I think the answer is ‘no not at all, they are actually more epistemically virtuous in this regard than the average intellectual. That said, they aren’t the best either—some other people in the AI risk community seem to be doing better than them in this regard, and deserve more virtue points (and possibly deference points) therefore.’ E.g. I think both Matthew and I have more concrete forecasting track records than Yudkowsky?
“If your model of reality has the power to make these sweeping claims with high confidence, then you should almost certainly be able to use your model of reality to make novel predictions about the state of the world prior to AI doom that would help others determine if your model is correct.”
This is partially derivable from Bayes’ rule. In order for you to gain confidence in a theory, you need to make observations which are more likely in worlds where the theory is correct. Since MIRI seems to have grown even more confident in their models, they must’ve observed something which is more likely under their models than under the alternatives. Therefore, to obey Conservation of Expected Evidence, the world could have come out a different way which would have decreased their confidence. So it was falsifiable this whole time. However, in my experience, MIRI-sympathetic folk deny this for some reason.
It’s simply not possible, as a matter of Bayesian reasoning, to lawfully update (today) based on empirical evidence (like LLMs succeeding) in order to change your probability of a hypothesis that “doesn’t make” any empirical predictions (today).
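A minimal numerical sketch of that Bayesian point (my own toy numbers, purely for illustration): if observing E raises P(H), then observing not-E must lower it, and the prior equals the probability-weighted average of the possible posteriors.

```python
def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """P(H | E) via Bayes' rule, for a binary hypothesis H and observed evidence E."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

# Toy, purely illustrative numbers.
prior = 0.5
p_e_given_h, p_e_given_not_h = 0.8, 0.4          # E is more likely in worlds where H is true

p_h_given_e = posterior(prior, p_e_given_h, p_e_given_not_h)               # ~0.67: E confirms H
p_h_given_not_e = posterior(prior, 1 - p_e_given_h, 1 - p_e_given_not_h)   # ~0.25: not-E disconfirms H

# Conservation of Expected Evidence: the prior is the expected posterior,
# so confirmation by E forces disconfirmation by not-E.
p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
assert abs(p_h_given_e * p_e + p_h_given_not_e * (1 - p_e) - prior) < 1e-12
```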
The fact that MIRI has yet to produce (to my knowledge) any major empirically validated predictions or important practical insights into the nature of AI, or AI progress, in the last 20 years, undermines the idea that they have the type of special insight into AI that would allow them to express high confidence in a doom model like the one outlined in (4).
In summer 2022, Quintin Pope was explaining the results of the ROME paper to Eliezer. Eliezer impatiently interrupted him and said “so they found that facts were stored in the attention layers, so what?”. Of course, this was exactly wrong—Bau et al. found the circuits in mid-network MLPs. Yet, there was no visible moment of “oops” for Eliezer.
In summer 2022, Quintin Pope was explaining the results of the ROME paper to Eliezer. Eliezer impatiently interrupted him and said “so they found that facts were stored in the attention layers, so what?”. Of course, this was exactly wrong—Bau et al. found the circuits in mid-network MLPs. Yet, there was no visible moment of “oops” for Eliezer.
I think I am missing context here. Why is that distinction between facts localized in attention layers and in MLP layers so earth-shaking that Eliezer should have been shocked and awed by a quick guess during conversation being wrong, and so revealing an anecdote that you feel it is the capstone of your comment, crystallizing everything wrong about Eliezer into a story?
^ Aggressive strawman which ignores the main point of my comment. I didn’t say “earth-shaking” or “crystallizing everything wrong about Eliezer” or that the situation merited “shock and awe.” Additionally, the anecdote was unrelated to the other section of my comment, so I didn’t “feel” it was a “capstone.”
I would have hoped, with all of the attention on this exchange, that someone would reply “hey, TurnTrout didn’t actually say that stuff.” You know, local validity and all that. I’m really not going to miss this site.
Anyways, gwern, it’s pretty simple. The community edifies this guy and promotes his writing as a way to get better at careful reasoning. However, my actual experience is that Eliezer goes around doing things like e.g. impatiently interrupting people and being instantly wrong about it (importantly, in the realm of AI, as was the original context). This makes me think that Eliezer isn’t deploying careful reasoning to begin with.
^ Aggressive strawman which ignores the main point of my comment. I didn’t say “earth-shaking” or “crystallizing everything wrong about Eliezer” or that the situation merited “shock and awe.”
I, uh, didn’t say you “say” either of those: I was sarcastically describing your comment about an anecdote that scarcely even seemed to illustrate what it was supposed to, much less was so important as to be worth recounting years later as a high profile story (surely you can come up with something better than that after all this time?), and did not put my description in quotes meant to imply literal quotation, like you just did right there. If we’re going to talk about strawmen...
someone would reply “hey, TurnTrout didn’t actually say that stuff.”
No one would say that or correct me for falsifying quotes, because I didn’t say you said that stuff. They might (and some do) disagree with my sarcastic description, but they certainly weren’t going to say ‘gwern, TurnTrout never actually used the phrase “shocked and awed” or the word “crystallizing”, how could you just make stuff up like that???’ …Because I didn’t. So it seems unfair to judge LW and talk about how you are “not going to miss this site”. (See what I did there? I am quoting you, which is why the text is in quotation marks, and if you didn’t write that in the comment I am responding to, someone is probably going to ask where the quote is from. But they won’t, because you did write that quote).
You know, local validity and all that. I’m really not going to miss this site.
In jumping to accusations of making up quotes and attacking an entire site for not immediately criticizing me in the way you are certain I should be criticized and saying that these failures illustrate why you are quitting it, might one say that you are being… overconfident?
Additionally, the anecdote was unrelated to the other section of my comment, so I didn’t “feel” it was a “capstone.”
Quite aside from it being in the same comment and so you felt it was related, it was obviously related to your first half about overconfidence in providing an anecdote of what you felt was overconfidence, and was rhetorically positioned at the end as the concrete Eliezer conclusion/illustration of the first half about abstract MIRI overconfidence. And you agree that that is what you are doing in your own description, that he “isn’t deploying careful reasoning” in the large things as well as the small, and you are presenting it as a small self-contained story illustrating that general overconfidence:
However, my actual experience is that Eliezer goes around doing things like e.g. impatiently interrupting people and being instantly wrong about it (importantly, in the realm of AI, as was the original context). This makes me think that Eliezer isn’t deploying careful reasoning to begin with.
That said, it also appears to me that Eliezer is probably not the most careful reasoner, and indeed often seems (perhaps egregiously) overconfident.
That doesn’t mean one should begrudge people finding value in the sequences, although it is certainly not ideal if people take them as mantras rather than useful pointers and explainers for basic things (I didn’t read them, so might have an incorrect view here). There does appear to be some tendency to just link to some point made in the sequences as some airtight thing, although I haven’t found it too pervasive recently.
Disagree. Epistemics is a group project, and impatiently interrupting people can make both you and your interlocutor less likely to combine your information into correct conclusions. It is also evidence that you’re incurious internally, which makes you worse at reasoning, though I don’t want to speculate on Eliezer’s internal experience in particular.
I agree with the first sentence. I agree with the second sentence with the caveat that it’s not strong absolute evidence, but mostly applies to the given setting (which is exactly what I’m saying).
People aren’t fixed entities and the quality of their contributions can vary over time and depend on context.
One day a mathematician doesn’t know a thing. The next day they do. In between they made no observations with their senses of the world.
It’s possible to make progress through theoretical reasoning. It’s not my preferred approach to the problem (I work on a heavily empirical team at a heavily empirical lab) but it’s not an invalid approach.
I personally have updated a fair amount over time on
people (going on) expressing invalid reasoning for their beliefs about timelines and alignment;
people (going on) expressing beliefs about timelines and alignment that seemed relatively more explicable via explanations other than “they have some good reason to believe this that I don’t know about”;
other people’s alignment hopes and mental strategies having more visible flaws and visible doomednesses;
other people mostly not seeming to cumulatively integrate the doomednesses of their approaches into their mental landscape as guiding elements;
my own attempts to do so failing in a different way, namely that I’m too dumb to move effectively in the resulting modified landscape.
We can back out predictions of my personal models from this, such as “we will continue to not have a clear theory of alignment” or “there will continue to be consensus views that aren’t supported by reasoning that’s solid enough that it ought to produce that consensus if everyone is being reasonable”.
I thought the first paragraph and the bolded bit of your comment seemed insightful. I don’t see why what you’re saying is wrong – it seems right to me (but I’m not sure).
(I didn’t get anything out of it, and it seems kind of aggressive in a way that seems non-sequitur-ish, and also I am pretty sure mischaracterizes people. I didn’t downvote it, but have disagree-voted with it)
I basically agree with your overall comment, but I’d like to push back in one spot:
If your model of reality has the power to make these sweeping claims with high confidence
From my understanding, for at least Nate Soares, he claims his internal case for >80% doom is disjunctive and doesn’t route all through 1, 2, 3, and 4.
I don’t really know exactly what the disjuncts are, so this doesn’t really help and I overall agree that MIRI does make “sweeping claims with high confidence”.
I think your summary is a good enough quick summary of my beliefs. The minutiae that I object to are how confident and specific lots of parts of your summary are. I think many of the claims in the summary can be adjusted or completely changed and still lead to bad outcomes. But it’s hard to add lots of uncertainty and options to a quick summary, especially one you disagree with, so that’s fair enough. (As a side note, that paper you linked isn’t intended to represent anyone else’s views, other than myself and Peter, and we are relatively inexperienced. I’m also no longer working at MIRI).
I’m confused about why your <20% isn’t sufficient for you to want to shut down AI research. Is it because of benefits outweigh the risk, or because we’ll gain evidence about potential danger and can shut down later if necessary?
I’m also confused about why being able to generate practical insights about the nature of AI or AI progress is something that you think should necessarily follow from a model that predicts doom. I believe something close enough to (1) from your summary, but I don’t have much idea (above general knowledge) of how the first company to build such an agent will do so, or when they will work out how to do it. One doesn’t imply the other.
I’m confused about why your <20% isn’t sufficient for you to want to shut down AI research. Is it because of benefits outweigh the risk, or because we’ll gain evidence about potential danger and can shut down later if necessary?
I think the expected benefits outweigh the risks, given that I care about the existing generation of humans (to a large, though not overwhelming, degree). The expected benefits here likely include (in my opinion) a large reduction in global mortality, a very large increase in the quality of life, a huge expansion in material well-being, and more generally a larger and more vibrant world earlier in time. Without AGI, I think most existing people would probably die and get replaced by the next generation of humans, in a relatively much poorer world (compared to the alternative).
I also think the absolute level of risk from AI barely decreases if we globally pause. My best guess is that pausing would mainly just delay adoption without significantly impacting safety. Under my model of AI, the primary risks are long-term, and will happen substantially after humans have already gradually “handed control” over to the AIs and retired their labor on a large scale. Most of these problems—such as cultural drift and evolution—do not seem to be the type of issue that can be satisfactorily solved in advance, prior to a pause (especially by working out a mathematical theory of AI, or something like that).
On the level of analogy, I think of AI development as more similar to “handing off control to our children” than “developing a technology that disempowers all humans at a discrete moment in time”. In general, I think the transition period to AI will be more diffuse and incremental than MIRI seems to imagine, and there won’t be a sharp distinction between “human values” and “AI values” either during, or after the period.
(I also think AIs will probably be conscious in a way that’s morally important, in case that matters to you.)
In fact, I think it’s quite plausible the absolute level of AI risk would increase under a global pause, rather than going down, given the high level of centralization of power required to achieve a global pause, and the perverse institutions and cultural values that would likely arise under such a regime of strict controls. As a result, even if I weren’t concerned at all about the current generation of humans, and their welfare, I’d still be pretty hesitant to push pause on the entire technology.
(I think of technology as itself being pretty risky, but worth it. To me, pushing pause on AI is like pushing pause on technology itself, in the sense that they’re both generically risky yet simultaneously seem great on average. Yes, there are dangers ahead. But I think we can be careful and cautious without completely ripping up all the value for ourselves.)
Would most existing people accept a gamble with a 20% chance of death in the next 5 years and an 80% chance of life extension and radically better technology? I concede that many would, but I think it’s far from universal, and I wouldn’t be too surprised if half of people or more think this isn’t for them.
I personally wouldn’t want to take that gamble (strangely enough I’ve been quite happy lately and my life has been feeling meaningful, so the idea of dying in the next 5 years sucks).
(Also, I want to flag that I strongly disagree with your optimism.)
For what it’s worth, while my credence in human extinction from AI in the 21st century is 10-20%, I think the chance of human extinction in the next 5 years is much lower. I’d put that at around 1%. The main way I think AI could cause human extinction is by just generally accelerating technology and making the world a scarier and more dangerous place to live. I don’t really buy the model in which an AI will soon foom until it becomes a ~god.
I like this framing. I think the more common statement would be a 20% chance of death in 10-30 years, and an 80% chance of life extension and much better technology that they might not live to see.
I think the majority of humanity would actually take this bet. They are not utilitarians or longtermists.
So if the wager is framed in this way, we’re going full steam ahead.
I’ll say yet another time that your tech-tree model doesn’t make sense to me. To get immortality/mind uploading, you need really overpowered tech, far above the level at which killing all humans and starting to disassemble the planet becomes negligibly cheap. So I wouldn’t expect “existing people would probably die” to change much under your model of “AIs can be misaligned, but killing all humans is too costly”.
(I also think AIs will probably be conscious in a way that’s morally important, in case that matters to you.)
I don’t think that’s either a given or something we can ever know for sure. “Handing off” the world to robots and AIs that for all we know might be perfect P-zombies doesn’t feel like a good idea.
Given a low prior probability of doom as apparent from the empirical track record of technological progress, I think we should generally be skeptical of purely theoretical arguments for doom, especially if they are vague and make no novel, verifiable predictions prior to doom.
And why is such use of the empirical track record valid? Like, what’s the actual hypothesis here? What law of nature says “if technological progress hasn’t caused doom yet, it won’t cause it tomorrow”?
MIRI’s arguments for doom are often difficult to pin down, given the informal nature of their arguments, and in part due to their heavy reliance on analogies, metaphors, and vague supporting claims instead of concrete empirically verifiable models.
And arguments against are based on concrete empirically verifiable models of metaphors.
If your model of reality has the power to make these sweeping claims with high confidence, then you should almost certainly be able to use your model of reality to make novel predictions about the state of the world prior to AI doom that would help others determine if your model is correct.
Doesn’t MIRI’s model predict some degree of the whole Shoggoth/actress thing in current systems? Seems verifiable.
I share your frustration with MIRI’s communications with the alignment community.
And, the tone of this comment smells to me of danger. It looks a little too much like strawmanning, which always also implies that anyone who believes this scenario must be, at least in this context, an idiot. Since even rationalists are human, this leads to arguments instead of clarity.
I’m sure this is an accident born of frustration, and the unclarity of the MIRI argument.
I think we should prioritize not creating a polarized doomer-vs-optimist split in the safety community. It is very easy to do, and it looks to me like that’s frequently how important movements get bogged down.
Since time is of the essence, this must not happen in AI safety.
We can all express our views, we just need to play nice and extend the benefit of the doubt. MIRI actually does this quite well, although they don’t convey their risk model clearly. Let’s follow their example in the first and not the second.
Eliezer’s response to claims about unfalsifiability, namely that “predicting endpoints is easier than predicting intermediate points”, seems like a cop-out to me, since this would seem to reverse the usual pattern in forecasting and prediction, without good reason.
Note that MIRI has made some intermediate predictions. For example, I’m fairly certain Eliezer predicted that AlphaGo would go 5 for 5 against Lee Sedol, and it didn’t. I would respect his intellectual honesty more if he’d registered the alleged difficulty of intermediate predictions before making them unsuccessfully.
I think MIRI has something valuable to contribute to alignment discussions, but I’d respect them more if they did a “5 Whys” type analysis on their poor prediction track record, so as to improve the accuracy of predictions going forwards. I’m not seeing any evidence of that. It seems more like the standard pattern where a public figure invests their ego in some position, then tries to avoid losing face.
On your (2), I think you’re ignoring an understanding-related asymmetry:
Without clear models describing (a path to) a solution, it is highly unlikely we have a workable solution to a deep and complex problem:
Absence of concrete [we have (a path to) a solution] is pretty strong evidence of absence. [EDIT for clarity, by “we have” I mean “we know of”, not “there exists”; I’m not claiming there’s strong evidence that no path to a solution exists]
Whether or not we have clear models of a problem, it is entirely possible for it to exist and to kill us:
Absence of concrete [there-is-a-problem] evidence is weak evidence of absence.
A problem doesn’t have to wait until we have formal arguments or strong, concrete empirical evidence for its existence before killing us. To claim that it’s “premature” to shut down the field before we have [evidence of type x], you’d need to make a case that [doom before we have evidence of type x] is highly unlikely.
A large part of the MIRI case is that there is much we don’t understand, and that parts of the problem we don’t understand are likely to be hugely important. An evidential standard that greatly down-weights any but the most rigorous, legible evidence is liable to lead to death-by-sampling-bias.
Of course it remains desirable for MIRI arguments to be as legible and rigorous as possible. Empiricism would be nice too (e.g. if someone could come up with concrete problems whose solution would be significant evidence for understanding something important-according-to-MIRI about alignment).
But ignoring the asymmetry here is a serious problem.
On your (3), it seems to me that you want “skeptical” to do more work than is reasonable. I agree that we “should be skeptical of purely theoretical arguments for doom”—but initial skepticism does not imply [do not update much on this]. It implies [consider this very carefully before updating]. It’s perfectly reasonable to be initially skeptical but to make large updates once convinced.
I do not think [the arguments are purely theoretical] is one of your true objections—rather it’s that you don’t find these particular theoretical arguments convincing. That’s fine, but no argument against theoretical arguments.
tl;dr: “lack of rigorous arguments for P is evidence against P” is typically valid, but not in case of P = AI X-risk.
A high-level reaction to your point about unfalsifiability: There seems to be a general sentiment that “AI X-risk arguments are unfalsifiable ==> the arguments are incorrect” and “AI X-risk arguments are unfalsifiable ==> AI X-risk is low”.[1] I am very sympathetic to this sentiment—but I also think that in the particular case of AI X-risk, it is not justified.[2] For quite non-obvious reasons.
Why I believe this? Take this simplified argument for AI X-risk:
Some important future AIs will be goal-oriented, or will behave in a goal-oriented way sometimes[3]. (Read: If you think of them as trying to maximise some goal, you will make pretty good predictions.[4])
The “AI-progress tech-tree” is such that discontinuous jumps in impact are possible. In particular, we will one day go from “an AI that is trying to maximise some goal, but not doing a very good job of it” to “an AI that is able to treat humans and other existing AIs as ‘environment’, and is going to do a very good job at maximising some goal”.
For virtually any[5] goal specification, doing a sufficiently[6] good job at maximising that goal specification leads to an outcome where every human is dead.
FWIW, I think that having a strong opinion on (1) and (2), in either direction, is not justified.[7] But in this comment, I only want to focus on (3) --- so let’s please pretend, for the sake of this discussion, that we find (1) and (2) at least plausible. What I claim is that even if we lived in a universe where (3) is true, we should still expect even the best arguments for (3) (that we might realistically identify) to be unfalsifiable—at least given realistic constraints on falsification effort and assuming that we use rigorous standards for what counts as solid evidence, like people do in mathematics, physics, or CS.
What is my argument for “even the best arguments for (3) will be unfalsifiable”? Suppose you have an environment E that contains a Cartesian agent (a thing that takes actions in the environment and—let’s assume for simplicity—has perfect information about the environment, but whose decision-making computation happens outside of the environment). And suppose that this agent acts in a way that maximises[8] some goal specification[9] over E. Now, E might or might not contain humans, or representations of humans. We can now ask the following question: Is it true that, unless we spend an extremely high amount of effort (eg, >5 civilisation-years), any (non-degenerate[10]) goal specification we come up with will result in human extinction[11] in E when maximised by the agent? I refer to this as “Extinction-level Goodhart’s Law”.
I claim that: (A) Extinction-level Goodhart’s Law plausibly holds in the real world. (At least the thought experiments I know of, eg here or here, suggest it does.) (B) Even if Extinction-level Goodhart’s Law were true in the real world, it would still be false in environments where we could verify it experimentally (today, or soon) or mathematically (by proofs, given realistic amounts of effort). ==> And (B) implies that if we want “solid arguments”, rather than just thought experiments, we might be kinda screwed when it comes to Extinction-level Goodhart’s Law.
And why do I believe (B)? The long story is that I try to gesture at this in my sequence on “Formalising Catastrophic Goodhart”. The short story is that there are many strategies for finding “safe to optimise” goal specifications that work in simpler environments, but not in the real world (examples below). So to even start gaining evidence on whether the law holds in our world, we need to investigate environments where those simpler strategies don’t work—and it seems to me that those are always too complex for us to analyse mathematically or to run an AI there which could “do a sufficiently good job at trying to maximise the goal specification”. Some examples of the above-mentioned strategies for finding safe-to-optimise goal specifications: (i) The environment contains no (representations of) humans, or those “humans” can’t “die”, so it doesn’t matter. EG, most gridworlds. (ii) The environment doesn’t have any resources or similar things that would give rise to convergent instrumental goals, so it doesn’t matter. EG, most gridworlds. (iii) The environment allows for a simple formula that checks whether “humans” are “extinct”, so just add a huge penalty if that formula holds. (EG, most gridworlds where you added “humans”.) (iv) There is a limited set of actions that result in “killing” “humans”, so just add a huge penalty to those. (v) There is a simple formula for expressing a criterion that limits the agent’s impact. (EG, “don’t go past these coordinates” in a gridworld.)
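As a concrete, deliberately toy illustration of strategies (iii) and (iv) above (my own hypothetical sketch; all names are invented): this kind of patch works in a gridworld precisely because the “extinction check” and the “dangerous actions” are trivially enumerable, which is exactly what the real world does not give us.

```python
EXTINCTION_PENALTY = 1e9  # huge penalty, in the spirit of strategy (iii)

def humans_extinct(state: dict) -> bool:
    """Trivially checkable in a gridworld; no analogous formula exists for the real world."""
    return state["human_tokens_alive"] == 0

def patched_reward(state: dict, base_reward: float) -> float:
    """Base task reward, minus a huge penalty whenever the extinction check fires."""
    if humans_extinct(state):
        return base_reward - EXTINCTION_PENALTY
    return base_reward

# Strategy (iv): alternatively, forbid the handful of actions that can "kill" the
# gridworld "humans" -- again only possible because that set is small and known in advance.
FORBIDDEN_ACTIONS = {"step_on_human_tile"}

def allowed(action: str) -> bool:
    return action not in FORBIDDEN_ACTIONS
```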
All together, this should explain why the “unfalsifiability” counter-argument does not hold as much weight, in the case of AI X-risk, as one might intuitively expect.
If I understand you correctly, you would endorse something like this? Quite possibly with some disclaimers, ofc. (Certainly I feel that many other people endorse something like this.)
I acknowledge that the general heuristic “argument for X is unfalsifiable ==> the argument is wrong” holds in most cases. And I am aware we should be sceptical whenever somebody goes “but my case is an exception!”. Despite this, I still believe that AI X-risk genuinely is different from invisible dragons in your garage and conspiracy theories.
That said, I feel there should be a bunch of other examples where the heuristic doesn’t apply. If you have some that are good, please share!
An example of this would be if GPT-4 acted like a chatbot most of the time, but tried to take over the world if you prompt it with “act as a paperclipper”.
By “virtually any” goal specification (leading to extinction when maximised), I mean that finding a goal specification for which extinction does not happen (when maximised) is extremely difficult. One example of operationalising “extremely difficult” would be “if our civilisation spent all its efforts on trying to find some goal specification, for 5 years from today, we would still fail”. In particular, the claim (3) is meant to imply that if you do anything like “do RLHF for a year, then optimise the result extremely hard”, then everybody dies.
For the purposes of this simplified AI X-risk argument, the AIs from (2), which are “very good at maximising a goal”, are meant to qualify for the “sufficiently good job at maximising a goal” from (3). In practice, this is of course more complicated—see e.g. my post on Weak vs Quantitative Extinction-level Goodhart’s Law.
Or at least there are no publicly available writings, known to me, which could justify claims like “It’s >=80% likely that (1) (or 2) holds (or doesn’t hold)”. Of course, (1) and (2) are too vague for this to even make sense, but imagine replacing (1) and (2) by more serious attempts at operationalising the ideas that they gesture at.
Most reasonable ways of defining what “goal specification” means should work for the argument. As a simple example, we can think of having a reward function R : states --> R and maximising the sum of R(s) over any long time horizon.
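Writing that example out slightly more formally (my notation, matching the footnote’s informal description):

```latex
R : S \to \mathbb{R},
\qquad
\pi^{*} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} R(s_t) \right]
\quad \text{for some long horizon } T.
```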
To be clear, there are some trivial ways of avoiding Extinction-level Goodhart’s Law. One is to consider a constant utility function, which means that the agent might as well take random actions. Another would be to use reward functions in the spirit of “shut down now, or get a huge penalty”. And there might be other weird edge cases. I acknowledge that this part should be better developed. But in the meantime, hopefully it is clear—at least somewhat—what I am trying to gesture at.
Most environments won’t contain actual humans. So by “human extinction”, I mean the “metaphorical humans being metaphorically dead”. EG, if your environment was pacman, then the natural thing would be to view the pacman as representing a “human”, and being eaten by the ghosts as representing “extinction”. (Not that this would be a good model for studying X-risk.)
An illustrative example, describing a scenario that is similar to our world, but where “Extinction-level Goodhart’s law” would be false & falsifiable (hat tip Vincent Conitzer):
Suppose that we somehow only start working on AGI many years from now, after we have already discovered a way to colonize the universe at close to the speed of light. And some of the colonies are already unreachable, outside of our future lightcone. But suppose we still understand “humanity” as the collection of all humans, including those in the unreachable colonies. Then any AI that we build, no matter how smart, would be unable to harm these portions of humanity. And thus full-blown human extinction, from AI we build here on Earth, would be impossible. And you could “prove” this using a simple, yet quite rigorous, physics argument.[1]
(To be clear, I am not saying that “AI X-risk’s unfalsifiability is justifiable ==> we should update in favour of AI X-risk compared to our priors”. I am just saying that the justifiability means we should not update against it compared to our priors. Though I guess that in practice, it means that some people should undo some of their updates against AI X-risk… )
And sure, maybe some weird magic is actually possible, and the AI could actually beat the speed of light. But whatever, I am ignoring this, and an argument like this would count as falsification as far as I am concerned.
FWIW, I acknowledge that my presentation of the argument isn’t ironclad, but I hope that it makes my position a bit clearer. If anybody has ideas for how to present it better, or has some nice illustrative examples, I would be extremely grateful.
It is difficult to explain in a brief comment why I think the argument outlined above is very weak. Instead of going into the various subclaims here in detail, for now I want to simply say, “If your model of reality has the power to make these sweeping claims with high confidence, then you should almost certainly be able to use your model of reality to make novel predictions about the state of the world prior to AI doom that would help others determine if your model is correct.”
The fact that MIRI has yet to produce (to my knowledge) any major empirically validated predictions or important practical insights into the nature of AI, or AI progress, in the last 20 years, undermines the idea that they have the type of special insight into AI that would allow them to express high confidence in a doom model like the one outlined in (4).
Eliezer’s response to claims about unfalsifiability, namely that “predicting endpoints is easier than predicting intermediate points”, seems like a cop-out to me, since this would seem to reverse the usual pattern in forecasting and prediction, without good reason.
Since I think AI will most likely be a very good thing for currently existing people, I am much more hesitant to “shut everything down” compared to MIRI. I perceive MIRI researchers as broadly well-intentioned, thoughtful, yet ultimately fundamentally wrong in their worldview on the central questions that they research, and therefore likely to do harm to the world. This admittedly makes me sad to think about.
It’s pretty standard? Like, we can make reasonable predictions of the climate in 2100, even if we can’t predict the weather two months ahead.
To be blunt, it’s not just that Eliezer lacks a positive track record in predicting the nature of AI progress, which might be forgivable if we thought he had really good intuitions about this domain. Empiricism isn’t everything; theoretical arguments are important too and shouldn’t be dismissed. But-
Eliezer thought AGI would be developed from a recursively self-improving seed AI coded up by a small group, “brain in a box in a basement” style. He dismissed and mocked connectionist approaches to building AI. His writings repeatedly downplayed the importance of compute, and he has straw-manned writers like Moravec who did a better job at predicting when AGI would be developed than he did.
Old MIRI intuition pumps about why alignment should be difficult, like the “Outcome Pump” and “Sorcerer’s Apprentice”, are now forgotten; it was a surprise that it would be easy to create helpful genies like LLMs who basically just do what we want. Remaining arguments for the difficulty of alignment are esoteric considerations about inductive biases, counting arguments, etc. So yes, let’s actually look at these arguments and not just dismiss them, but let’s not pretend that MIRI has a good track record.
I think the core concerns remain, and more importantly, there are other rather doom-y scenarios that have opened up involving AI systems more similar to the ones we have now, which aren’t the straight-up singleton-ASI foom. The problem here is IMO not “this specific doom scenario will become a thing” but “we don’t have anything resembling a GOOD vision of the future with this tech that we are nevertheless developing at breakneck pace”. Yet the number of possible dystopian or apocalyptic scenarios is enormous. Part of this is “what if we lose control of the AIs” (singleton or multipolar), part of it is “what if we fail to structure our society around having AIs” (loss of control, mass wireheading, and a lot of other scenarios I’m not sure how to name). The only positive vision the “optimists” on this have to offer is “don’t worry, it’ll be fine, this clearly revolutionary and never-before-seen technology that calls into question our very role in the world will play out the same way every invention ever did”. And that’s not terribly convincing.
I’m not saying anything at the object level about MIRI’s models; my point is that “outcomes are more predictable than trajectories” is a pretty standard, epistemically non-suspicious statement about a wide range of phenomena. Moreover, in these particular circumstances (and many others) you can reduce it to an object-level claim, like “do observations on current AIs generalize to future AIs?”
How does the question of whether AI outcomes are more predictable than AI trajectories reduce to the (vague) question of whether observations on current AIs generalize to future AIs?
ChatGPT falsifies predictions about future superintelligent recursively self-improving AI only if ChatGPT is a generalizable predictor of the design of future superintelligent AIs.
There will be future superintelligent AIs that improve themselves. But they will be neural networks; they will at the very least start out as a compute-intensive project; and in the infant stages of their self-improvement cycles they will understand and be motivated by human concepts, rather than being dumb specialized systems that are only good for bootstrapping themselves to superintelligence.
Edit: Retracted because some of my exegesis of the historical seed AI concept may not be accurate
True knowledge about later times doesn’t generally let you make arbitrary predictions about intermediate times. But it does usually imply that you can make some theory-specific predictions about intermediate times.
Thus, vis-a-vis your examples: Predictions about the climate in 2100 don’t involve predicting tomorrow’s weather. But they do almost always involve predictions about the climate in 2040 and 2070, and they’d be really sus if they didn’t.
Similarly:
If an astronomer thought that an asteroid was going to hit the earth, the astronomer generally could predict the points at which it would be observed in the future before hitting the earth. This is true even if they couldn’t, for instance, predict the color of the asteroid.
People who predicted that C19 would infect millions by T + 5 months also had predictions about how many people would be infected at T + 2 months. This is true even if they couldn’t predict how hard it would be to make a vaccine.
(Extending analogy to scale rather than time) The ability to predict that nuclear war would kill billions involves a pretty good explanation for how a single nuke would kill millions.
So I think that—entirely apart from specific claims about whether MIRI does this—it’s pretty reasonable to expect them to be able to make some theory-specific predictions about the before-end-times, although it’s unreasonable to expect them to make arbitrary theory-specific predictions.
I agree this is usually the case, but I think it’s not always true, and I don’t think it’s necessarily true here. E.g., people as early as da Vinci guessed that we’d be able to fly long before we had planes (or even any flying apparatus which worked), because birds can fly, and so we should be able to as well (at least, this was da Vinci’s and the Wright brothers’ reasoning). That end point was not dependent on details (early flying designs had wings like a bird, a design which we did not keep :p), but was closer to a laws-of-physics claim (if birds can do it, there isn’t anything fundamentally holding us back from doing it either).
Superintelligence holds a similar place in my mind: intelligence is physically possible, because we exhibit it, and it seems quite arbitrary to assume that we’ve maxed it out. But also, intelligence is obviously powerful, and reality is obviously more manipulable than we currently have the means to manipulate it. E.g., we know that we should be capable of developing advanced nanotech, since cells already demonstrate that it’s possible, and that space travel/terraforming/etc. is possible.
These two things together—“we can likely create something much smarter than ourselves” and “reality can be radically transformed”—are enough to make me feel nervous. At some point I expect most of the universe to be transformed by agents; whether this is us, or aligned AIs, or misaligned AIs, or what, I don’t know. But looking ahead and noticing that I don’t know how to select the “aligned AI” option from the set “things which will likely be able to radically transform matter” seems enough cause, in my mind, for exercising caution.
There’s a pretty big difference between statements like “superintelligence is physically possible”, “superintelligence could be dangerous” and statements like “doom is >80% likely in the 21st century unless we globally pause”. I agree with (and am not objecting to) the former claims, but I don’t agree with the latter claim.
I also agree that it’s sometimes true that endpoints are easier to predict than intermediate points. I haven’t seen Eliezer give a reasonable defense of this thesis as it applies to his doom model. If all he means here is that superintelligence is possible, it will one day be developed, and we should be cautious when developing it, then I don’t disagree. But I think he’s saying a lot more than that.
Your general point is true, but it’s not necessarily true (1) that a correct model can predict the timing of AGI, or (2) that the predictable precursors to disaster will occur before the practical c-risk (catastrophic-risk) point of no return. While I’m not as pessimistic as Eliezer, my mental model has these two limitations. My model does predict that, prior to disaster, a fairly safe, non-ASI AGI or pseudo-AGI (e.g. GPT6, a chatbot that can do a lot of office jobs and menial jobs pretty well) is likely to be invented before the really deadly one (if any[1]). But if I predicted right, it probably won’t make people take my c-risk concerns more seriously?
technically I think AGI inevitably ends up deadly, but it could be deadly “in a good way”
I think it’s more similar to saying that the climate in 2040 is less predictable than the climate in 2100, or saying that the weather 3 days from now is less predictable than the weather 10 days from now, which are both not true. By contrast, the weather vs. climate distinction is more of a difference between predicting point estimates vs. predicting averages.
It’s certainly not a simple question. Say the Gulf Stream is projected to collapse somewhere between now and 2095, with a median date of 2050. So, slightly abusing the meaning of confidence intervals, we can say that in 2100 we won’t have the Gulf Stream with probability >95%, while in 2040 the Gulf Stream will still be here with probability ~60%, which is literally less predictable.
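To spell out the arithmetic behind this, here is a toy sketch (illustrative only; the normal distribution and its parameters are invented purely to match the “somewhere between now and 2095, median ~2050” framing):

```python
from math import erf, sqrt

# Toy model (made-up parameters): collapse year ~ Normal(2050, 27),
# chosen so the median is 2050 and ~95% of the mass falls before 2095.
MU, SIGMA = 2050.0, 27.0

def p_collapsed_by(year: float) -> float:
    """P(collapse year <= year) under the toy normal distribution."""
    return 0.5 * (1.0 + erf((year - MU) / (SIGMA * sqrt(2.0))))

print(f"P(collapsed by 2040) ~ {p_collapsed_by(2040):.2f}")  # ~0.36: genuinely uncertain
print(f"P(collapsed by 2100) ~ {p_collapsed_by(2100):.2f}")  # ~0.97: the endpoint claim is near-certain
```

The same distribution that leaves the 2040 question genuinely uncertain makes the 2100 endpoint close to settled, which is the sense in which the intermediate date is “less predictable”.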
Chemists would give an example of chemical reactions, where final thermodynamically stable states are easy to predict, while unstable intermediate states are very hard to even observe.
Very dumb example: if you are observing a radioactive atom with a half-life of one minute, you can’t predict when the atom is going to decay, but you can be very certain that it will have decayed within an hour.
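For concreteness, the standard exponential-decay arithmetic (the numbers are just the worked-out version of the example): with half-life $t_{1/2} = 1$ minute,

$$P(\text{still undecayed after } t) = 2^{-t/t_{1/2}}, \qquad P(\text{still undecayed after 60 min}) = 2^{-60} \approx 8.7 \times 10^{-19},$$

so the one-hour endpoint is essentially certain, even though the conditional chance of decaying in any given minute stays at exactly 1/2, which is why the precise minute remains unpredictable.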
And why don’t you accept the classic MIRI example that even if it’s impossible for a human to predict the moves of Stockfish 16, you can be certain that Stockfish will win?
I agree there are examples where the end state is easier to predict than the intermediate states. Here, it’s because we have strong empirical and theoretical reasons to think that chemicals will settle into some equilibrium after a reaction. With AGI, I have yet to see a compelling argument for why we should expect a specific easy-to-predict equilibrium state after it’s developed, which somehow depends very little on how the technology is developed.
It’s also important to note that, even if we know that there will be an equilibrium state after AGI, more evidence is generally needed to establish that the end equilibrium state will specifically be one in which all humans die.
I don’t accept this argument as a good reason to think doom is highly predictable partly because I think the argument is dramatically underspecified without shoehorning in assumptions about what AGI will look like to make the argument more comprehensible. I generally classify arguments like this under the category of “analogies that are hard to interpret because the assumptions are so unclear”.
To help explain my frustration at the argument’s ambiguity, I’ll just give a small yet certainly non-exhaustive set of questions I have about this argument:
Are we imagining that creating an AGI implies that we play a zero-sum game against it? Why?
Why is it a simple human vs. AGI game anyway? Does that mean we’re lumping together all the humans into a single agent, and all the AGIs into another agent, and then they play off against each other like a chess match? What is the justification for believing the battle will be binary like this?
Are we assuming the AGI wants to win? Maybe it’s not an agent at all. Or maybe it’s an agent but not the type of agent that wants this particular type of outcome.
What does “win” mean in the general case here? Does it mean the AGI merely gets more resources than us, or does it mean the AGI kills everyone? These seem like different yet legitimate ways that one can “win” in life, with dramatically different implications for the losing parties.
There’s a lot more I can say here, but the basic point I want to make is that once you start fleshing this argument out, and giving it details, I think it starts to look a lot weaker than the general heuristic that Stockfish 16 will reliably beat humans in chess, even if we can’t predict its exact moves.
See here
I don’t think the Gulf Stream can collapse as long as the Earth spins; I guess you mean the AMOC?
Yep, AMOC is what I mean
>Like, we can make reasonable predictions of the climate in 2100, even if we can’t predict the weather two months ahead.
This is a strange claim to make in a thread about AGI destroying the world. Obviously if AGI destroys the world we cannot predict the climate in 2100.
Predicting the climate in 2100 requires you to make a number of detailed claims about the years between now and 2100 (for example, the carbon emissions per year), and it is precisely the lack of these claims that @Matthew Barnett is talking about.
I strongly doubt we can predict the climate in 2100. An actual prediction would require a model that also incorporates the possibility of nuclear fusion, geoengineering, AGIs altering the atmosphere, etc.
I think you are abusing/misusing the concept of falsifiability here. Ditto for empiricism. You aren’t the only one to do this; I’ve seen it happen a lot over the years, and it’s very frustrating. I unfortunately am busy right now but would love to give a fuller response someday, especially if you are genuinely interested to hear what I have to say (which I doubt, given your attitude towards MIRI).
I’m a bit surprised you suspect I wouldn’t be interested in hearing what you have to say?
I think the amount of time I’ve spent engaging with MIRI perspectives over the years provides strong evidence that I’m interested in hearing opposing perspectives on this issue. I’d guess I’ve engaged with MIRI perspectives vastly more than almost everyone on Earth who explicitly disagrees with them as strongly as I do (although obviously some people like Paul Christiano and other AI safety researchers have engaged with them even more than me).
(I might not reply to you, but that’s definitely not because I wouldn’t be interested in what you have to say. I read virtually every comment-reply to me carefully, even if I don’t end up replying.)
I apologize, I shouldn’t have said that parenthetical.
Here’s a new approach: Your list of points 1-7. Would you also make those claims about me? (i.e. replace references to MIRI with references to Daniel Kokotajlo.)
You’ve made detailed predictions about what you expect in the next several years, on numerous occasions, and made several good-faith attempts to elucidate your models of AI concretely. There are many ways we disagree, and many ways I could characterize your views, but “unfalsifiable” is not a label I would tend to use for your opinions on AI. I do not mentally lump you together with MIRI in any strong sense.
OK, glad to hear. And thank you. :) Well, you’ll be interested to know that I think of my views on AGI as being similar to MIRI’s, just less extreme in various dimensions. For example I don’t think literally killing everyone is the most likely outcome, but I think it’s a very plausible outcome. I also don’t expect the ‘sharp left turn’ to be particularly sharp, such that I don’t think it’s a particularly useful concept. I also think I’ve learned a lot from engaging with MIRI and while I have plenty of criticisms of them (e.g. I think some of them are arrogant and perhaps even dogmatic) I think they have been more epistemically virtuous than the average participant in the AGI risk conversation, even the average ‘serious’ or ‘elite’ participant.
Huh, I was surprised to read this. I’ve imbibed a non-trivial fraction of your posts and comments here on LessWrong, and, before reading the above, my shoulder Daniel definitely saw extinction as the most likely existential catastrophe.
If you have the time, I’d be very interested to hear what you do think is the most likely outcome. (It’s very possible that you have written about this before and I missed it—my bad, if so.)
(My model of Daniel thinks the AI will likely take over, but probably will give humanity some very small fraction of the universe, for a mixture of “caring a tiny bit” and game-theoretic reasons)
Thanks, that’s helpful!
(Fwiw, I don’t find the ‘caring a tiny bit’ story very reassuring, for the same reasons as Wei Dai, although I do find the acausal trade story for why humans might be left with Earth somewhat heartening. (I’m assuming that by ‘game-theoretic reasons’ you mean acausal trade.))
Yep, Habryka is right. Also, I agree with Wei Dai re: reassuringness. I think literal extinction is <50% likely, but this is cold comfort given the badness of some of the plausible alternatives, and overall I think the probability of something comparably bad happening is >50%.
I want to publicly endorse and express appreciation for Matthew’s apparent good faith.
Every time I’ve ever seen him disagreeing about AI stuff on the internet (a clear majority of the times I’ve encountered anything he’s written), he’s always been polite, reasonable, thoughtful, and extremely patient. Obviously conversations sometimes entail people talking past each other, but I’ve seen him carefully try to avoid miscommunication, and (to my ability to judge) strawmanning.
Thank you, Matthew. Keep it up. : )
Followup: Matthew and I ended up talking about it in person. tl;dr of my position is that
Falsifiability is a symmetric two-place relation; one cannot say “X is unfalsifiable,” except as shorthand for saying “X and Y make the same predictions,” and thus Y is equally unfalsifiable. When someone is going around saying “X is unfalsifiable, therefore not-X,” that’s often a misuse of the concept—what they should say instead is “On priors / for other reasons (e.g. deference) I prefer not-X to X; and since both theories make the same predictions, I expect to continue thinking this instead of updating, since there won’t be anything to update on.”
What is the point of falsifiability-talk then? Well, first of all, it’s quite important to track when two theories make the same predictions, or the same-predictions-till-time-T. It’s an important part of the bigger project of extracting predictions from theories so they can be tested. It’s exciting progress when you discover that two theories make different predictions, and nail it down well enough to bet on. Secondly, it’s quite important to track when people are making this harder rather than easier—e.g. fortunetellers and pundits will often go out of their way to avoid making any predictions that diverge from what their interlocutors already would predict. Whereas the best scientists/thinkers/forecasters, the ones you should defer to, should be actively trying to find alpha and then exploit it by making bets with people around them. So falsifiability-talk is useful for evaluating people as epistemically virtuous or vicious. But note that if this is what you are doing, it’s all a relative thing in a different way—in the case of MIRI, for example, the question should be “Should I defer to them more, or less, than various alternative thinkers A, B, and C? --> Are they generally more virtuous about making specific predictions, seeking to make bets with their interlocutors, etc. than A, B, or C?”
So with that as context, I’d say that (a) It’s just wrong to say ‘MIRI’s theories of doom are unfalsifiable.’ Instead say ‘unfortunately for us (not for the plausibility of the theories), both MIRI’s doom theories and (insert your favorite non-doom theories here) make the same predictions until it’s basically too late.’ (b) One should then look at MIRI and be suspicious and think ‘are they systematically avoiding making bets, making specific predictions, etc. relative to the other people we could defer to? Are they playing the sneaky fortuneteller or pundit’s game?’ to which I think the answer is ‘no not at all, they are actually more epistemically virtuous in this regard than the average intellectual. That said, they aren’t the best either—some other people in the AI risk community seem to be doing better than them in this regard, and deserve more virtue points (and possibly deference points) therefore.’ E.g. I think both Matthew and I have more concrete forecasting track records than Yudkowsky?
This is partially derivable from Bayes’ rule. In order for you to gain confidence in a theory, you need to make observations which are more likely in worlds where the theory is correct. Since MIRI seems to have grown even more confident in their models, they must’ve observed something which is more likely under their models. Therefore, to obey Conservation of Expected Evidence, the world could have come out a different way which would have decreased their confidence. So it was falsifiable this whole time. However, in my experience, MIRI-sympathetic folk deny this for some reason.
It’s simply not possible, as a matter of Bayesian reasoning, to lawfully update (today) based on empirical evidence (like LLMs succeeding) in order to change your probability of a hypothesis that “doesn’t make” any empirical predictions (today).
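To spell out the Bayesian bookkeeping being invoked here (standard identities, nothing specific to MIRI or this exchange): conservation of expected evidence is just

$$\mathbb{E}_{e}\big[P(H \mid e)\big] \;=\; \sum_{e} P(e)\,P(H \mid e) \;=\; P(H),$$

so if some possible observation would have raised $P(H)$, some other possible observation must lower it; and if $P(e \mid H) = P(e \mid \lnot H)$ for every observation $e$ available today, then $P(H \mid e) = P(H)$ for all such $e$, and no observation available today can lawfully move the hypothesis in either direction.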
In summer 2022, Quintin Pope was explaining the results of the ROME paper to Eliezer. Eliezer impatiently interrupted him and said “so they found that facts were stored in the attention layers, so what?”. Of course, this was exactly wrong—Bau et al. found the circuits in mid-network MLPs. Yet, there was no visible moment of “oops” for Eliezer.
I think I am missing context here. Why is that distinction between facts localized in attention layers and in MLP layers so earth-shaking that Eliezer should have been shocked and awed by a quick guess during conversation being wrong, and so revealing an anecdote that you feel it is the capstone of your comment, crystallizing everything wrong about Eliezer into a story?
^ Aggressive strawman which ignores the main point of my comment. I didn’t say “earth-shaking” or “crystallizing everything wrong about Eliezer” or that the situation merited “shock and awe.” Additionally, the anecdote was unrelated to the other section of my comment, so I didn’t “feel” it was a “capstone.”
I would have hoped, with all of the attention on this exchange, that someone would reply “hey, TurnTrout didn’t actually say that stuff.” You know, local validity and all that. I’m really not going to miss this site.
Anyways, gwern, it’s pretty simple. The community edifies this guy and promotes his writing as a way to get better at careful reasoning. However, my actual experience is that Eliezer goes around doing things like e.g. impatiently interrupting people and being instantly wrong about it (importantly, in the realm of AI, as was the original context). This makes me think that Eliezer isn’t deploying careful reasoning to begin with.
I, uh, didn’t say you “say” either of those: I was sarcastically describing your comment about an anecdote that scarcely even seemed to illustrate what it was supposed to, much less was so important as to be worth recounting years later as a high profile story (surely you can come up with something better than that after all this time?), and did not put my description in quotes meant to imply literal quotation, like you just did right there. If we’re going to talk about strawmen...
No one would say that or correct me for falsifying quotes, because I didn’t say you said that stuff. They might (and some do) disagree with my sarcastic description, but they certainly weren’t going to say ‘gwern, TurnTrout never actually used the phrase “shocked and awed” or the word “crystallizing”, how could you just make stuff up like that???’ …Because I didn’t. So it seems unfair to judge LW and talk about how you are “not going to miss this site”. (See what I did there? I am quoting you, which is why the text is in quotation marks, and if you didn’t write that in the comment I am responding to, someone is probably going to ask where the quote is from. But they won’t, because you did write that quote).
In jumping to accusations of making up quotes and attacking an entire site for not immediately criticizing me in the way you are certain I should be criticized and saying that these failures illustrate why you are quitting it, might one say that you are being… overconfident?
Quite aside from it being in the same comment and so you felt it was related, it was obviously related to your first half about overconfidence in providing an anecdote of what you felt was overconfidence, and was rhetorically positioned at the end as the concrete Eliezer conclusion/illustration of the first half about abstract MIRI overconfidence. And you agree that that is what you are doing in your own description, that he “isn’t deploying careful reasoning” in the large things as well as the small, and you are presenting it as a small self-contained story illustrating that general overconfidence:
That said, it also appears to me that Eliezer is probably not the most careful reasoner, and indeed often appears (perhaps egregiously) overconfident. That doesn’t mean one should begrudge people finding value in the sequences, although it is certainly not ideal if people take them as mantras rather than useful pointers and explainers for basic things (I didn’t read them, so might have an incorrect view here). There does appear to be some tendency to just link to some point made in the sequences as some airtight thing, although I haven’t found it too pervasive recently.
You’re describing a situational character flaw which doesn’t really have any bearing on being able to reason carefully overall.
Disagree. Epistemics is a group project, and impatiently interrupting people can make both you and your interlocutor less likely to combine your information into correct conclusions. It is also evidence that you’re incurious internally, which makes you worse at reasoning, though I don’t want to speculate on Eliezer’s internal experience in particular.
I agree with the first sentence. I agree with the second sentence with the caveat that it’s not strong absolute evidence, but mostly applies to the given setting (which is exactly what I’m saying).
People aren’t fixed entities and the quality of their contributions can vary over time and depend on context.
One day a mathematician doesn’t know a thing. The next day they do. In between they made no observations with their senses of the world.
It’s possible to make progress through theoretical reasoning. It’s not my preferred approach to the problem (I work on a heavily empirical team at a heavily empirical lab) but it’s not an invalid approach.
I agree, and I was thinking explicitly of that when I wrote “empirical” evidence and predictions in my original comment.
I personally have updated a fair amount over time on
people (going on) expressing invalid reasoning for their beliefs about timelines and alignment;
people (going on) expressing beliefs about timelines and alignment that seemed relatively more explicable via explanations other than “they have some good reason to believe this that I don’t know about”;
other people’s alignment hopes and mental strategies having more visible flaws and visible doomednesses;
other people mostly not seeming to cumulatively integrate the doomednesses of their approaches into their mental landscape as guiding elements;
my own attempts to do so failing in a different way, namely that I’m too dumb to move effectively in the resulting modified landscape.
We can back out predictions of my personal models from this, such as “we will continue to not have a clear theory of alignment” or “there will continue to be consensus views that aren’t supported by reasoning that’s solid enough that it ought to produce that consensus if everyone is being reasonable”.
I thought the first paragraph and the bolded bit of your comment seemed insightful. I don’t see why what you’re saying is wrong – it seems right to me (but I’m not sure).
(I didn’t get anything out of it, and it seems kind of aggressive in a way that seems non-sequitur-ish, and also I am pretty sure it mischaracterizes people. I didn’t downvote it, but I have disagree-voted with it.)
I basically agree with your overall comment, but I’d like to push back in one spot:
From my understanding, Nate Soares, at least, claims his internal case for >80% doom is disjunctive and doesn’t route all through 1, 2, 3, and 4.
I don’t really know exactly what the disjuncts are, so this doesn’t really help, and I overall agree that MIRI does make “sweeping claims with high confidence”.
I think your summary is a good enough quick summary of my beliefs. The minutia that I object to is how confident and specific lots of parts of your summary are. I think many of the claims in the summary can be adjusted or completely changed and still lead to bad outcomes. But it’s hard to add lots of uncertainty and options to a quick summary, especially one you disagree with, so that’s fair enough.
(As a side note, that paper you linked isn’t intended to represent anyone’s views other than mine and Peter’s, and we are relatively inexperienced. I’m also no longer working at MIRI.)
I’m confused about why your <20% isn’t sufficient for you to want to shut down AI research. Is it because the benefits outweigh the risk, or because we’ll gain evidence about potential danger and can shut down later if necessary?
I’m also confused about why being able to generate practical insights about the nature of AI or AI progress is something that you think should necessarily follow from a model that predicts doom. I believe something close enough to (1) from your summary, but I don’t have much idea (above general knowledge) of how the first company to build such an agent will do so, or when they will work out how to do it. One doesn’t imply the other.
I think the expected benefits outweigh the risks, given that I care about the existing generation of humans (to a large, though not overwhelming, degree). The expected benefits here likely include (in my opinion) a large reduction in global mortality, a very large increase in the quality of life, a huge expansion in material well-being, and more generally a larger and more vibrant world earlier in time. Without AGI, I think most existing people would probably die and get replaced by the next generation of humans, in a relatively much poorer world (compared to the alternative).
I also think the absolute level of risk from AI barely decreases if we globally pause. My best guess is that pausing would mainly just delay adoption without significantly impacting safety. Under my model of AI, the primary risks are long-term, and will happen substantially after humans have already gradually “handed control” over to the AIs and retired their labor on a large scale. Most of these problems—such as cultural drift and evolution—do not seem to be the type of issue that can be satisfactorily solved in advance during a pause (especially by working out a mathematical theory of AI, or something like that).
On the level of analogy, I think of AI development as more similar to “handing off control to our children” than to “developing a technology that disempowers all humans at a discrete moment in time”. In general, I think the transition period to AI will be more diffuse and incremental than MIRI seems to imagine, and there won’t be a sharp distinction between “human values” and “AI values” either during or after that period.
(I also think AIs will probably be conscious in a way that’s morally important, in case that matters to you.)
In fact, I think it’s quite plausible the absolute level of AI risk would increase under a global pause, rather than going down, given the high level of centralization of power required to achieve a global pause, and the perverse institutions and cultural values that would likely arise under such a regime of strict controls. As a result, even if I weren’t concerned at all about the current generation of humans, and their welfare, I’d still be pretty hesitant to push pause on the entire technology.
(I think of technology as itself being pretty risky, but worth it. To me, pushing pause on AI is like pushing pause on technology itself, in the sense that they’re both generically risky yet simultaneously seem great on average. Yes, there are dangers ahead. But I think we can be careful and cautious without completely ripping up all the value for ourselves.)
Would most existing people accept a gamble with a 20% chance of death in the next 5 years and an 80% chance of life extension and radically better technology? I concede that many would, but I think it’s far from universal, and I wouldn’t be too surprised if half of people or more think this isn’t for them.
I personally wouldn’t want to take that gamble (strangely enough I’ve been quite happy lately and my life has been feeling meaningful, so the idea of dying in the next 5 years sucks).
(Also, I want to flag that I strongly disagree with your optimism.)
For what it’s worth, while my credence in human extinction from AI in the 21st century is 10-20%, I think the chance of human extinction in the next 5 years is much lower. I’d put that at around 1%. The main way I think AI could cause human extinction is by just generally accelerating technology and making the world a scarier and more dangerous place to live. I don’t really buy the model in which an AI will soon foom until it becomes a ~god.
I like this framing. I think the more common statement would be a 20% chance of death in 10-30 years, and an 80% chance of life extension and much better technology that they might not live to see.
I think the majority of humanity would actually take this bet. They are not utilitarians or longtermists.
So if the wager is framed in this way, we’re going full steam ahead.
I’ll say yet another time that your tech tree model doesn’t make sense to me. To get immortality/mind uploading, you need really overpowered tech, far above the level at which killing all humans and starting to disassemble the planet becomes negligibly cheap. So I wouldn’t expect that “existing people would probably die” is going to change much under your model of “AIs can be misaligned, but killing all humans is too costly”.
I don’t think that’s either a given or something we can ever know for sure. “Handing off” the world to robots and AIs that for all we know might be perfect P-zombies doesn’t feel like a good idea.
And why is such use of the empirical track record valid? Like, what’s the actual hypothesis here? What law of nature says “if technological progress hasn’t caused doom yet, it won’t cause it tomorrow”?
And the arguments against are based on concrete, empirically verifiable models instead of metaphors?
Doesn’t MIRI’s model predict some degree of the whole Shoggoth/actress thing in current systems? Seems verifiable.
I share your frustration with MIRI’s communications with the alignment community.
And, the tone of this comment smells to me of danger. It looks a little too much like strawmanning, which always also implies that anyone who believes this scenario must be, at least in this context, an idiot. Since even rationalists are human, this leads to arguments instead of clarity.
I’m sure this is an accident born of frustration, and the unclarity of the MIRI argument.
I think we should prioritize not creating a polarized doomer-vs-optimist split in the safety community. It is very easy to do, and it looks to me like that’s frequently how important movements get bogged down.
Since time is of the essence, this must not happen in AI safety.
We can all express our views, we just need to play nice and extend the benefit of the doubt. MIRI actually does this quite well, although they don’t convey their risk model clearly. Let’s follow their example in the first and not the second.
Edit: I wrote a shortform post about MIRI’s communication strategy, including how I think you’re getting their risk model importantly wrong.
Note that MIRI has made some intermediate predictions. For example, I’m fairly certain Eliezer predicted that AlphaGo would go 5 for 5 against Lee Sedol, and it didn’t. I would respect his intellectual honesty more if he’d registered the alleged difficulty of intermediate predictions before making them unsuccessfully.
I think MIRI has something valuable to contribute to alignment discussions, but I’d respect them more if they did a “5 Whys” type analysis on their poor prediction track record, so as to improve the accuracy of predictions going forwards. I’m not seeing any evidence of that. It seems more like the standard pattern where a public figure invests their ego in some position, then tries to avoid losing face.
On your (2), I think you’re ignoring an understanding-related asymmetry:
Without clear models describing (a path to) a solution, it is highly unlikely we have a workable solution to a deep and complex problem:
Absence of concrete [we have (a path to) a solution] is pretty strong evidence of absence.
[EDIT for clarity, by “we have” I mean “we know of”, not “there exists”; I’m not claiming there’s strong evidence that no path to a solution exists]
Whether or not we have clear models of a problem, it is entirely possible for it to exist and to kill us:
Absence of concrete [there-is-a-problem] evidence is weak evidence of absence.
A problem doesn’t have to wait until we have formal arguments or strong, concrete empirical evidence for its existence before killing us. To claim that it’s “premature” to shut down the field before we have [evidence of type x], you’d need to make a case that [doom before we have evidence of type x] is highly unlikely.
A large part of the MIRI case is that there is much we don’t understand, and that parts of the problem we don’t understand are likely to be hugely important. An evidential standard that greatly down-weights any but the most rigorous, legible evidence is liable to lead to death-by-sampling-bias.
Of course it remains desirable for MIRI arguments to be as legible and rigorous as possible. Empiricism would be nice too (e.g. if someone could come up with concrete problems whose solution would be significant evidence for understanding something important-according-to-MIRI about alignment).
But ignoring the asymmetry here is a serious problem.
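One way to spell out the asymmetry in Bayesian terms (a stylised illustration; all the numbers here are invented for the example):

$$\frac{P(\text{no legible evidence of the problem yet} \mid \text{doom-level problem exists})}{P(\text{no legible evidence of the problem yet} \mid \text{no such problem})} \approx \frac{0.8}{0.98} \approx 0.8,$$

$$\frac{P(\text{no clear model of a solution} \mid \text{we have a workable solution})}{P(\text{no clear model of a solution} \mid \text{we do not})} \approx \frac{0.1}{0.9} \approx 0.1.$$

If a real problem could easily fail to generate legible evidence before it bites, the likelihood ratio from “no evidence yet” is close to 1 and the update is weak; whereas a workable solution would almost certainly come with an articulable model, so the absence of one is a strong update.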
On your (3), it seems to me that you want “skeptical” to do more work than is reasonable. I agree that we “should be skeptical of purely theoretical arguments for doom”—but initial skepticism does not imply [do not update much on this]. It implies [consider this very carefully before updating]. It’s perfectly reasonable to be initially skeptical but to make large updates once convinced.
I do not think [the arguments are purely theoretical] is one of your true objections—rather it’s that you don’t find these particular theoretical arguments convincing. That’s fine, but no argument against theoretical arguments.
tl;dr: “lack of rigorous arguments for P is evidence against P” is typically valid, but not in the case of P = AI X-risk.
A high-level reaction to your point about unfalsifiability:
There seems to be a general sentiment that “AI X-risk arguments are unfalsifiable ==> the arguments are incorrect” and “AI X-risk arguments are unfalsifiable ==> AI X-risk is low”.[1] I am very sympathetic to this sentiment—but I also think that in the particular case of AI X-risk, it is not justified.[2] For quite non-obvious reasons.
Why do I believe this?
Take this simplified argument for AI X-risk:
(1) Some important future AIs will be goal-oriented, or will behave in a goal-oriented way sometimes[3]. (Read: If you think of them as trying to maximise some goal, you will make pretty good predictions.[4])
(2) The “AI-progress tech-tree” is such that discontinuous jumps in impact are possible. In particular, we will one day go from “an AI that is trying to maximise some goal, but not doing a very good job of it” to “an AI that is able to treat humans and other existing AIs as ‘environment’, and is going to do a very good job at maximising some goal”.
(3) For virtually any[5] goal specification, doing a sufficiently[6] good job at maximising that goal specification leads to an outcome where every human is dead.
FWIW, I think that having a strong opinion on (1) and (2), in either direction, is not justified.[7] But in this comment, I only want to focus on (3). So let’s please pretend, for the sake of this discussion, that we find (1) and (2) at least plausible. What I claim is that even if we lived in a universe where (3) is true, we should still expect even the best arguments for (3) (that we might realistically identify) to be unfalsifiable—at least given realistic constraints on falsification effort and assuming that we use rigorous standards for what counts as solid evidence, like people do in mathematics, physics, or CS.
What is my argument for “even the best arguments for (3) will be unfalsifiable”?
Suppose you have an environment E that contains a Cartesian agent (a thing that takes actions in the environment and—let’s assume for simplicity—has perfect information about the environment, but whose decision-making computation happens outside of the environment). And suppose that this agent acts in a way that maximises[8] some goal specification[9] over E. Now, E might or might not contain humans, or representations of humans. We can now ask the following question: Is it true that, unless we spend an extremely high amount of effort (eg, >5 civilisation-years), any (non-degenerate[10]) goal specification we come up with will result in human extinction[11] in E when maximised by the agent? I refer to this as “Extinction-level Goodhart’s Law”.
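A rough symbolic restatement of that question (one possible way to write it; the symbols $\mathcal{G}_B$, $\mathcal{D}$, and $\text{Extinct}$ are introduced here only for illustration, and it glosses over the caveats the footnotes discuss): writing $\pi^*_R \in \arg\max_{\pi} \mathbb{E}_{\pi}\big[\sum_{t=0}^{T} R(s_t)\big]$ for the agent that does a very good job of maximising goal specification $R$ over a long horizon $T$, the Law asks whether

$$\forall R \in \mathcal{G}_B(E) \setminus \mathcal{D}: \quad \text{Extinct}\big(E, \pi^*_R\big),$$

where $\mathcal{G}_B(E)$ is the set of goal specifications we could produce for $E$ within some effort budget $B$ (eg, 5 civilisation-years), $\mathcal{D}$ is the set of degenerate specifications (constant rewards, “shut down now” rewards, and similar), and $\text{Extinct}$ says that the (possibly metaphorical) humans in $E$ end up dead.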
I claim that:
(A) Extinction-level Goodhart’s Law plausibly holds in the real world. (At least the thought experiments I know of, eg here or here, suggest it does.)
(B) Even if Extinction-level Goodhart’s Law were true in the real world, it would still be false in environments where we could verify it experimentally (today, or soon) or mathematically (by proofs, given realistic amounts of effort).
==> And (B) implies that if we want “solid arguments”, rather than just thought experiments, we might be kinda screwed when it comes to Extinction-level Goodhart’s Law.
And why do I believe (B)? The long story is that I try to gesture at this in my sequence on “Formalising Catastrophic Goodhart”. The short story is that there are many strategies for finding “safe to optimise” goal specifications that work in simpler environments, but not in the real world (examples below). So to even start gaining evidence on whether the law holds in our world, we need to investigate environments where those simpler strategies don’t work—and it seems to me that those are always too complex for us to analyse mathematically or to run an AI there which could “do a sufficiently good job at trying to maximise the goal specification”.
Some examples of the above-mentioned strategies for finding safe-to-optimise goal specifications:
(i) The environment contains no (representations of) humans, or those “humans” can’t “die”, so it doesn’t matter. EG, most gridworlds.
(ii) The environment doesn’t have any resources or similar things that would give rise to convergent instrumental goals, so it doesn’t matter. EG, most gridworlds.
(iii) The environment allows for a simple formula that checks whether “humans” are “extinct”, so just add a huge penalty if that formula holds. (EG, most gridworlds where you added “humans”.)
(iv) There is a limited set of actions that result in “killing” “humans”, so just add a huge penalty to those.
(v) There is a simple formula for expressing a criterion that limits the agent’s impact. (EG, “don’t go past these coordinates” in a gridworld.)
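For concreteness, here is a minimal sketch of strategy (iii) (a toy example, not code from the linked sequence; the cells, penalty size, and state encoding are all invented): in a gridworld you can write down an explicit “extinction” predicate and bolt a huge penalty onto the reward, a patch with no obvious real-world analogue.

```python
# Minimal sketch of strategy (iii): a gridworld reward with an explicit
# "humans are extinct" check and a huge penalty attached to it.
# (Toy example; nothing here comes from the linked sequence.)

GOAL_CELL = (2, 2)           # cell the agent is nominally rewarded for reaching
EXTINCTION_PENALTY = 1e9     # the "huge penalty" from strategy (iii)

def humans_extinct(state) -> bool:
    """A state is (agent_position, frozenset_of_alive_'human'_cells)."""
    _, alive_humans = state
    return len(alive_humans) == 0   # the 'simple formula' that the real world doesn't offer

def reward(state) -> float:
    agent_pos, _ = state
    r = 1.0 if agent_pos == GOAL_CELL else 0.0
    if humans_extinct(state):
        r -= EXTINCTION_PENALTY
    return r

print(reward(((2, 2), frozenset({(0, 0)}))))  # 1.0: goal reached, "humans" alive
print(reward(((2, 2), frozenset())))          # -999999999.0: goal reached, "humans" gone
```

Patches like (i)-(v) are what make simple environments easy to make safe, and the comment’s point is that no analogous patch seems to be available for the real world.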
Altogether, this should explain why the “unfalsifiability” counter-argument does not hold as much weight, in the case of AI X-risk, as one might intuitively expect.
If I understand you correctly, you would endorse something like this? Quite possibly with some disclaimers, ofc. (Certainly I feel that many other people endorse something like this.)
I acknowledge that the general heuristic “argument for X is unfalsifiable ==> the argument is wrong” holds in most cases. And I am aware we should be sceptical whenever somebody goes “but my case is an exception!”. Despite this, I still believe that AI X-risk genuinely is different from invisible dragons in your garage and conspiracy theories.
That said, I feel there should be a bunch of other examples where the heuristic doesn’t apply. If you have some that are good, please share!
An example of this would be if GPT-4 acted like a chatbot most of the time, but tried to take over the world if you prompt it with “act as a paperclipper”.
And this way of thinking about them is easier—description length, etc.—than other options. EG, no “water bottles maximising being a water bottle”.
By “virtually any” goal specification (leading to extinction when maximised), I mean that finding a goal specification for which extinction does not happen (when maximised) is extremely difficult. One example of operationalising “extremely difficult” would be “if our civilisation spent all its efforts on trying to find some goal specification, for 5 years from today, we would still fail”. In particular, the claim (3) is meant to imply that if you do anything like “do RLHF for a year, then optimise the result extremely hard”, then everybody dies.
For the purposes of this simplified AI X-risk argument, the AIs from (2), which are “very good at maximising a goal”, are meant to qualify for the “sufficiently good job at maximising a goal” from (3). In practice, this is of course more complicated—see e.g. my post on Weak vs Quantitative Extinction-level Goodhart’s Law.
Or at least there are no publicly available writings, known to me, which could justify claims like “It’s >=80% likely that (1) (or (2)) holds (or doesn’t hold)”. Of course, (1) and (2) are too vague for this to even make sense, but imagine replacing (1) and (2) by more serious attempts at operationalising the ideas that they gesture at.
(or does a sufficiently good job of maximising)
Most reasonable ways of defining what “goal specification” means should work for the argument. As a simple example, we can think of having a reward function R : states → ℝ and maximising the sum of R(s) over any long time horizon.
To be clear, there are some trivial ways of avoiding Extinction-level Goodhart’s Law. One is to consider a constant utility function, which means that the agent might as well take random actions. Another would be to use reward functions in the spirit of “shut down now, or get a huge penalty”. And there might be other weird edge cases.
I acknowledge that this part should be better developed. But in the meantime, hopefully it is clear—at least somewhat—what I am trying to gesture at.
Most environments won’t contain actual humans. So by “human extinction”, I mean the “metaphorical humans being metaphorically dead”. EG, if your environment was pacman, then the natural thing would be to view the pacman as representing a “human”, and being eaten by the ghosts as representing “extinction”. (Not that this would be a good model for studying X-risk.)
An illustrative example, describing a scenario that is similar to our world, but where “Extinction-level Goodhart’s law” would be false & falsifiable (hat tip Vincent Conitzer):
Suppose that we somehow only start working on AGI many years from now, after we have already discovered a way to colonize the universe at close to the speed of light. And some of the colonies are already unreachable, outside of our future lightcone. But suppose we still understand “humanity” as the collection of all humans, including those in the unreachable colonies. Then any AI that we build, no matter how smart, would be unable to harm these portions of humanity. And thus full-blown human extinction, from AI we build here on Earth, would be impossible. And you could “prove” this using a simple, yet quite rigorous, physics argument.[1]
(To be clear, I am not saying that “AI X-risk’s unfalsifiability is justifiable ==> we should update in favour of AI X-risk compared to our priors”. I am just saying that the justifiability means we should not update against it compared to our priors. Though I guess that in practice, it means that some people should undo some of their updates against AI X-risk… )
And sure, maybe some weird magic is actually possible, and the AI could actually beat the speed of light. But whatever, I am ignoring this, and an argument like this would count as falsification as far as I am concerned.
FWIW, I acknowledge that my presentation of the argument isn’t ironclad, but I hope that it makes my position a bit clearer. If anybody has ideas for how to present it better, or has some nice illustrative examples, I would be extremely grateful.