Daniel Kokotajlo comments on MIRI 2024 Communications Strategy

Daniel Kokotajlo 31 May 2024 18:56 UTC
23 points
5
I think you are abusing/misusing the concept of falsifiability here. Ditto for empiricism. You aren’t the only one to do this, I’ve seen it happen a lot over the years and it’s very frustrating. I unfortunately am busy right now but would love to give a fuller response someday, especially if you are genuinely interested to hear what I have to say (which I doubt, given your attitude towards MIRI).
- Matthew Barnett 31 May 2024 19:02 UTC
  56 points
  45
  Parent
  I unfortunately am busy right now but would love to give a fuller response someday, especially if you are genuinely interested to hear what I have to say (which I doubt, given your attitude towards MIRI).
  I’m a bit surprised you suspect I wouldn’t be interested in hearing what you have to say?
  I think the amount of time I’ve spent engaging with MIRI perspectives over the years provides strong evidence that I’m interested in hearing opposing perspectives on this issue. I’d guess I’ve engaged with MIRI perspectives vastly more than almost everyone on Earth who explicitly disagrees with them as strongly as I do (although obviously some people like Paul Christiano and other AI safety researchers have engaged with them even more than me).
  (I might not reply to you, but that’s definitely not because I wouldn’t be interested in what you have to say. I read virtually every comment-reply to me carefully, even if I don’t end up replying.)
  - Daniel Kokotajlo 31 May 2024 20:26 UTC
    66 points
    8
    Parent
    I apologize, I shouldn’t have said that parenthetical.
  - Eli Tyre 20 Jun 2024 4:18 UTC
    24 points
    9
    Parent
    I want to publicly endorse and express appreciation for Matthew’s apparent good faith.
    
    Every time I’ve ever seen him disagreeing about AI stuff on the internet (a clear majority of the times I’ve encountered anything he’s written), he’s always been polite, reasonable, thoughtful, and extremely patient. Obviously conversations sometimes entail people talking past each other, but I’ve seen him carefully try to avoid miscommunication, and (to my ability to judge) strawmanning.
    
    Thank you Mathew. Keep it up. : )
  - Daniel Kokotajlo 1 Jun 2024 12:08 UTC
    17 points
    2
    Parent
    Here’s a new approach: Your list of points 1 − 7. Would you also make those claims about me? (i.e. replace references to MIRI with references to Daniel Kokotajlo.)
    - Matthew Barnett 1 Jun 2024 20:16 UTC
      33 points
      24
      Parent
      You’ve made detailed predictions about what you expect in the next several years, on numerous occasions, and made several good-faith attempts to elucidate your models of AI concretely. There are many ways we disagree, and many ways I could characterize your views, but “unfalsifiable” is not a label I would tend to use for your opinions on AI. I do not mentally lump you together with MIRI in any strong sense.
      - Daniel Kokotajlo 2 Jun 2024 12:30 UTC
        39 points
        17
        Parent
        OK, glad to hear. And thank you. :) Well, you’ll be interested to know that I think of my views on AGI as being similar to MIRI’s, just less extreme in various dimensions. For example I don’t think literally killing everyone is the most likely outcome, but I think it’s a very plausible outcome. I also don’t expect the ‘sharp left turn’ to be particularly sharp, such that I don’t think it’s a particularly useful concept. I also think I’ve learned a lot from engaging with MIRI and while I have plenty of criticisms of them (e.g. I think some of them are arrogant and perhaps even dogmatic) I think they have been more epistemically virtuous than the average participant in the AGI risk conversation, even the average ‘serious’ or ‘elite’ participant.
        _will_ 30 Oct 2024 4:08 UTC
        1 point
        0
        Parent
        I don’t think [AGI/ASI] literally killing everyone is the most likely outcome
        Huh, I was surprised to read this. I’ve imbibed a non-trivial fraction of your posts and comments here on LessWrong, and, before reading the above, my shoulder Daniel definitely saw extinction as the most likely existential catastrophe.
        If you have the time, I’d be very interested to hear what you do think is the most likely outcome. (It’s very possible that you have written about this before and I missed it—my bad, if so.)
        habryka 30 Oct 2024 4:15 UTC
        7 points
        2
        Parent
        (My model of Daniel thinks the AI will likely take over, but probably will give humanity some very small fraction of the universe, for a mixture of “caring a tiny bit” and game-theoretic reasons)
        _will_ 30 Oct 2024 4:40 UTC
        3 points
        2
        Parent
        Thanks, that’s helpful!
        (Fwiw, I don’t find the ‘caring a tiny bit’ story very reassuring, for the same reasons as Wei Dai, although I do find the acausal trade story for why humans might be left with Earth somewhat heartening. (I’m assuming that by ‘game-theoretic reasons’ you mean acausal trade.))
        Daniel Kokotajlo 30 Oct 2024 16:59 UTC
        3 points
        0
        Parent
        Yep, Habryka is right. Also, I agree with Wei Dai re: reassuringness. I think literal extinction is <50% likely, but this is cold comfort given the badness of some of the plausible alternatives, and overall I think the probability of something comparably bad happening is >50%.
- Daniel Kokotajlo 18 Jun 2024 14:22 UTC
  45 points
  13
  Parent
  Followup: Matthew and I ended up talking about it in person. tl;dr of my position is that
  
  Falsifiability is a symmetric two-place relation; one cannot say “X is unfalsifiable,” except as shorthand for saying “X and Y make the same predictions,” and thus Y is equally unfalsifiable. When someone is going around saying “X is unfalsifiable, therefore not-X,” that’s often a misuse of the concept—what they should say instead is “On priors / for other reasons (e.g. deference) I prefer not-X to X; and since both theories make the same predictions, I expect to continue thinking this instead of updating, since there won’t be anything to update on
  
  .What is the point of falsifiability-talk then? Well, first of all, it’s quite important to track when two theories make the same predictions, or the same-predictions-till-time-T. It’s an important part of the bigger project of extracting predictions from theories so they can be tested. It’s exciting progress when you discover that two theories make different predictions, and nail it down well enough to bet on. Secondly, it’s quite important to track when people are making this worse rather than easier—e.g. fortunetellers and pundits will often go out of their way to avoid making any predictions that diverge from what their interlocutors already would predict. Whereas the best scientists/thinkers/forecasters, the ones you should defer to, should be actively trying to find alpha and then exploit it by making bets with people around them. So falsifiability-talk is useful for evaluating people as epistemically virtuous or vicious. But note that if this is what you are doing, it’s all a relative thing in a different way—in the case of MIRI, for example, the question should be “Should I defer to them more, or less, than various alternative thinkers A B and C? --> Are they generally more virtuous about making specific predictions, seeking to make bets with their interlocutors, etc. than A B or C?”
  
  So with that as context, I’d say that (a) It’s just wrong to say ‘MIRI’s theories of doom are unfalsifiable.’ Instead say ‘unfortunately for us (not for the plausibility of the theories), both MIRI’s doom theories and (insert your favorite non-doom theories here) make the same predictions until it’s basically too late.’ (b) One should then look at MIRI and be suspicious and think ‘are they systematically avoiding making bets, making specific predictions, etc. relative to the other people we could defer to? Are they playing the sneaky fortuneteller or pundit’s game?’ to which I think the answer is ‘no not at all, they are actually more epistemically virtuous in this regard than the average intellectual. That said, they aren’t the best either—some other people in the AI risk community seem to be doing better than them in this regard, and deserve more virtue points (and possibly deference points) therefore.’ E.g. I think both Matthew and I have more concrete forecasting track records than Yudkowsky?
  - Martin Randall 19 Jan 2025 1:29 UTC
    7 points
    1
    Parent
    Falsifiability is not symmetric. Consider two theories:
    
    Theory X: Jesus will come again.
    Theory Y: Jesus will not come again.
    
    If Jesus comes again tomorrow, this falsifies theory Y and confirms theory X. If Jesus does not come again tomorrow, neither theory is falsified or confirmed. So we can say that X is unfalsifiable (with respect to a finite time frame) and Y is falsifiable.
    
    Another example:
    
    Theory X: blah blah and therefore the sky is green
    Theory Y: blah blah and therefore the sky is not green
    Theory Z: blah blah and therefore the sky could be green or not green.
    
    Here, theory X and Y are falsifiable with respect to the color of the sky and theory Z is not.
    - Daniel Kokotajlo 19 Jan 2025 5:33 UTC
      2 points
      0
      Parent
      Here’s how I’d deal with those examples:
      
      Theory X: Jesus will come again: Presumably this theory assigns some probability mass >0 to observing Jesus tomorrow, whereas theory Y assigns ~0. If jesus is not observed tomorrow, that’s a small amount of evidence for theory Y and a small amount of evidence against theory X. So you can say that theory X has been partially falsified. Repeat this enough times, and then you can say theory X has been fully falsified, or close enough. (Your credence in theory X will never drop to 0 probably, but that’s fine, that’s also true of all sorts of physical theories in good standing e.g. all the major theories of cosmology and cognitive science, which allow for tiny probabilities of arbitrary sequences of experiences happening in e.g. Boltzmann Brains)
      
      With the sky color example:
      My way of thinking about falsifiability is, we say two theories are falsifiable relative to each other if there is evidence we expect to encounter that will distinguish them / cause us to shift our relative credence in them.
      
      In the case of Theory Z, there is an implicit theory Z2 which is “NOT blah blah, and therefore the sky could be green or not green.” (Presumably that’s what you are holding in the back of your mind as the alternative to Z, when you imagine updating for or against Z on the basis of seeing blue sky, and decide that you wouldn’t?) Because the theory Z3 “NOT blah blah and therefore the sky is blue” would be confirmed by seeing a blue sky, and if somehow you were splitting your credence between Z and Z3, then you would decrease your credence in Z if you saw a blue sky.
      - Martin Randall 19 Jan 2025 18:14 UTC
        7 points
        1
        Parent
        Thanks for explaining. I think we have a definition dispute. Wikipedia:Falsifiability has:
        
        A theory or hypothesis is falsifiable if it can be logically contradicted by an empirical test.
        
        Whereas your definition is:
        
        Falsifiability is a symmetric two-place relation; one cannot say “X is unfalsifiable,” except as shorthand for saying “X and Y make the same predictions,” and thus Y is equally unfalsifiable.
        
        In one of the examples I gave earlier:
        
        Theory X: blah blah and therefore the sky is green
        Theory Y: blah blah and therefore the sky is not green
        Theory Z: blah blah and therefore the sky could be green or not green.
        
        None of X, Y, or Z are Unfalsifiable-Daniel with respect to each other, because they all make different predictions. However, X and Y are Falsifiable-Wikipedia, whereas Z is Unfalsifiable-Wikipedia.
        
        I prefer the Wikipedia definition. To say that two theories produce exactly the same predictions, I would instead say they are indistinguishable, similar to this Phyiscs StackExchange: Are different interpretations of quantum mechanics empirically distinguishable?.
        
        In the ancestor post, Barnett writes:
        
        MIRI researchers rarely provide any novel predictions about what will happen before AI doom, making their theories of doom appear unfalsifiable.
        
        Barnett is using something like the Wikipedia definition of falsifiability here. It’s unfair to accuse him of abusing or misusing the concept when he’s using it in a very standard way.
        Daniel Kokotajlo 20 Jan 2025 15:56 UTC
        7 points
        4
        Parent
        Very good point.
        So, by the Wikipedia definition, it seems that all the mainstream theories of cosmology are unfalsifiable, because they allow for tiny probabilities of boltmann brains etc. with arbitrary experiences. There is literally nothing you could observe that would rule them out / logically contradict them.
        
        Also, in practice, it’s extremely rare for a theory to be ruled out or even close-to-ruled out from any single observation or experiment. Instead, evidence accumulates in a bunch of minor and medium-sized updates.
        Martin Randall 21 Jan 2025 3:08 UTC
        4 points
        0
        Parent
        I think cosmology theories have to be phrased as including background assumptions like “I am not a Boltzmann brain” and “this is not a simulation” and such. Compare Acknowledging Background Information with P(Q|I) for example. Given that, they are Falsifiable-Wikipedia.
        
        I view Falsifiable-Wikipedia in a similar way to Occam’s Razor. The true epistemology has a simplicity prior, and Occam’s Razor is a shadow of that. The true epistemology considers “empirical vulnerability” / “experimental risk” to be positive. Possibly because it falls out of Bayesian updates, possibly because they are “big if true”, possibly for other reasons. Falsifiability is a shadow of that.
        
        In that context, if a hypothesis makes no novel predictions, and the predictions it makes are a superset of the predictions of other hypotheses, it’s less empirically vulnerable, and in some relative sense “unfalsifiable”, compared to those other hypotheses.
        Noosphere89 21 Jan 2025 15:21 UTC
        2 points
        2
        Parent
        
        “this is not a simulation”
        
        I personally wouldn’t include it, because essentially everything (given a powerful enough model of computation) could be simulated, and this is why the simulation hypothesis is so bad in casual discourse: It explains everything, which means it explains nothing that is specific to our universe:
        
        https://arxiv.org/abs/1806.08747
        Daniel Kokotajlo 20 Jan 2025 16:58 UTC
        4 points
        2
        Parent
        Also note that Barnett said “any novel predictions” which is not part of the wikipedia definition of falsifiability right? The wikipedia definition doesn’t make reference to an existing community of scientists who already made predictions, such that a new hypothesis can be said to have made novel vs. non-novel predictions.
        
        Daniel Kokotajlo 20 Jan 2025 17:00 UTC
        6 points
        4
        Parent
        I totally agree btw that it matters sociologically who is making novel predictions and who is sticking with the crowd. And I do in fact ding MIRI points for this relative to some other groups. However I think relative to most elite opinion-formers on AGI matters, MIRI performs better than average on this metric.
        
        But note that this ‘novel predictions’ metric is about people/institutions, not about hypotheses.
        Noosphere89 21 Jan 2025 15:40 UTC
        3 points
        0
        Parent
        
        However I think relative to most elite opinion-formers on AGI matters, MIRI performs better than average on this metric.
        
        Agree with this, with the caveat that I think all of their rightness relative to others fundamentally was in believing that short timelines were plausible enough, combined with believing in AI being the most major force of the 21st century by far, compared to other technologies, and basically a lot of their other specific predictions are likely to be pretty wrong.
        
        I like this comment here about a useful comparison point to MIRI, where physicists were right about the higgs boson existing, but wrong on the theories like supersymmetry where people expected the higgs mass to be naturally stabilized, and assuming supersymmetry is correct for our universe, the theory cannot stabilize the mass of the higgs, or solve the hierarchy problem:
        
        https://www.lesswrong.com/posts/ZLAnH5epD8TmotZHj/you-can-in-fact-bamboozle-an-unaligned-ai-into-sparing-your#Ha9hfFHzJQn68Zuhq
        Daniel Kokotajlo 21 Jan 2025 17:53 UTC
        6 points
        2
        Parent
        I think I agree with this—but do you see how it makes me frustrated to hear people dunk on MIRI’s doomy views as unfalsifiable? Here’s what happened in a nutshell:
        
        MIRI: “AGI is coming and it will kill everyone.”
        Everyone else: “AGI is not coming and if it did it wouldn’t kill everyone.”
        time passes, evidence accumulates...
        Everyone else: “OK, AGI is coming, but it won’t kill everyone”
        Everyone else: “Also, the hypothesis that it won’t kill everyone is unfalsifiable so we shouldn’t believe it.”
        Noosphere89 21 Jan 2025 18:18 UTC
        6 points
        0
        Parent
        Yeah, I think this is actually a problem I see here, though admittedly I often see the hypotheses be vaguely formulated, and I kind of agree with Jotto999 that the verbal forecasts give far too much room for leeway here:
        
        I like Eli Tyre’s comment here:
        
        https://www.lesswrong.com/posts/ZEgQGAjQm5rTAnGuM/beware-boasting-about-non-existent-forecasting-track-records#Dv7aTjGXEZh6ALmZn
        Martin Randall 21 Jan 2025 19:42 UTC
        2 points
        0
        Parent
        I like that metric, but the metric I’m discussing is more:
        
        Are they proposing clear hypotheses?
        Do their hypotheses make novel testable predictions?
        Are they making those predictions explicit?
        
        So for example, looking at MIRI’s very first blog post in 2007: The Power of Intelligence. I used the first just to avoid cherry-picking.
        
        Hypothesis: intelligence is powerful. (yes it is)
        
        This hypothesis is a necessary precondition for what we’re calling “MIRI doom theory” here. If intelligence is weak then AI is weak and we are not doomed by AI.
        
        Predictions that I extract:
        
        An AI can do interesting things over the Internet without a robot body.
        An AI can get money.
        An AI can be charismatic.
        An AI can send a ship to Mars.
        An AI can invent a grand unified theory of physics.
        An AI can prove the Riemann Hypothesis.
        An AI can cure obesity, cancer, aging, and stupidity.
        
        Not a novel hypothesis, nor novel predictions, but also not widely accepted in 2007. As predictions they have aged very well, but they were unfalsifiable. If 2025 Claude had no charisma, it would not falsify the prediction that an AI can be charismatic.
        
        I don’t mean to ding MIRI any points here, relative or otherwise, it’s just one blog post, I don’t claim it supports Barnett’s complaint by itself. I mostly joined the thread to defend the concept of asymmetric falsifiability.
        Noosphere89 21 Jan 2025 15:26 UTC
        2 points
        2
        Parent
        Martin Randall extracted the practical consequences of this here:
        
        In that context, if a hypothesis makes no novel predictions, and the predictions it makes are a superset of the predictions of other hypotheses, it’s less empirically vulnerable, and in some relative sense “unfalsifiable”, compared to those other hypotheses.