Rob Bensinger comments on Discussion with Eliezer Yudkowsky on AGI interventions

Rob Bensinger 12 Nov 2021 19:06 UTC
LW: 81 AF: 32
AF
Thanks for naming specific work you think is really good! I think it’s pretty important here to focus on the object-level. Even if you think the goodness of these particular research directions isn’t cruxy (because there’s a huge list of other things you find promising, and your view is mainly about the list as a whole rather than about any particular items on it), I still think it’s super important for us to focus on object-level examples, since this will probably help draw out what the generators for the disagreement are.
John Wentworth’s Natural Abstraction Hypothesis, which is about checking his formalism-backed intuition that NNs actually learn similar abstractions that humans do. The success story is pretty obvious, in that if John is right, alignment should be far easier.
Eliezer liked this post enough that he asked me to signal-boost it in the MIRI Newsletter back in April.
And Paul Christiano and Stuart Armstrong are two of the people Eliezer named as doing very-unusually good work. We continue to pay Stuart to support his research, though he’s mainly supported by FHI.
And Evan works at MIRI, which provides some Bayesian evidence about how much we tend to like his stuff. :)
So maybe there’s not much disagreement here about what’s relatively good? (Or maybe you’re deliberately picking examples you think should be ‘easy sells’ to Steel Eliezer.)
The main disagreement, of course, is about how absolutely promising this kind of stuff is, not how relatively promising it is. This could be some of the best stuff out there, but my understanding of the Adam/Eliezer disagreement is that it’s about ‘how much does this move the dial on actually saving the world?’ / ‘how much would we move the dial if we just kept doing more stuff like this?’.
Actually, this feels to me like a thing that your comments have bounced off of a bit. From my perspective, Eliezer’s statement was mostly saying ‘the field as a whole is failing at our mission of preventing human extinction; I can name a few tiny tidbits of relatively cool things (not just MIRI stuff, but Olah and Christiano), but the important thing is that in absolute terms the whole thing is not getting us to the world where we actually align the first AGI systems’.
My Eliezer-model thinks nothing (including MIRI stuff) has moved the dial much, relative to the size of the challenge. But your comments have mostly been about a sort of status competition between decision theory stuff and ML stuff, between prosaic stuff and ‘gain new insights into intelligence’ stuff, between MIRI stuff and non-MIRI stuff, etc. This feels to me like it’s ignoring the big central point (‘our work so far is wildly insufficient’) in order to haggle over the exact ordering of the wildly-insufficient things.
You’re zeroed in on the “vast desert” part, but the central point wasn’t about the desert-oasis contrast, it was that the whole thing is (on Eliezer’s model) inadequate to the task at hand. Likewise, you’re talking a lot about the “fake” part (and misstating Eliezer’s view as “everyone else [is] a faker”), when the actual claim was about “work that seems to me to be mostly fake or pointless or predictable” (emphasis added).
Maybe to you these feel similar, because they’re all just different put-downs. But… if those were true descriptions of things about the field, they would imply very different things.
I would like to put forward that Eliezer thinks, in good faith, that this is the best hypothesis that fits the data. I absolutely think reasonable people can disagree with Eliezer on this, and I don’t think we need to posit any bad faith or personality failings to explain why people would disagree.
What links here?
- A positive case for how we might succeed at prosaic AI alignment by evhub (16 Nov 2021 1:49 UTC; 81 points)
- Richard_Ngo's comment on Discussion with Eliezer Yudkowsky on AGI interventions by Rob Bensinger (15 Nov 2021 12:13 UTC; 19 points)
- Rob Bensinger 12 Nov 2021 19:13 UTC
  LW: 85 AF: 35
  AF Parent
  Also, I feel like I want to emphasize that, like… it’s OK to believe that the field you’re working in is in a bad state? The social pressure against saying that kind of thing (or even thinking it to yourself) is part of why a lot of scientific fields are unhealthy, IMO. I’m in favor of you not taking for granted that Eliezer’s right, and pushing back insofar as your models disagree with his. But I want to advocate against:
  - Saying false things about what the other person is saying. A lot of what you’ve said about Eliezer and MIRI is just obviously false (e.g., we have contempt for “experimental work” and think you can’t make progress by “Actually working with AIs and Thinking about real AIs”).
  - Shrinking the window of ‘socially acceptable things to say about the field as a whole’ (as opposed to unsolicited harsh put-downs of a particular researcher’s work, where I see more value in being cautious).
  I want to advocate ‘smack-talking the field is fine, if that’s your honest view; and pushing back is fine, if you disagree with the view’. I want to see more pushing back on the object level (insofar as people disagree), and less ‘how dare you say that, do you think you’re the king of alignment or something’ or ‘saying that will have bad social consequences’.
  I think you’re picking up on a real thing of ‘a lot of people are too deferential to various community leaders, when they should be doing more model-building, asking questions, pushing back where they disagree, etc.’ But I think the solution is to shift more of the conversation to object-level argument (that is, modeling the desired behavior), and make that argument as high-quality as possible.
  - adamShimi 12 Nov 2021 23:00 UTC
    LW: 8 AF: 7
    AF Parent
    Thanks for your great comments!
    One thing I want to make clear is that I’m quite aware that my comments have not been as high-quality as they should have been. As I wrote in the disclaimer, I was writing from a place of frustration and annoyance, which also implies a focus on more status-y thing. That sounded necessary to me to air out this frustration, and I think this was a good idea given the upvotes of my original post and the couple of people who messaged me to tell me that they were also annoyed.
    That being said, much of what I was railing against is a general perception of the situation, from reading a lot of stuff but not necessarily stopping to study all the evidence before writing a fully though-through opinion. I think this is where the “saying obviously false things” comes from (which I think are pretty easy to believe from just reading this post and a bunch of MIRI write-ups), and why your comments are really important to clarify the discrepancy between this general mental picture I was drawing from and the actual reality. Also recentering the discussion on the object-level instead of on status arguments sounds like a good move.
    You make a lot of good points and I definitely want to continue the conversation and have more detailed discussion, but I also feel that for the moment I need to take some steps back, read your comments and some of the pointers in other comments, and think a bit more about the question. I don’t think there’s much more to gain from me answering quickly, mostly in reaction.
    (I also had the brilliant idea of starting this thread just when I was on the edge of burning out from working too much (during my holidays), so I’m just going to take some time off from work. But I definitely want to continue this conversation further when I come back, although probably not in this thread ^^)
    - Rob Bensinger 13 Nov 2021 4:04 UTC
      LW: 53 AF: 23
      AF Parent
      Enjoy your rest! :)
      That sounded necessary to me to air out this frustration, and I think this was a good idea given the upvotes of my original post and the couple of people who messaged me to tell me that they were also annoyed.
      If you’d just aired out your frustration, framing claims about others in NVC-like ‘I feel like...’ terms (insofar as you suspect you wouldn’t reflectively endorse them), and then a bunch of people messaged you in private to say “thank you! you captured my feelings really well”, then that would seem clearly great to me.
      I’m a bit worried that what instead happened is that you made a bunch of clearly-false claims about other people and gave a bunch of invalid arguments, mixed in with the feelings-stuff; and you used the content warning at the top of the message to avoid having to distinguish which parts of your long, detailed comment are endorsed or not (rather than also flagging this within the comment); and then you also ran with this in a bunch of follow-up comments that were similarly not-endorsed but didn’t even have the top-of-comment disclaimer. So that I could imagine some people who also aren’t independently familiar with all the background facts, could come away with a lot of wrong beliefs about the people you’re criticizing.
      ‘Other people liked my comment, so it was clearly a good thing’ doesn’t distinguish between the worlds where they like it because they share the feelings, vs. agreeing with the factual claims and arguments (and if the latter, whether they’re noticing and filtering out all the seriously false or not-locally-valid parts). If the former, I think it was good. If the latter, I think it was bad.
      (By default I’d assume it’s some mix.)
      - adamShimi 13 Nov 2021 21:35 UTC
        LW: 10 AF: 7
        AF Parent
        I’m a bit worried that what instead happened is that you made a bunch of clearly-false claims about other people and gave a bunch of invalid arguments, mixed in with the feelings-stuff; and you used the content warning at the top of the message to avoid having to distinguish which parts of your long, detailed comment are endorsed or not (rather than also flagging this within the comment); and then you also ran with this in a bunch of follow-up comments that were similarly not-endorsed but didn’t even have the top-of-comment disclaimer. So that I could imagine some people who also aren’t independently familiar with all the background facts, could come away with a lot of wrong beliefs about the people you’re criticizing.
        That sounds a bit unfair, in the sense that it makes it look like I just invented stuff I didn’t believe and ran with it. When what actually happen was that I wrote about my frustrations, but made the mistake of stating them as obvious facts instead of impressions.
        Of course, I imagine you feel that my portrayal of EY and MIRI was also unfair, sorry about that.
        (I added a note to the three most ranty comments on this thread saying that people should mentally add “I feel like...” to judgments in them.)
        Rob Bensinger 13 Nov 2021 22:07 UTC
        LW: 18 AF: 9
        AF Parent
        Thanks for adding the note! :)
        I’m confused. When I say ‘that’s just my impression’, I mean something like ‘that’s an inside-view belief that I endorse but haven’t carefully vetted’. (See, e.g., Impression Track Records, referring to Naming Beliefs.)
        Example: you said that MIRI has “contempt with experimental work and not doing only decision theory and logic”.
        My prior guess would have been that you don’t actually, for-real believe that—that it’s not your ‘impression’ in the above sense, more like ‘unendorsed venting/hyperbole that has a more complicated relation to something you really believe’.
        If you do (or did) think that’s actually true, then our models of MIRI are much more different than I thought! Alternatively, if you agree this is not true, then that’s all I meant in the previous comment. (Sorry if I was unclear about that.)
        adamShimi 13 Nov 2021 23:17 UTC
        LW: 4 AF: 3
        AF Parent
        I would say that with slight caveats (make “decision theory and logic” a bit larger to include some more mathy stuff and make “all experimental work” a bit smaller to not includes Redwood’s work), this was indeed my model.
        
        What made me update from our discussion is the realization that I interpreted the dismissal of basically all alignment research as “this has no value whatsoever and people doing it are just pretending to care on alignment”, where it should have been interpreted as something like “this is potentially interesting/new/exciting, but it doesn’t look like it brings us closer to solving alignment in a significant way, hence we’re still failing”.
        Rob Bensinger 13 Nov 2021 23:59 UTC
        LW: 4 AF: 1
        AF Parent
        ‘Experimental work is categorically bad, but Redwood’s work doesn’t count’ does not sound like a “slight caveat” to me! What does this generalization mean at all if Redwood’s stuff doesn’t count?
        (Neither, for that matter, does the difference between ‘decision theory and logic’ and ‘all mathy stuff MIRI has ever focused on’ seem like a ‘slight caveat’ to me—but in that case maybe it’s because I have a lot more non-logic, non-decision-theory examples in my mind that you might not be familiar with, since it sounds like you haven’t read much MIRI stuff?).
        Rohin Shah 14 Nov 2021 10:28 UTC
        LW: 49 AF: 29
        AF Parent
        (Responding to entire comment thread) Rob, I don’t think you’re modeling what MIRI looks like from the outside very well.
        There’s a lot of public stuff from MIRI on a cluster that has as central elements decision theory and logic (logical induction, Vingean reflection, FDT, reflective oracles, Cartesian Frames, Finite Factored Sets...)
        There was once an agenda (AAMLS) that involved thinking about machine learning systems, but it was deprioritized, and the people working on it left MIRI.
        There was a non-public agenda that involved Haskell programmers. That’s about all I know about it. For all I know they were doing something similar to the modal logic work I’ve seen in the past.
        Eliezer frequently talks about how everyone doing ML work is pursuing dead ends, with potentially the exception of Chris Olah. Chris’s work is not central to the cluster I would call “experimentalist”.
        There has been one positive comment on the KL-divergence result in summarizing from human feedback. That wasn’t the main point of that paper and was an extremely predictable result.
        There has also been one positive comment on Redwood Research, which was founded by people who have close ties to MIRI. The current steps they are taking are not dramatically different from what other people have been talking about and/or doing.
        There was a positive-ish comment on aligning narrowly superhuman models, though iirc it gave off more of an impression of “well, let’s at least die in a slightly more dignified way”.
        I don’t particularly agree with Adam’s comments, but it does not surprise me that someone could come to honestly believe the claims within them.
        Rob Bensinger 15 Nov 2021 2:08 UTC
        LW: 29 AF: 13
        AF Parent
        So, the point of my comments was to draw a contrast between having a low opinion of “experimental work and not doing only decision theory and logic”, and having a low opinion of “mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc.” I didn’t intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as ‘oh yeah, that’s not true and wasn’t what I was thinking’) is not great for discussion.
        I think it’s true that ‘MIRI is super not into most ML alignment work’, and I think it used to be true that MIRI put almost all of its research effort into HRAD-ish work, and regardless, this all seems like a completely understandable cached impression to have of current-MIRI. If I wrote stuff that makes it sound like I don’t think those views are common, reasonable, etc., then I apologize for that and disavow the thing I said.
        But this is orthogonal to what I thought I was talking about, so I’m confused about what seems to me like a topic switch. Maybe the implied background view here is:
        ‘Adam’s elision between those two claims was a totally normal level of hyperbole/imprecision, like you might find in any LW comment. Picking on word choices like “only decision theory and logic” versus “only research that’s clustered near decision theory and logic in conceptspace”, or “contempt with experimental work” versus “assigning low EV to typical instances of empirical ML alignment work”, is an isolated demand for rigor that wouldn’t make sense as a general policy and isn’t, in any case, the LW norm.’
        Is that right?
        dxu 15 Nov 2021 5:10 UTC
        LW: 38 AF: 25
        AF Parent
        So, the point of my comments was to draw a contrast between having a low opinion of “experimental work and not doing only decision theory and logic”, and having a low opinion of “mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc.” I didn’t intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as ‘oh yeah, that’s not true and wasn’t what I meant’) is not great for discussion.
        It occurs to me that part of the problem may be precisely that Adam et al. don’t think there’s a large difference between these two claims (that actually matters). For example, when I query my (rough, coarse-grained) model of [your typical prosaic alignment optimist], the model in question responds to your statement with something along these lines:
        If you remove “mainstream ML alignment work, and nearly all work outside of the HRAD-ish cluster of decision theory, logic, etc.” from “experimental work”, what’s left? Perhaps there are one or two (non-mainstream, barely-pursued) branches of “experimental work” that MIRI endorses and that I’m not aware of—but even if so, that doesn’t seem to me to be sufficient to justify the idea of a large qualitative difference between these two categories.
        In a similar vein to the above: perhaps one description is (slightly) hyperbolic and the other isn’t. But I don’t think replacing the hyperbolic version with the non-hyperbolic version would substantially change my assessment of MIRI’s stance; the disagreement feels non-cruxy to me. In light of this, I’m not particularly bothered by either description, and it’s hard for me to understand why you view it as such an important distinction.
        Moreover: I don’t think [my model of] the prosaic alignment optimist is being stupid here. I think, to the extent that his words miss an important distinction, it is because that distinction is missing from his very thoughts and framing, not because he happened to use choose his words somewhat carelessly when attempting to describe the situation. Insofar as this is true, I expect him to react to your highlighting of this distinction with (mostly) bemusement, confusion, and possibly even some slight suspicion (e.g. that you’re trying to muddy the waters with irrelevant nitpicking).
        To be clear: I don’t think you’re attempting to muddy the waters with irrelevant nitpicking here. I think you think the distinction in question is important because it’s pointing to something real, true, and pertinent—but I also think you’re underestimating how non-obvious this is to people who (A) don’t already deeply understand MIRI’s view, and (B) aren’t in the habit of searching for ways someone’s seemingly pointless statement might actually be right.
        I don’t consider myself someone who deeply understands MIRI’s view. But I do want to think of myself as someone who, when confronted with a puzzling statement [from someone whose intellectual prowess I generally respect], searches for ways their statement might be right. So, here is my attempt at describing the real crux behind this disagreement:
        (with the caveat that, as always, this is my view, not Rob’s, MIRI’s, or anybody else’s)
        (and with the additional caveat that, even if my read of the situation turns out to be correct, I think in general the onus is on MIRI to make sure they are understood correctly, rather than on outsiders to try to interpret them—at least, assuming that MIRI wants to make sure they’re understood correctly, which may not always be the best use of researcher time)
        I think the disagreement is mostly about MIRI’s counterfactual behavior, not about their actual behavior. I think most observers (including both Adam and Rob) would agree that MIRI leadership has been largely unenthusiastic about a large class of research that currently falls under the umbrella “experimental work”, and that the amount of work in this class MIRI has been unenthused about significantly outweighs the amount of work they have been excited about.
        Where I think Adam and Rob diverge is in their respective models of the generator of this observed behavior. I think Adam (and those who agree with him) thinks that the true boundary of the category [stuff MIRI finds unpromising] roughly coincides with the boundary of the category [stuff most researchers would call “experimental work”], such that anything that comes too close to “running ML experiments and seeing what happens” will be met with an immediate dismissal from MIRI. In other words, [my model of] Adam thinks MIRI’s generator is configured such that the ratio of “experimental work” they find promising-to-unpromising would be roughly the same across many possible counterfactual worlds, even if each of those worlds is doing “experiments” investigating substantially different hypotheses.
        Conversely, I think Rob thinks the true boundary of the category [stuff MIRI finds unpromising] is mostly unrelated to the boundary of the category [stuff most researchers would call “experimental work”], and that—to the extent MIRI finds most existing “experimental work” unpromising—this is mostly because the existing work is not oriented along directions MIRI finds promising. In other words, [my model of] Rob thinks MIRI’s generator is configured such that the ratio of “experimental work” they find promising-to-unpromising would vary significantly across counterfactual worlds where researchers investigate different hypotheses; in particular, [my model of] Rob thinks MIRI would find most “experimental work” highly promising in the world where the “experiments” being run are those whose results Eliezer/Nate/etc. would consider difficult to predict in advance, and therefore convey useful information regarding the shape of the alignment problem.
        I think Rob’s insistence on maintaining the distinction between having a low opinion of “experimental work and not doing only decision theory and logic”, and having a low opinion of “mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc.” is in fact an attempt to gesture at the underlying distinction outlined above, and I think that his stringency on this matter makes significantly more sense in light of this. (Though, once again, I note that I could be completely mistaken in everything I just wrote.)
        Assuming, however, that I’m (mostly) not mistaken, I think there’s an obvious way forward in terms of resolving the disagreement: try to convey the underlying generators of MIRI’s worldview. In other words, do the thing you were going to do anyway, and save the discussions about word choice for afterwards.
        Expand this thread
        Rohin Shah 15 Nov 2021 9:57 UTC
        LW: 9 AF: 6
        AF Parent
        ^ This response is great.
        I also think I naturally interpreted the terms in Adam’s comment as pointing to specific clusters of work in today’s world, rather than universal claims about all work that could ever be done. That is, when I see “experimental work and not doing only decision theory and logic”, I automatically think of “experimental work” as pointing to a specific cluster of work that exists in today’s world (which we might call mainstream ML alignment), rather than “any information you can get by running code”. Whereas it seems you interpreted it as something closer to “MIRI thinks there isn’t any information to get by running code”.
        My brain insists that my interpretation is the obvious one and is confused how anyone (within the AI alignment field, who knows about the work that is being done) could interpret it as the latter. (Although the existence of non-public experimental work that isn’t mainstream ML is a good candidate for how you would start to interpret “experimental work” as the latter.) But this seems very plausibly a typical mind fallacy.
        EDIT: Also, to explicitly say it, sorry for misunderstanding what you were trying to say. I did in fact read your comments as saying “no, MIRI is not categorically against mainstream ML work, and MIRI is not only working on HRAD-ish stuff like decision theory and logic, and furthermore this should be pretty obvious to outside observers”, and now I realize that is not what you were saying.
        Rob Bensinger 15 Nov 2021 6:11 UTC
        LW: 2 AF: 1
        AF Parent
        This is a good comment! I also agree that it’s mostly on MIRI to try to explain its views, not on others to do painstaking exegesis. If I don’t have a ready-on-hand link that clearly articulates the thing I’m trying to say, then it’s not surprising if others don’t have it in their model.
        And based on these comments, I update that there’s probably more disagreement-about-MIRI than I was thinking, and less (though still a decent amount of) hyperbole/etc. If so, sorry about jumping to conclusions, Adam!
        jsteinhardt 15 Nov 2021 3:22 UTC
        LW: 24 AF: 15
        AF Parent
        Not sure if this helps, and haven’t read the thread carefully, but my sense is your framing might be eliding distinctions that are actually there, in a way that makes it harder to get to the bottom of your disagreement with Adam. Some predictions I’d have are that:
        * For almost any experimental result, a typical MIRI person (and you, and Eliezer) would think it was less informative about AI alignment than I would.
        * For almost all experimental results you would think they were so much less informative as to not be worthwhile.
        * There’s a small subset of experimental results that we would think are comparably informative, and also a some that you would find much more informative than I would.
        (I’d be willing to take bets on these or pick candidate experiments to clarify this.)
        In addition, a consequence of these beliefs is that compared to me you think we should be spending way more time sitting around thinking about stuff, and way less time doing experiments, than I do.
        I would agree with you that “MIRI hates all experimental work” / etc. is not a faithful representation of this state of affairs, but I think there is nevertheless an important disagreement MIRI has with typical ML people, and that the disagreement is primarily about what we can learn from experiments.
        Expand this thread
        Rob Bensinger 15 Nov 2021 5:48 UTC
        LW: 42 AF: 21
        AF Parent
        I would agree with you that “MIRI hates all experimental work” / etc. is not a faithful representation of this state of affairs, but I think there is nevertheless an important disagreement MIRI has with typical ML people, and that the disagreement is primarily about what we can learn from experiments.
        Ooh, that’s really interesting. Thinking about it, I think my sense of what’s going on is (and I’d be interested to hear how this differs from your sense):
        Compared to the average alignment researcher, MIRI tends to put more weight on reasoning like ‘sufficiently capable and general AI is likely to have property X as a strong default, because approximately-X-ish properties don’t seem particularly difficult to implement (e.g., they aren’t computationally intractable), and we can see from first principles that agents will be systematically less able to get what they want when they lack property X’. My sense is that MIRI puts more weight on arguments like this for reasons like:
        We’re more impressed with the track record of inside-view reasoning in science.
        I suspect this is partly because the average alignment researcher is impressed with how unusually-poorly inside-view reasoning has done in AI—many have tried to gain a deep understanding of intelligence, and many have failed—whereas (for various reasons) MIRI is less impressed with this, and defaults more to the base rate for other fields, where inside-view reasoning has more extraordinary feats under its belt.
        We’re more wary of “modest epistemology”, which we think often acts like a self-fulfilling prophecy. (You don’t practice trying to mechanistically model everything yourself, you despair of overcoming biases, you avoid thinking thoughts that would imply you’re a leader or pioneer because that feels arrogant, so you don’t gain as much skill or feedback in those areas.)
        Compared to the average alignment researcher, MIRI tends to put less weight on reasoning like ‘X was true about AI in 1990, in 2000, in 2010, and in 2020; therefore X is likely to be true about AGI when it’s developed’. This is for a variety of reasons, including:
        MIRI is more generally wary of putting much weight on surface generalizations, if we don’t have an inside-view reason to expect the generalization to keep holding.
        MIRI thinks AGI is better thought of as ‘a weird specific sort of AI’, rather than as ‘like existing AI but more so’.
        Relatedly, MIRI thinks AGI is mostly insight-bottlenecked (we don’t know how to build it), rather than hardware-bottlenecked. Progress on understanding AGI is much harder to predict than progress on hardware, so we can’t derive as much from trends.
        Applying this to experiments:
        Some predictions I’d have are that:
        * For almost any experimental result, a typical MIRI person (and you, and Eliezer) would think it was less informative about AI alignment than I would.
        * For almost all experimental results you would think they were so much less informative as to not be worthwhile.
        * There’s a small subset of experimental results that we would think are comparably informative, and also a some that you would find much more informative than I would.
        I’d have the same prediction, though I’m less confident that ‘pessimism about experiments’ is doing much work here, vs. ‘pessimism about alignment’. To distinguish the two, I’d want to look at more conceptual work too, where I’d guess MIRI is also more pessimistic than you (though probably the gap will be smaller?).
        I do expect there to be some experiment-specific effect. I don’t know your views well, but if your views are sufficiently like my mental model of ‘typical alignment researcher whose intuitions differ a lot from MIRI’s’, then my guess would be that the disagreement comes down to the above two factors.
        1 (more trust in inside view): For many experiments, I’m imagining Eliezer saying ‘I predict the outcome will be X’, and then the outcome is X, and the Modal Alignment Researcher says: ‘OK, but now we’ve validated your intuition—you should be much more confident, and that update means the experiment was still useful.’
        
        To which Hypothetical Eliezer says: ‘I was already more than confident enough. Empirical data is great—I couldn’t have gotten this confident without years of honing my models and intuitions through experience—but now that I’m there, I don’t need to feign modesty and pretend I’m uncertain about everything until I see it with my own eyes.’
        
        2 (less trust in AGI sticking to trends): For many obvious ML experiments Eliezer can’t predict the outcome of, I expect Eliezer to say ‘This experiment isn’t relevant, because factors X, Y, and Z give us strong reason to think that the thing we learn won’t generalize to AGI.’
        
        Which ties back in to 1 as well, because if you don’t think we can build very reliable models in AI without constant empirical feedback, you’ll rarely be confident of abstract reasons X/Y/Z to expect a difference between current ML and AGI, since you can’t go walk up to an AGI today and observe what it’s like.
        
        (You also won’t be confident that X/Y/Z don’t hold—all the possibilities will seem plausible until AGI is actually here, because you generally don’t trust yourself to reason your way to conclusions with much confidence.)
        jsteinhardt 15 Nov 2021 17:14 UTC
        LW: 15 AF: 9
        AF Parent
        Thanks. For time/brevity, I’ll just say which things I agree / disagree with:
        
        > sufficiently capable and general AI is likely to have property X as a strong default [...]
        I generally agree with this, although for certain important values of X (such as “fooling humans for instrumental reasons”) I’m probably more optimistic than you that there will be a robust effort to get not-X, including by many traditional ML people. I’m also probably more optimistic (but not certain) that those efforts will succeed.
        [inside view, modest epistemology]: I don’t have a strong take on either of these. My main take on inside views is that they are great for generating interesting and valuable hypotheses, but usually wrong on the particulars.
        > less weight on reasoning like ‘X was true about AI in 1990, in 2000, in 2010, and in 2020; therefore X is likely to be true about AGI when it’s developed
        
        I agree, see my post On the Risks of Emergent Behavior in Foundation Models. In the past I think I put too much weight on this type of reasoning, and also think most people in ML put too much weight on it.
        
        > MIRI thinks AGI is better thought of as ‘a weird specific sort of AI’, rather than as ‘like existing AI but more so’.
        
        Probably disagree but hard to tell. I think there will both be a lot of similarities and a lot of differences.
        
        > AGI is mostly insight-bottlenecked (we don’t know how to build it), rather than hardware-bottlenecked
        
        Seems pretty wrong to me. We probably need both insight and hardware, but the insights themselves are hardware-bottlenecked: once you can easily try lots of stuff and see what happens, insights are much easier, see Crick on x-ray crystallography for historical support (ctrl+f for Crick).
        
        > I’d want to look at more conceptual work too, where I’d guess MIRI is also more pessimistic than you
        
        I’m more pessimistic than MIRI about HRAD, though that has selection effects. I’ve found conceptual work to be pretty helpful for pointing to where problems might exist, but usually relatively confused about how to address them or how specifically they’re likely to manifest. (Which is to say, overall highly valuable, but consistent with my take above on inside views.)
        
        [experiments are either predictable or uninformative]: Seems wrong to me. As a concrete example: Do larger models have better or worse OOD generalization? I’m not sure if you’d pick “predictable” or “uninformative”, but my take is:
        * The outcome wasn’t predictable: within ML there are many people who would have taken each side. (I personally was on the wrong side, i.e. predicting “worse”.)
        * It’s informative, for two reasons: (1) It shows that NNs “automatically” generalize more than I might have thought, and (2) Asymptotically, we expect the curve to eventually reverse, so when does that happen and how can we study it?
        See also my take on Measuring and Forecasting Risks from AI, especially the section on far-off risks.
        
        > Most ML experiments either aren’t about interpretability and ‘cracking open the hood’, or they’re not approaching the problem in a way that MIRI’s excited by.
        
        Would agree with “most”, but I think you probably meant something like “almost all”, which seems wrong. There’s lots of people working on interpretability, and some of the work seems quite good to me (aside from Chris, I think Noah Goodman, Julius Adebayo, and some others are doing pretty good work).
        Eliezer Yudkowsky 16 Nov 2021 3:51 UTC
        LW: 31 AF: 17
        AF Parent
        I’m not (retroactively in imaginary prehindsight) excited by this problem because neither of the 2 possible answers (3 possible if you count “the same”) had any clear-to-my-model relevance to alignment, or even AGI. AGI will have better OOD generalization on capabilities than current tech, basically by the definition of AGI; and then we’ve got less-clear-to-OpenPhil forces which cause the alignment to generalize more poorly than the capabilities did, which is the Big Problem. Bigger models generalizing better or worse doesn’t say anything obvious to any piece of my model of the Big Problem. Though if larger models start generalizing more poorly, then it takes longer to stupidly-brute-scale to AGI, which I suppose affects timelines some, but that just takes timelines from unpredictable to unpredictable sooo.
        If we qualify an experiment as interesting when it can tell anyone about anything, then there’s an infinite variety of experiments “interesting” in this sense and I could generate an unlimited number of them. But I do restrict my interest to experiments which can not only tell me something I don’t know, but tell me something relevant that I don’t know. There is also something to be said for opening your eyes and staring around, but even then, there’s an infinity of blank faraway canvases to stare at, and the trick is to go wandering with your eyes wide open someplace you might see something really interesting. Others will be puzzled and interested by different things and I don’t wish them ill on their spiritual journeys, but I don’t expect the vast majority of them to return bearing enlightenment that I’m at all interested in, though now and then Miles Brundage tweets something (from far outside of EA) that does teach me some little thing about cognition.
        I’m interested at all in Redwood Research’s latest project because it seems to offer a prospect of wandering around with our eyes open asking questions like “Well, what if we try to apply this nonviolence predicate OOD, can we figure out what really went into the ‘nonviolence’ predicate instead of just nonviolence?” or if it works maybe we can try training on corrigibility and see if we can start to manifest the tiniest bit of the predictable breakdowns, which might manifest in some different way.
        Do larger models generalize better or more poorly OOD? It’s a relatively basic question as such things go, and no doubt of interest to many, and may even update our timelines from ‘unpredictable’ to ‘unpredictable’, but… I’m trying to figure out how to say this, and I think I should probably accept that there’s no way to say it that will stop people from trying to sell other bits of research as Super Relevant To Alignment… it’s okay to have an understanding of reality which makes narrower guesses than that about which projects will turn out to be very relevant.
        adamShimi 16 Nov 2021 17:30 UTC
        LW: 2 AF: 1
        AF Parent
        I’m interested at all in Redwood Research’s latest project because it seems to offer a prospect of wandering around with our eyes open asking questions like “Well, what if we try to apply this nonviolence predicate OOD, can we figure out what really went into the ‘nonviolence’ predicate instead of just nonviolence?” or if it works maybe we can try training on corrigibility and see if we can start to manifest the tiniest bit of the predictable breakdowns, which might manifest in some different way.
        Trying to rephrase it in my own words (which will necessarily lose some details), are you interested in Redwood’s research because it might plausibly generate alignment issues and problems that are analogous to the real problem within the safer regime and technology we have now? Which might tell us for example “what aspect of these predictable problems crop up first, and why?”
        Eliezer Yudkowsky 19 Nov 2021 6:40 UTC
        LW: 21 AF: 13
        AF Parent
        are you interested in Redwood’s research because it might plausibly generate alignment issues and problems that are analogous to the real problem within the safer regime and technology we have now?
        It potentially sheds light on small subpieces of things that are particular aspects that contribute to the Real Problem, like “What actually went into the nonviolence predicate instead of just nonviolence?” Much of the Real Meta-Problem is that you do not get things analogous to the full Real Problem until you are just about ready to die.
        Rob Bensinger 15 Nov 2021 6:25 UTC
        LW: 13 AF: 10
        AF Parent
        I suspect a third important reason is that MIRI thinks alignment is mostly about achieving a certain kind of interpretability/understandability/etc. in the first AGI systems. Most ML experiments either aren’t about interpretability and ‘cracking open the hood’, or they’re not approaching the problem in a way that MIRI’s excited by.
        E.g., if you think alignment research is mostly about testing outer reward function to see what first-order behavior they produce in non-AGI systems, rather than about ‘looking in the learned model’s brain’ to spot mesa-optimization and analyze what that optimization is ultimately ‘trying to do’ (or whatever), then you probably won’t produce stuff that MIRI’s excited about regardless of how experimental vs. theoretical your work is.
        (In which case, maybe this is not actually a crux for the usefulness of most alignment experiments, and is instead a crux for the usefulness of most alignment research in general.)
        Rob Bensinger 15 Nov 2021 5:52 UTC
        LW: 3 AF: 2
        AF Parent
        (I suspect there are a bunch of other disagreements going into this too, including basic divergences on questions like ‘What’s even the point of aligning AGI? What should humanity do with aligned AGI once it has it?’.)
        orthonormal 14 Nov 2021 19:58 UTC
        LW: 21 AF: 11
        AF Parent
        One tiny note: I was among the people on AAMLS; I did leave MIRI the next year; and my reasons for so doing are not in any way an indictment of MIRI. (I was having some me-problems.)
        I still endorse MIRI as, in some sense, being the adults in the AI Safety room, which has… disconcerting effects on my own level of optimism.
        adamShimi 14 Nov 2021 15:01 UTC
        LW: 9 AF: 7
        AF Parent
        Not planning to answer more on this thread, but given how my last messages seem to have confused you, here is my last attempt of sharing my mental model (so you can flag in an answer where I’m wrong in your opinion for readers of this thread)
        Also, I just checked on the publication list, and I’ve read or skimmed most things MIRI published since 2014 (including most newsletters and blog posts on MIRI website).
        My model of MIRI is that initially, there was a bunch of people including EY who were working mostly on decision theory stuff, tiling, model theory, the sort of stuff I was pointing at. That predates Nate’s arrival, but in my model it becomes far more legible after that (so circa 2014/2015). In my model, I call that “old school MIRI”, and that was a big chunk of what I was pointing out in my original comment.
        Then there are a bunch of thing that seem to have happened:
        Newer people (Abram and Scott come to mind, but mostly because they’re the one who post on the AF and who I’ve talked to) join this old-school MIRI approach and reshape it into Embedded Agency. Now this new agenda is a bit different from the old-school MIRI work, but I feel like it’s still not that far from decision theory and logic (with maybe a stronger emphasis on the bayesian part for stuff like logical induction). That might be a part where we’re disagreeing.
        A direction related to embedded agency and the decision theory and logic stuff, but focused on implementations through strongly typed programming languages like Haskell and type theory. That’s technically practical, but in my mental model this goes in the same category as “decision theory and logic stuff”, especially because that sort of programming is very close to logic and natural deduction.
        MIRI starts it’s ML-focused agenda, which you already mentioned. The impression I still have is that this didn’t lead to much published work that was actually experimental, instead focusing on recasting questions of alignment through ML theory. But I’ve updated towards thinking MIRI has invested efforts into looking at stuff from a more prosaic angle, based on looking more into what has been published there, because some of these ML papers had flown under my radar (there’s also the difficulty that when I read a paper by someone who has a position elsewhere now — say Ryan Carey or Stuart Armstrong — I don’t think MIRI but I think of their current affiliation, even though the work was supported by MIRI (and apparently Stuart is still supported by MIRI)). This is the part of the model where I expect that we might have very different models because of your knowledge of what was being done internally and never released.
        Some new people hired by MIRI fall into what I call the “Bells Lab MIRI” model, where MIRI just hires/funds people that have different approaches from them, but who they think are really bright (Evan and Vanessa come to mind, although I don’t know if that’s the though process that went into hiring them).
        Based on that model and some feedback and impressions I’ve gathered from people of some MIRI researchers being very doubtful of experimental work, that lead to my “all experimental work is useless”. I tried to include Redwood and Chris Olah’s work in there with the caveat (which is a weird model but makes sense if you have a strong prior for “experimental work is useless for MIRI”).
        Our discussion made me think that there’s probably far better generators for this general criticism of experimental work, and that they would actually make more sense than “experimental work is useless except this and that”.