Daniel Kokotajlo comments on Defining alignment research

Daniel Kokotajlo Aug 22, 2024, 3:45 PM
LW: 7 AF: 5
3
AF
Answering your first question. If you truly aren’t trying to make AGI, and you truly aren’t trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) …great! That’s neither capabilities nor alignment research afaict, but basic science. Good for you. I still think it would be better if you switched to doing alignment research (e.g. you could switch to ‘I want to understand how neural networks work… so that I can understand how a prosaic AGI system being RLHF’d might behave when presented with genuine credible opportunities for takeover + lots of time to think about what to do’) but I don’t feel so strongly about it as I would if you were doing capabilities research.

re: judging the research itself rather than the motivations: idk I think it’s actually easier, and less subjective, to judge the motivations, at least in this case. People usually just state what their motivations are. Also I’m not primarily trying to judge people, I’m trying to exhort people. I’m giving people advice about what they should do to make the world a better place; I’m not e.g. talking about which research should be published and which should be restricted (I’m generally in favor of publishing research with maybe a few exceptions, but I think corporations on the margin should publish more). Moreover it’s much easier for the researcher to judge their own motivations than for them to judge the long-term impact of their work or to fit it into your diagram.
- Richard_Ngo Aug 24, 2024, 6:37 PM
  LW: 9 AF: 7
  0
  AF Parent
  “If you truly aren’t trying to make AGI, and you truly aren’t trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) …great! That’s neither capabilities nor alignment research afaict, but basic science.”
  Consider Chris Olah, who I think has done more than almost anyone else to benefit alignment. It would be very odd if we had a definition of alignment research where you could read all of Chris’s interpretability work and still not know whether or not he’s an “alignment researcher”. On your definition, when I read a paper by a researcher I haven’t heard of, I don’t know anything about whether it’s alignment research or not until I stalk them on Facebook and find out how socially proximal they are to the AI safety community. That doesn’t seem great.
  Back to Chris. Because I’ve talked to Chris and read other stuff by him, I’m confident that he does care about alignment. But I still don’t know whether his actual motivations are more like 10% intrinsic interest in how neural networks work and 90% in alignment, or vice versa, or anything in between. (It’s probably not even a meaningful thing to measure.) It does seem likely to me that the ratio of how much intrinsic interest he has in how neural networks work, to how much he cares about alignment, is significantly higher than that of most alignment researchers, and I don’t think that’s a coincidence—based on the history of science (Darwin, Newton, etc) intrinsic interest in a topic seems like one of the best predictors of actually making the most important breakthroughs.
  In other words: I think your model of what produces more useful research from an alignment perspective overprioritizes first-order effects (if people care more they’ll do more relevant work) and ignores the second-order effects that IMO are more important (1. Great breakthroughs seem, historically, to be primarily motivated by intrinsic interest; and 2. Creating research communities that are gatekept by people’s beliefs/motivations/ideologies is corrosive, and leads to political factionalism + ingroupiness rather than truth-seeking).
  “I’m not primarily trying to judge people, I’m trying to exhort people”
  Well, there are a lot of grants given out for alignment research. Under your definition, those grants would only be given to people who express the right shibboleths.
  I also think that the best exhortation of researchers mostly looks like nerdsniping them, and the way to do that is to build a research community that is genuinely very interested in a certain set of (relatively object-level) topics. I’d much rather an interpretability team hire someone who’s intrinsically fascinated by neural networks (but doesn’t think much about alignment) than someone who deeply cares about making AI go well (but doesn’t find neural nets very interesting). But any step in the pipeline that prioritizes “alignment researchers” (like: who gets invited to alignment workshops, who gets alignment funding or career coaching, who gets mentorship, etc) will prioritize the latter over the former if they’re using your definition.
  - Buck Aug 26, 2024, 5:28 PM
    LW: 10 AF: 7
    5
    AF Parent
    “I’d much rather an interpretability team hire someone who’s intrinsically fascinated by neural networks (but doesn’t think much about alignment) than someone who deeply cares about making AI go well (but doesn’t find neural nets very interesting).”
    I disagree; I’d rather they hire someone who cares about making AI go well. E.g. I like Sam Marks’s work on making interpretability techniques useful (e.g. here), and I think he gets a lot of leverage compared to most interpretability researchers via trying to do stuff that’s in the direction of being useful. (Though note that his work builds on the work of non-backchaining interpretability researchers.)
  - habryka Aug 24, 2024, 8:09 PM
    LW: 7 AF: 5
    −7
    AF Parent
    (FWIW I think Chris Olah’s work is approximately irrelevant to alignment and indeed this is basically fully explained by the motivational dimension)
    - Richard_Ngo Aug 25, 2024, 3:43 PM
      LW: 3 AF: 3
      0
      AF Parent
      Whose work is relevant, according to you?
      - habryka Aug 25, 2024, 5:21 PM
        LW: 4 AF: 3
        0
        AF Parent
        Lots of people’s work:
        
        Paul’s work (ELK more than RLHF, though it was useful to see what happens when you throw RL at LLMs, in a way that’s kind of similar to how I do get some value out of Chris’s work)
        Eliezer’s work
        Nate’s work
        Holden’s writing on Cold Takes
        Ajeya’s work
        Wentworth’s work
        The debate stuff
        Redwood’s work
        Bostrom’s work
        Evan’s work
        Scott and Abram’s work
        
        There is of course still huge variance in how relevant these different people’s work is and how much it goes for the throat, but all of these seem more relevant to AI Alignment/AI-not-kill-everyonism than Chris’s work (which again, I found interesting, but not like super interesting).
          - Evan_Gaensbauer Sep 24, 2024, 8:13 AM
            2 points
            0
            Parent
            Do you mean Evan Hubinger, Evan R. Murphy, or a different Evan? (I would be surprised and humbled if it was me, though my priors on that are low.)
            - habryka Sep 24, 2024, 2:16 PM
              8 points
              0
              Parent
              Hubinger
      - sunwillrise Aug 25, 2024, 4:39 PM
        4 points
        −2
        Parent
        Definitely not trying to put words in Habryka’s mouth, but I did want to make a concrete prediction to test my understanding of his position; I expect he will say that:
        the only work which is relevant is the one that tries to directly tackle what Nate Soares described as “the hard bits of the alignment challenge” (the identity of which Habryka basically agrees with Soares about)
        nobody is fully on the ball yet
        but agent foundations-like research by MIRI-aligned or formerly MIRI-aligned people (Vanessa Kosoy, Abram Demski, etc.) is the one that’s most relevant, in theory
        however, in practice, even that is kinda irrelevant because timelines are short and that work is going along too slowly to be useful even for deconfusion purposes
        Edit: I was wrong.
    - Leon Lang Aug 24, 2024, 8:47 PM
      2 points
      0
      Parent
      To clarify: are you saying that since you perceive Chris Olah as mostly intrinsically caring about understanding neural networks (instead of mostly caring about alignment), you conclude that his work is irrelevant to alignment?
      - habryka Aug 24, 2024, 8:59 PM
        6 points
        2
        Parent
        No, I have detailed inside view models of the alignment problem, and under those models consider Chris Olah’s work to be interesting but close to irrelevant (or to be about as relevant as the work of top capability researchers, whose work, to be clear, does have some relevance since of course understanding how to make systems better is relevant for understanding how AGI will behave, but where the relevance is pretty limited).
  - Daniel Kokotajlo Aug 25, 2024, 1:14 AM
    LW: 2 AF: 2
    0
    AF Parent
    I feel like we are misunderstanding each other, and I think it’s at least in large part my fault.
    
    I definitely agree that we don’t want to be handing out grants or judging people on the basis of what shibboleths they spout or what community they are part of. In fact I agree with most of what you’ve said above, except for when you start attributing stuff to me.
    I think that grantmakers should evaluate research proposals not on the basis of the intentions of the researcher, but on the basis of whether the proposed research seems useful for alignment. This is not your view though, right? You are proposing something else?