To be fair, the alignment community has caused some confusion by describing models as more or less “aligned”
I am curious what you are thinking about. My sense is the trend of calling models “aligned” started with OpenAI and ChatGPT, and wasn’t really driven by anyone in the AI Alignment community (and is something that I complained a lot about at the time, because it did indeed seem like a thing that predictably would lead to confusion).
It’s plausible to me that Paul also contributed to this, but my sense is most senior alignment people have been very hesitant to use “alignment” as a generalized term to describe the behavior of present-day models.
Fair point. I’ve now removed that section from the post (and also, unrelatedly, renamed the post).
I was trying to make a point about people wanting to ensure that AI in general (not just current models) is “aligned”, but in hindsight I think people usually talk about alignment with human values or similar. I have some qualms about that, but I’ll discuss them in a different post.
Pretty sure Anthropic’s early assistant stuff used the word this way too: see e.g. Bai et al., https://arxiv.org/abs/2204.05862
But yes, people complained about it a lot at the time.
Yeah, my sense is others (like Anthropic) followed along after OpenAI did that, though it seemed to me mostly to be against consensus in the alignment field (though I agree it’s messy).
(The Anthropic paper I cited predates ChatGPT by 7 months)
Huh, interesting. Maybe the OpenAI statements about their models being “more aligned” came earlier than that in the context of InstructGPT? I definitely feel like I remember some Twitter threads and LW comment threads about it in the context of OpenAI announcements, and nothing in the context of Anthropic announcements.
This is likely not the first instance, but OpenAI was already using the word “aligned” in this way in 2021 in the Codex paper.
https://arxiv.org/abs/2107.03374 (section 7.2)
Ah, you’re correct, it’s from the original InstructGPT release in Jan 2022:
https://openai.com/index/instruction-following/