To be fair, the alignment community has caused some confusion by describing models as more or less “aligned”
I am curious what you are thinking about. My sense is the trend of calling models “aligned” started with OpenAI and ChatGPT, and wasn’t really driven by anyone in the AI Alignment community (and is something that I complained a lot about at the time, because it did indeed seem like a thing that predictably would lead to confusion).
It’s plausible to me that Paul also contributed to this, but my sense is most senior alignment people have been very hesitant to use “alignment” as a generalized term to describe the behavior of present-day models.
Fair point. I’ve now removed that section from the post (and also, unrelatedly, renamed the post).
I was trying to make a point about people wanting to ensure that AI in general (not just current models) is “aligned”, but in hindsight I think people usually talk about alignment with human values or similar. I have some qualms about that, but I’ll discuss them in a different post.
Pretty sure Anthropic’s early assistant stuff used the word this way too: see e.g. Bai et al., https://arxiv.org/abs/2204.05862
But yes, people complained about it a lot at the time.
Yeah, my sense is others (like Anthropic) followed along after OpenAI did that, though it seemed to me mostly to be against consensus in the alignment field (though I agree it’s messy).
(The Anthropic paper I cited predates ChatGPT by 7 months)
Huh, interesting. Maybe the OpenAI statements about their models being “more aligned” came earlier than that in the context of InstructGPT? I definitely feel like I remember some Twitter threads and LW comment threads about it in the context of OpenAI announcements, and nothing in the context of Anthropic announcements.
This is likely not the first instance, but OpenAI was already using the word “aligned” in this way in 2021 in the Codex paper.
https://arxiv.org/abs/2107.03374 (section 7.2)
Ah, you’re correct, it’s from the original InstructGPT release in Jan 2022:
https://openai.com/index/instruction-following/