I’m not sure what order the history happened in and whether “AI Existential Safety” got rebranded into “AI Alignment” (my impression is that AI Alignment was first used to mean existential safety, and maybe this was a bad term, but it wasn’t a rebrand)
There’s the additional problem where “AI Existential Safety” easily gets rounded to “AI Safety” which often in practice means “self driving cars” as well as overlapping with an existing term-of-art “community safety” which means things like harassment.
I don’t have a good contender for a short phrase that is actually reasonable to say that conveys “Technical AI Existential Safety” work.
But if we had such a name, I would be in favor of renaming the AI Alignment Forum to an easy-to-say variation on “The Technical Research for AIdontkilleveryoneism Forum”. (I think this was always the intended subject matter of the forum.) And that forum (convergently) has Alignment research on it, but only insofar as it’s relevant to “Technical Research for AIdontkilleveryoneism”.
“AI Safety” which often in practice means “self driving cars”
This may have been true four years ago, but ML researchers at leading labs rarely work directly on self-driving cars (e.g., research on sensor fusion). AV has not been hot in quite a while. Fortunately, now that AGI-like chatbots are popular, we’re moving out of the realm of talking about making very narrow systems safer. The association with AV was not that bad, since it was about getting many nines of reliability/extreme reliability, which was a useful subgoal. Unfortunately, the world has not been able to make a DL model completely reliable in any specific domain (even MNIST).
Of course, they weren’t talking about x-risks, but neither are the industry researchers who use the word “alignment” today to mean they’re fine-tuning a model to be more knowledgeable or making models better satisfy capabilities wants (sometimes dressed up as “human values”).
If you want a word that reliably denotes catastrophic risks that is also mainstream, you’ll need to make catastrophic risk ideas mainstream. Expect it to be watered down for some time, or expect it not to go mainstream.
Unfortunately, I think even “catastrophic risk” has a high potential to be watered down and be applied to situations where dozens as opposed to millions/billions die. Even existential risk has this potential, actually, but I think it’s a safer bet.
I’m not sure what order the history happened in and whether “AI Existential Safety” got rebranded into “AI Alignment” (my impression is that AI Alignment was first used to mean existential safety, and maybe this was a bad term, but it wasn’t a rebrand)
There was a pretty extensive discussion about this between Paul Christiano and me. tl;dr “AI Alignment” clearly had a broader (but not very precise) meaning than “How to get AI systems to try to do what we want” when it first came into use. Paul later used “AI Alignment” for his narrower meaning, but after that discussion, switched to using “Intent Alignment” for this instead.
tl;dr “AI Alignment” clearly had a broader (but not very precise) meaning than “How to get AI systems to try to do what we want” when it first came into use. Paul later used “AI Alignment” for his narrower meaning, but after that discussion, switched to using “Intent Alignment” for this instead.
I don’t think I really agree with this summary. Your main justification was that Eliezer used the term with an extremely broad definition on Arbital, but the Arbital page was written way after a bunch of other usage (including after me moving to ai-alignment.com I think). I think very few people at the time would have argued that e.g. “getting your AI to be better at politics so it doesn’t accidentally start a war” is value alignment though it obviously fits under Eliezer’s definition.
(ETA: actually the Arbital page is old; it just wasn’t indexed by the Wayback Machine and doesn’t come with a date on Arbital itself. So I agree with the point that this post is evidence for an earlier very broad usage.)
I would agree with “some people used it more broadly” but not “clearly had a broader meaning.” Unless “broader meaning” is just “used very vaguely such that there was no agreement about what it means.”
(I don’t think this really matters except for the periodic post complaining about linguistic drift.)
Your main justification was that Eliezer used the term with an extremely broad definition on Arbital, but the Arbital page was written way after a bunch of other usage (including after me moving to ai-alignment.com I think).
But that talk appears to use the narrower meaning, not the crazy broad one from the later Arbital page. Looking at the transcript:
The first usage is “At the point where we say, “OK, this robot’s utility function is misaligned with our utility function. How do we fix that in a way that it doesn’t just break again later?” we are doing AI alignment theory.” Which seems like it’s really about the goal the agent is pursuing.
The subproblems are all about agents having the right goals. And it continuously talks about pointing agents in the right direction when talking informally about what alignment is.
It doesn’t talk about how there are other parts of alignment that Eliezer just doesn’t care about. It really feels like “alignment” is supposed to be understood to mean getting your AI to be not trying to kill you / trying to help you / something about its goals.
The talk doesn’t have any definitions to disabuse you of this apparent implication.
What part of this talk makes it seem clear that alignment is about the broader thing rather than about making an AI that’s not actively trying to kill you?
I say it is a rebrand of the “AI (x-)safety” community. When AI alignment came along we were calling it AI safety, even though what everyone in the community meant all along was really AI existential safety. “AI safety” was (IMO) a somewhat successful bid for more mainstream acceptance that then led to dilution and confusion, necessitating a new term.
I don’t think the history is that important; what’s important is having good terminology going forward. This is also why I stress that I work on AI existential safety.
So I think people should just say what kind of technical work they are doing and “existential safety” should be considered as a social-technical problem that motivates a community of researchers, and used to refer to that problem and that community. In particular, I think we are not able to cleanly delineate what is or isn’t technical AI existential safety research at this point, and we should welcome intellectual debates about the nature of the problem and how different technical research may or may not contribute to increasing x-safety.
Eliezer used “AI alignment” as early as 2016 and ai-alignment.com wasn’t registered until 2017. Any other usage of the term that potentially predates Eliezer?
FWIW, I didn’t mean to kick off a historical debate, which seems like probably not a very valuable use of y’all’s time.