I think a really substantial fraction of people who are doing “AI Alignment research” are instead acting with the primary aim of “make AI Alignment seem legit”. These are not the same goal. A lot of good people can tell, and this makes them feel kind of deceived. It also creates very messy dynamics within the field, where people have strong opinions about what the secondary effects of research are, because that’s the primary thing they are interested in, instead of asking whether the research points towards useful, true things for actually aligning the AI.
This doesn’t feel right to me; off the top of my head, it seems like most of the field is just trying to make progress. For most of those that aren’t, it feels like they are pretty explicit about not trying to solve alignment, and I’m also excited about most of their projects. I’d guess something like 10-20% of the field is in the “make alignment seem legit” camp. My rough categorization:
Make alignment progress:
Anthropic Interp
Redwood
ARC Theory
Conjecture
MIRI
Most independent researchers that I can think of (e.g. John, Vanessa, Steven Byrnes, the MATS people I know)
Some of the safety teams at OpenAI/DM
Aligned AI
Team Shard
Make alignment seem legit:
CAIS (safe.ai)
Anthropic scaring laws
ARC Evals (arguably, but it seems like this isn’t quite the main aim)
Some of the safety teams at OpenAI/DM
Open Phil (I think I’d consider Cold Takes to be doing this, but it doesn’t exactly brand itself as alignment research)
What am I missing? I would be curious which projects you feel this way about.
This list seems partially right, though I would basically put all of Deepmind in the “make legit” category (I think they are genuinely well-intentioned about this, but I’ve had long disagreements with e.g. Rohin about this in the past). As a concrete example of this, whose effects I actually quite like, think of the specification gaming list. I think the second list is missing a bunch of names and instances, in particular a lot of people in different parts of academia, and a lot of people whose work is less core “AINotKillEveryonism” flavored.
Like, let’s take “Anthropic Capabilities” for example, which is what the majority of people at Anthropic work on. Why are they working on it?
They are working on it partially because it gives Anthropic access to state-of-the-art models to do alignment research on, but I think in even greater part they are doing it because it gives them a seat at the table with the other AI capabilities orgs and makes their work seem legitimate to them, which enables them both to be involved in shaping how AI develops and to have influence over these other orgs.
I think this goal isn’t crazy, but I do get a sense that the overall strategy for Anthropic is very much not “we are trying to solve the alignment problem” and much more “we are trying to somehow get into a position of influence and power in the AI space so that we can then steer humanity in directions we care about”, while also doing alignment research but thinking that most of their effect on the world doesn’t come from the actual alignment research they produce. (I do appreciate that Anthropic does less pretending that it is just doing the first thing, which I think is better.)
I also disagree with you on “most independent researchers”. I think the people you list definitely have that flavor, but at least in my LTFF work we’ve funded more people whose primary plan was something much closer to the “make it seem legit” branch. Indeed, this is basically the most common reason I see people get PhDs, and we funded a lot of those.
I feel confused about Conjecture. I had some specific run-ins with them that indeed felt among the worst offenders of trying to primarily optimize for influence, but some of the people seem genuinely motivated by making progress. I currently think it’s a mixed bag.
I could list more, but this feels like a weird context in which to give my takes on everyone’s AI Alignment research, and it seems like it would benefit from some more dedicated space. Overall, my sense is that in terms of funding and full-time people, things are skewed around 70/30 in favor of “make legit”, though I do think there are a lot of great people who are genuinely trying to solve the problem.
(I realize this is straying pretty far from the intent of this post, so feel free to delete this comment)
I totally agree that a non-trivial portion of DeepMind’s work (and especially my work) is in the “make legit” category, and I stand by that as a good thing to do, but putting all of it there seems pretty wild. Going off of a list I previously wrote about DeepMind work (this comment):
We do a lot of stuff, e.g. of the things you’ve listed, the Alignment / Scalable Alignment Teams have done at least some work on the following since I joined in late 2020:
Eliciting latent knowledge (see ELK prizes, particularly the submission from Victoria Krakovna & Vikrant Varma & Ramana Kumar)
(Note that since then the mechanistic interpretability team published Tracr.)
Of this, I think “examples of goal misgeneralization” is primarily “make alignment legit”, while everything else is about making progress on alignment. (I see the conceptual progress towards specifically naming and describing goal misgeneralization as progress on alignment, but that was mostly finished within the community by the time we were working on the examples.)
(Some of the LLM alignment work and externalized reasoning oversight work has aspects of “making alignment legit” but it also seems like progress on alignment—in particular I think I learn new empirical facts about how well various techniques work from both.)
I think the actual crux here is how useful the various empirical projects are, where I expect you (and many others) think “basically useless” while I don’t.
In terms of fraction of effort allocated to “make alignment legit”, I think it’s currently about 10% of the Alignment and Scalable Alignment teams, and it was more like 20% while the goal misgeneralization project was going on. (This is not counting LLM alignment and externalized reasoning oversight as “make alignment legit”.)
I mean, I think my models here come literally from conversations with you, where I am pretty sure you have said things like (paraphrased) “basically all the work I do at Deepmind and the work of most other people I work with at Deepmind is about ‘trying to demonstrate the difficulty of the problem’ and ‘convincing other people at Deepmind the problem is real’”.
In as much as you are now claiming that is only 10%-20% of the work, that would be extremely surprising to me and I do think would really be in pretty direct contradiction with other things we have talked about.
Like, yes, of course if you want to do field-building and want to get people to think AI Alignment is real, you will also do some alignment research. But I am talking about the balance of motivations, not the total balance of work. My sense is most of the motivation for people at the Deepmind teams comes from people thinking about how to get other people at Deepmind to take AI Alignment seriously. I think that’s a potentially valuable goal, but indeed it is also the kind of goal that often gets represented as someone just trying to make direct progress on the problem.
Hmm, this is surprising. Some claims I might have made that could have led to this misunderstanding, in order of plausibility:
[While I was working on goal misgeneralization] “Basically all the work that I’m doing is about convincing other people that the problem is real”. I might have also said something like “and most people I work with” intending to talk about my collaborators on goal misgeneralization rather than the entire DeepMind safety team(s); for at least some of the time that I was working on goal misgeneralization I was an individual contributor so that would have been a reasonable interpretation.
“Most of my past work hasn’t made progress on the problem”—this would be referring to papers that I started working on before believing that scaled up deep learning could lead to AGI without additional insights, which I think ended up solving the wrong problem because I had a wrong model of what the problem was. (But I wouldn’t endorse “I did this to make alignment legit”, I was in fact trying to solve the problem as I saw it.) (I also did lots of conceptual work that I think did make progress but I have a bad habit of using phrases like “past work” to only mean papers.)
“[Particular past work] didn’t make progress on the problem, though it did explain a problem well”—seems very plausible that I said this about some past DeepMind work.
I do feel pretty surprised if, while I was at DeepMind, I ever intended to make the claim that most of the DeepMind safety team(s) were doing work based on a motivation that was primarily about demonstrating difficulty / convincing other people. (Perhaps I intentionally made such a claim while I wasn’t at DeepMind; seems a lot easier for me to have been mistaken about that before I was actually at DeepMind, but honestly I’d still be pretty surprised.)
My sense is most of the motivation for people at the Deepmind teams comes from people thinking about how to get other people at Deepmind to take AI Alignment seriously.
Idk how you would even theoretically define a measure for this that I could put numbers on, but I feel like if you somehow did do it, I’d probably think it was <50% and >10%.
[While I was working on goal misgeneralization] “Basically all the work that I’m doing is about convincing other people that the problem is real”. I might have also said something like “and most people I work with” intending to talk about my collaborators on goal misgeneralization rather than the entire DeepMind safety team(s); for at least some of the time that I was working on goal misgeneralization I was an individual contributor so that would have been a reasonable interpretation.
This seems like the most likely explanation. Decent chance I interpreted “and most people I work with” as referring to the rest of the Deepmind safety team.
I still feel confused about some stuff, but I am happy to let things stand here.
fyi your phrasing here is different from how I initially interpreted “make AI safety seem legit”.
like there are maybe a few things someone might mean if they say “they’re working on AI Alignment research”:
they are pushing forward the state of the art of deep alignment understanding
they are orienting to the existing field of alignment research / upskilling
they are conveying to other AI researchers “here is what the field of alignment is, and why it is important”
they are trying to make AI alignment feel high status, so that they feel safe in their career and social network, while also getting to feel important
(and of course people can be doing a mixture of the above, or 5th options I didn’t list)
I interpreted you initially as saying #4, but it sounds like you/Rohin here are talking about #3. There are versions of #3 that are secretly just #4 without much theory-of-change, but, idk, I think Rohin’s stated goal here is just pretty reasonable and definitely something I want in my overall AI Alignment Field portfolio. I agree you should avoid accidentally conflating it with #1.
(i.e. this seems related to a form of research-debt, albeit focused on bridging the gap between one field and another, rather than improving intra-field research debt)
Yep, I am including 3 in this. I also think this is something pretty reasonable for someone in the field to do, but when most of your field is doing that, I think quite crazy and bad things happen, and it’s also very easy to slip into doing 4 instead.
They are working on it partially because it gives Anthropic access to state-of-the-art models to do alignment research on, but I think in even greater part they are doing it because it gives them a seat at the table with the other AI capabilities orgs and makes their work seem legitimate to them, which enables them both to be involved in shaping how AI develops and to have influence over these other orgs.
...Am I crazy or is this discussion weirdly missing the third option of “They’re doing it because they want to build a God-AI and ‘beat the other orgs to the punch’”? That is completely distinct from signaling competence to other AGI orgs or getting yourself a “seat at the table” and it seems odd to categorize the majority of Anthropic’s aggslr8ing as such.
It seems to me like one (often obscured) reason for the disagreement between Thomas and Habryka is that they are thinking about different groups of people when they define “the field.”
To assess the % of “the field” that’s doing meaningful work, we’d want to do something like [# of people doing meaningful work]/[total # of people in the field].
Who “counts” in the denominator? Should we count anyone who has received a grant from the LTFF with the word “AI safety” in it? Only the ones who have contributed object-level work? Only the ones who have contributed object-level work that passes some bar? Should we count the Anthropic capabilities folks? Just the EAs who are working there?
My guess is that Thomas was using a more narrowly defined denominator (e.g., not counting most people who got LTFF grants and went off to do PhDs without contributing object-level alignment stuff; not counting most Anthropic capabilities researchers who have never-or-minimally engaged with the AIS community), whereas Habryka was using a more broadly defined denominator.
I’m not certain about this, and even if it’s true, I don’t think it explains the entire effect size. But I wouldn’t be surprised if roughly 10-30% of the difference between Thomas and Habryka came from unstated assumptions about who “counts” in the denominator (the toy sketch below illustrates how much this choice alone can move the number).
(My guess is that this also explains “vibe-level” differences to some extent. I think some people who look out into the community and think “yeah, I think people here are pretty reasonable and actually trying to solve the problem and I’m impressed by some of their work” are often defining “the community” more narrowly than people who look out into the community and think “ugh, the community has so much low-quality work and has a bunch of people who are here to gain influence rather than actually try to solve the problem.”)
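To make the denominator point concrete, here is a minimal toy sketch. Every headcount in it is made up purely for illustration (these are not estimates of the actual field); it only shows how the same number of people doing object-level progress work can look like most of the field or a small minority depending on who gets counted.

```python
# Toy illustration only: every number here is hypothetical.
def progress_fraction(progress_people: int, total_field: int) -> float:
    """Fraction of 'the field' doing direct alignment-progress work."""
    return progress_people / total_field

progress_people = 60   # hypothetical count of people doing object-level progress work

# Narrow denominator: only people actively producing object-level alignment work.
narrow_field = 80
# Broad denominator: also counts grantees who never shipped object-level work,
# capabilities staff at safety-branded orgs, etc.
broad_field = 250

print(f"Narrow field: {progress_fraction(progress_people, narrow_field):.0%} making progress")
print(f"Broad field:  {progress_fraction(progress_people, broad_field):.0%} making progress")
# Narrow field: 75% making progress
# Broad field:  24% making progress
```

Under these made-up numbers, switching denominators alone moves the “making progress” share from 75% to 24%, which is in the same ballpark as the gap between the two estimates above.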
This sounds like a solid explanation for the difference for someone totally uninvolved with the Berkeley scene.
Though I’m surprised there’s no broad consensus on even basic things like this in 2023.
In game terms, if everyone keeps their own score separately then it’s no wonder a huge portion of effort will, in aggregate, go towards min-maxing the score tracking meta-game.
Something ~ like ‘make it legit’ has been and possibly will continue to be a personal interest of mine.
I’m posting this after Rohin entered this discussion, so Rohin, I hope you don’t mind me quoting you like this. But fwiw I was significantly influenced by this comment on Buck’s old talk transcript ‘My personal cruxes for working on AI safety’. (Rohin’s comment is repeated here in full; please bear in mind this is 3 years old, and his views have surely developed and potentially moved a lot since then:)
I enjoyed this post, it was good to see this all laid out in a single essay, rather than floating around as a bunch of separate ideas.
That said, my personal cruxes and story of impact are actually fairly different: in particular, while this post sees the impact of research as coming from solving the technical alignment problem, I care about other sources of impact as well, including:
1. Field building: Research done now can help train people who will be able to analyze problems and find solutions in the future, when we have more evidence about what powerful AI systems will look like.
2. Credibility building: It does you no good to know how to align AI systems if the people who build AI systems don’t use your solutions. Research done now helps establish the AI safety field as the people to talk to in order to keep advanced AI systems safe.
3. Influencing AI strategy: This is a catch all category meant to include the ways that technical research influences the probability that we deploy unsafe AI systems in the future. For example, if technical research provides more clarity on exactly which systems are risky and which ones are fine, it becomes less likely that people build the risky systems (nobody _wants_ an unsafe AI system), even though this research doesn’t solve the alignment problem.
As a result, cruxes 3-5 in this post would not actually be cruxes for me (though 1 and 2 would be).
I still endorse that comment, though I’ll note that it argues for the much weaker claims of:
I would not stop working on alignment research if it turned out I wasn’t solving the technical alignment problem
There are useful impacts of alignment research other than solving the technical alignment problem
(As opposed to something more like “the main thing you should work on is ‘make alignment legit’”.)
(Also I’m glad to hear my comments are useful (or at least influential), thanks for letting me know!)
Can we adopt a norm of calling this Safe.ai? When I see “CAIS”, I think of Drexler’s “Comprehensive AI Services”.
Oh now the original comment makes more sense, thanks for this clarification.
+1 I was really really upset safe.ai decided to use an established acronym for something very different
Could someone explain exactly what “make AI alignment seem legit” means in this thread? I’m having trouble understanding from context.
“Convince people building AI to utilize AI alignment research”?
“Make the field of AI alignment look serious/professional/high-status”?
“Make it look like your own alignment work is worthy of resources”?
“Make it look like you’re making alignment progress even if you’re not”?
A mix of these? Something else?
Yeah, all four of those are real things happening, and are exactly the sorts of things I think the post has in mind.
I take “make AI alignment seem legit” to refer to a bunch of actions that are optimized to push public discourse and perceptions around. Here’s a list of things that come to my mind:
Trying to get alignment research to look more like a mainstream field, by e.g. funding professors and PhD students who frame their work as alignment and giving them publicity, organizing conferences that try to rope in existing players who have perceived legitimacy, etc
Papers like Concrete Problems in AI Safety that try to tie AI risk to stuff that’s already in the Overton window / already perceived as legitimate
Optimizing language in posts / papers to be perceived well, by e.g. steering clear of the part where we’re worried AI will literally kill everyone
Efforts to make it politically untenable for AI orgs to not have some narrative around safety
Each of these things seems like it has a core good thing, but according to me they’ve all backfired to the extent that they were optimized to avoid the thorny parts of AI x-risk, because this enables rampant goodharting. Specifically, I think the effects of avoiding the core stuff have been bad: creating weird cargo cults around alignment research, making it easier for orgs to have fake narratives about how they care about alignment, and so on.
Personally, I think “Discovering Language Model Behaviors with Model-Written Evaluations” is most valuable because of what it demonstrates from a scientific perspective, namely that RLHF and scale make certain forms of agentic behavior worse.