> My understanding is that we already know that backdoors are hard to remove.
We don’t actually find that backdoors are always hard to remove!
We did already know that backdoors often (from the title) “Persist Through Safety Training.” This phenomenon, studied here and elsewhere, is being taken as the main update in favor of AI x-risk. This doesn’t establish the probability of the hazard, but it reminds us that backdoor hazards can persist if present.
I think it’s very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and harder to argue it would arise naturally. AI security researchers such as Carlini et al. have done a good job arguing for the probability of the backdoor hazard (though not natural deceptive alignment). (I think malicious actors unleashing rogue AIs is a concern for the reasons bio GCRs are a concern; if one does it, it could be devastating.)
I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout’s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic. (I’ve been suggesting more attention to backdoors since maybe 2019; here’s a video from a few years ago about the topic; we’ve also run competitions at NeurIPS with thousands of submissions on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don’t have the window dressing.
I think AI security-related topics have a very good track record of being relevant for x-risk (backdoors, unlearning, adversarial robustness). It’s been a better portfolio than the EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.). At a high level, its staying power comes from the fact that AI security is largely about extreme reliability; extreme reliability is not automatically provided by scaling, but most other desiderata are (e.g., commonsense understanding of what people like and dislike).
A request: Could Anthropic employees not call supervised fine-tuning and related techniques “safety training?” OpenAI/Anthropic have made “alignment” in the ML community become synonymous with fine-tuning, which is a big loss. Calling this “alignment training” consistently would help reduce the watering down of the word “safety.”
> I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout’s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic. (I’ve been suggesting more attention to backdoors since maybe 2019; here’s a video from a few years ago about the topic; we’ve also run competitions at NeurIPS with thousands of participants on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don’t have the window dressing.
> I think AI security-related topics have a very good track record of being relevant for x-risk (backdoors, unlearning, adversarial robustness). It’s been a better portfolio than the EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.)
I basically agree with all of this, but it seems important to distinguish between the community on LW (ETA: in terms of what gets karma), individual researchers or organizations, and various other more specific clusters.
More specifically, on:
> EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.)
I think all this stuff does seem probably lame/bad/useless, but I think there was work which looks ok-ish over this period. In particular, I think some work on scalable oversight (debate, decomposition, etc.) looks pretty decent. I think mech interp also looks ok, though not very good (and feature visualization in particular seems pretty weak). Additionally, the AI x-risk community writ large was interested in various things related to finding adversarial attacks and establishing robustness.
Whether you think the portfolio looked good will really depend on who you’re looking at. I think that the positions of Open Phil, Paul Christiano, Jan Leike, and Jacob Steinhardt all seem to look pretty good from my perspective. I would have said that this is the central AI x-risk community (though it might form a small subset of the people interested in AI safety on LW, which drives various random engagement metrics).
I think a representative sample might be this 2021 OpenPhil RFP. I think stuff here has aged pretty well, though it still is missing a bunch of things which now seem pretty good to me.
My overall take is something like:
Almost all empirical work done to date by the AI x-risk community seems pretty lame/weak.
Some empirical work done by the AI x-risk community is pretty reasonable. And the rate of good work directly targeting AI x-risk is probably increasing somewhat.
Academic work in various applicable fields (which isn’t specifically targeted at AI x-risk) looks ok for reducing x-risk, but not amazing. ETA: academic work which claims to be safety-related seems notably weaker at reducing AI x-risk than historical AI x-risk-focused work, though I think both suck at reducing AI x-risk in an absolute sense relative to what seems possible.
There is probably a decent amount of alpha in carefully thinking through AI x-risk and what empirical research can be done to mitigate this specifically, but this hasn’t clearly looked amazing historically. I expect that many adjacent-ish academic fields would be considerably better for reducing AI x-risk with better targeting based on careful thinking (but in practice, maybe most people are very bad at this type of conceptual thinking, so probably they should just ignore this and do things which seem sorta-related based on high-level arguments).
The general AI x-risk community on LW (ETA: in terms of what gets karma) has pretty bad takes overall and also seems to engage in pretty sloppy stuff. Both sloppy ML research (by the standards of normal ML research) and sloppy reasoning. I think it’s basically intractable to make the AI x-risk community on LW good (or at least very hard), so I think we should mostly give up on this and try instead to carve out sub-groups with better views. It doesn’t seem intractable from my perspective to try to make the Anthropic/OpenAI alignment teams have reasonable views.
> The general AI x-risk community on LW has pretty bad takes overall and also seems to engage in pretty sloppy stuff. Both sloppy ML research (by the standards of normal ML research) and sloppy reasoning.
There are few things that people seem as badly calibrated on as “the beliefs of the general LW community”. Mostly people cherry-pick random low-karma people they disagree with if they want to present it in a bad light, or cherry-pick the people they work with every day if they want to present it in a good light.
You yourself are among the most active commenters in the “AI x-risk community on LW”. It seems very weird to ascribe a generic “bad takes overall” summary to that group, given that you yourself are directly part of it.
Seems fine for people to use whatever identifiers they want for a conversation like this, and I am not going to stop it, but the above sentences seemed like pretty confused generalizations.
> You yourself are among the most active commenters in the “AI x-risk community on LW”.
Yeah, lol, I should maybe be commenting less.
> It seems very weird to ascribe a generic “bad takes overall” summary to that group, given that you yourself are directly part of it.
I mean, I wouldn’t really want to identify as part of “the AI x-risk community on LW” in the same way I expect you wouldn’t want to identify as “an EA” despite relatively often doing things heavily associated with EAs (e.g., posting on the EA forum).
I would broadly prefer people don’t use labels which place me in particular in any community/group that I seem vaguely associated with, and I generally try to extend the same to other people (note that I’m talking about some claim about the aggregate attention of LW, not necessarily any specific person).
> I mean, I wouldn’t really want to identify as part of “the AI x-risk community on LW” in the same way I expect you wouldn’t want to identify as “an EA” despite relatively often doing things heavily associated with EAs (e.g., posting on the EA forum).
Yeah, to be clear, that was like half of my point. A very small fraction of top contributors identify as part of a coherent community. Trying to summarize their takes as if they did is likely to end up confused.
LW is very intentionally designed and shaped so that you don’t need to have substantial social ties or need to become part of a community to contribute (and I’ve made many pretty harsh tradeoffs in that direction over the years).
Inasmuch as some people do, I don’t think it makes sense to give their beliefs outsized weight when trying to think about LW’s role as a discourse platform. The vast majority of top contributors are as allergic to labels as you are.
> It seems very weird to ascribe a generic “bad takes overall” summary to that group, given that you yourself are directly part of it.
This sentence channels the influence of an evaporative cooling norm (upon observing bad takes, either leave the group or conspicuously ignore the bad takes), and also places weight on acting on the basis of one’s identity. (I’m guessing this is not in tune with your overall stance, but it’s evidence of the presence of a generator for the idea.)
I was just referring to “what gets karma on LW”. Obviously, unclear how much we should care.
Makes sense. I think generalizing from “what gets karma on LW” to “what the people thinking most about AI x-risk on LW think is important” is pretty fraught (especially at the upper end, where karma is mostly a broad popularity measure).
I think using the results of the annual review is a lot better, and IMO the top alignment posts in past reviews have mostly pretty good takes in them (my guess is also by your lights), and the ones that don’t have reviews poking at the problems pretty well. My guess is you would still have lots of issues with posts scoring highly in the review, but I would be surprised if you would summarize the aggregate as “pretty bad takes”.
> I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout’s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic.
At least for relative newcomers to the field, deciding what to pay attention to is a challenge, and using the window-dressed/EA-coded heuristic seems like a reasonable way to prune the search space. The base rate of relevance is presumably higher than in the set of all research areas.
Since a big proportion will always be newcomers, this means the community will under- or overweight various areas, but I’m not sure that newcomers dropping the heuristic would lead to better results.
Senior people directing the attention of newcomers towards relevant uncoded research areas is probably the only real solution.