This seems quite promising to me. My primary concern is that I feel like NIST is really not an institution well-suited to house alignment research.
My current understanding is that NIST is generally a very conservative organization, with the primary remit of establishing standards across a variety of industries and sciences. These standards almost always concern things that are extremely objective and grounded in very well-established science.
In contrast, AI Alignment seems to me to still be in a highly pre-paradigmatic state, and the standards I can imagine us developing there seem qualitatively very different from the other standards NIST has historically been in charge of developing. It seems to me that NIST is not a good vehicle for the kind of cutting-edge research, conducted in an extremely uncertain environment, that things like alignment and evals research require.
Maybe other people have a different model of different US governmental institutions, but at least to me NIST seems like a bad fit for the kind of work that I expect Paul to be doing there.
Why are you talking about alignment research? I don’t see any evidence that he’s planning to do any alignment research in this role, so it seems misleading to talk about NIST being a bad place to do it.
Huh, I guess my sense is that in order to develop the standards talked about above you need to do a lot of cutting-edge research.
I guess my sense from the description above is that we would be talking about research pretty similar to what METR is doing, which seems pretty open-ended and pre-paradigmatic to me. But I might be misunderstanding his role.
I’m not saying you don’t need to do cutting-edge research, I’m just saying that it’s not what people usually call alignment research.

I would call what METR does alignment research, but I’m also fine using a different term for it. I’ve mostly been using it synonymously with “AI Safety Research”, which I know you object to, but I do think that’s how it’s normally used (and the relevant aspect here is the pre-paradigmaticity of the relevant research, which I continue to think applies independently of the bucket you put it into).
I do think it’s marginally good to make “AI Alignment Research” mean something narrower, so I’m supportive here of getting me to use something broader like “AI Safety Research”, but I don’t really think that changes my argument in any relevant way.

Yeah, I object to using the term “alignment research” to refer to research that investigates whether models can do particular things.

But all the terminology options here are somewhat fucked imo, I probably should have been more chill about you using the language you did, sorry.
It seems like we have significant need for orgs like METR or the DeepMind dangerous capabilities evals team trying to operationalise these evals, but also regulators with authority building on that work to set them as explicit and objective standards. The latter feels maybe more practical for NIST to do, especially under Paul?
Buck, could you (or habryka) elaborate on this? What does Buck call the set of things that ARC Theory and METR (formerly known as ARC Evals) do, “AI control research”?
My understanding is that while Redwood clearly does control research, METR evals seem more of an attempt to demonstrate dangerous capabilities than help with control. I haven’t wrapped my head around ARC’s research philosophy and output to confidently state anything.
I normally use “alignment research” to mean “research into making models be aligned, e.g. not performing much worse than they’re ‘able to’ and not purposefully trying to kill you”. By this definition, ARC is alignment research; METR and Redwood aren’t.
An important division between Redwood and METR is that we focus a lot more on developing/evaluating countermeasures.
FWIW, I typically use “alignment research” to mean “AI research aimed at making it possible to safely do ambitious things with sufficiently-capable AI” (with an emphasis on “safely”). So I’d include things like Chris Olah’s interpretability research, even if the proximate impact of this is just “we understand what’s going on better, so we may be more able to predict and finely control future systems” and the proximate impact is not “the AI is now less inclined to kill you”.
Some examples: I wouldn’t necessarily think of “figure out how we want to airgap the AI” as “alignment research”, since it’s less about designing the AI, shaping its mind, predicting and controlling it, etc., and more about designing the environment around the AI.
But I would think of things like “figure out how to make this AI system too socially-dumb to come up with ideas like ‘maybe I should deceive my operators’, while keeping it superhumanly smart at nanotech research” as central examples of “alignment research”, even though it’s about controlling capabilities (‘make the AI dumb in this particular way’) rather than about instilling a particular goal into the AI.
And I’d also think of “we know this AI is trying to kill us; let’s figure out how to constrain its capabilities so that it keeps wanting that, but is too dumb to find a way to succeed in killing us, thereby forcing it to work with us rather than against us in order to achieve more of what it wants” as a pretty central example of alignment research, albeit not the sort of alignment research I feel optimistic about. The way I think about the field, you don’t have to specifically attack the “is it trying to kill you?” part of the system in order to be doing alignment research; there are other paths, and alignment researchers should consider all of them and focus on results rather than marrying a specific methodology.
It’s pretty sad to call all of these end states you describe “alignment”, since “alignment” is an extremely natural word for “actually terminally has good intentions”. So, this makes me sad to call this alignment research. Of course, this type of research may be instrumentally useful for making AIs more aligned, but so will a bunch of other stuff (e.g. earning to give).
Fair enough if you think we should just eat this terminology issue and then coin a new term like “actually real-alignment-targeting-directly alignment research”. Idk what the right term is obviously.
It’s pretty sad to call all of these end states you describe “alignment”, since “alignment” is an extremely natural word for “actually terminally has good intentions”.
Aren’t there a lot of clearer words for this? “Well-intentioned”, “nice”, “benevolent”, etc.
(And a lot of terms, like “value loading” and “value learning”, that are pointing at the research project of getting good intentions into the AI.)
To my ear, “aligned person” sounds less like “this person wishes the best for me”, and more like “this person will behave in the right ways”.
If I hear that Russia and China are “aligned”, I do assume that their intentions play a big role in that, but I also assume that their circumstances, capabilities, etc. matter too. Alignment in geopolitics can be temporary or situational, and it almost never means that Russia cares about China as much as China cares about itself, or vice versa.
And if we step back from the human realm, an engineered system can be “aligned” in contexts that have nothing to do with goal-oriented behavior, but are just about ensuring components are in the right place relative to each other.
Cf. the history of the term “AI alignment”. From my perspective, a big part of why MIRI coordinated with Stuart Russell to introduce the term “AI alignment” was that we wanted to switch away from “Friendly AI” to a term that sounded more neutral. “Friendly AI research” had always been intended to subsume the full technical problem of making powerful AI systems safe and aimable; but emphasizing “Friendliness” made it sound like the problem was purely about value loading, so a more generic and low-content word seemed desirable.
But Stuart Russell (and later Paul Christiano) had a different vision in mind for what they wanted “alignment” to be, and MIRI apparently failed to communicate and coordinate with Russell to avoid a namespace collision. So we ended up with a messy patchwork of different definitions.
I’ve basically given up on trying to achieve uniformity on what “AI alignment” is; the best we can do, I think, is clarify whether we’re talking about “intent alignment” vs. “outcome alignment” when the distinction matters.
But I do want to push back against those who think outcome alignment is just an unhelpful concept — on the contrary, if we didn’t have a word for this idea I think it would be very important to invent one.
IMO it matters more that we keep our eye on the ball (i.e., think about the actual outcomes we want and keep researchers’ focus on how to achieve those outcomes) than that we define an extremely crisp, easily-packaged technical concept (that is at best a loose proxy for what we actually want). Especially now that ASI seems nearer at hand (so the need for this “keep our eye on the ball” skill is becoming less and less theoretical), and especially now that ASI disaster concerns have hit the mainstream (so the need to “sell” AI risk has diminished somewhat, and the need to direct research talent at the most important problems has increased).
And I also want to push back against the idea that a priori, before we got stuck with the current terminology mess, it should have been obvious that “alignment” is about AI systems’ goals and/or intentions, rather than about their behavior or overall designs. I think intent alignment took off because Stuart Russell and Paul Christiano advocated for that usage and encouraged others to use the term that way, not because this was the only option available to us.
Aren’t there a lot of clearer words for this? “Well-intentioned”, “nice”, “benevolent”, etc.
Fair enough. I guess it just seems somewhat incongruous to say: “Oh yes, the AI is aligned. Of course it might desperately crave murdering all of us in its heart (we certainly haven’t ruled this out with our current approach), but it is aligned because we’ve made it so that it wouldn’t get away with it if it tried.”
Sounds like a lot of political alliances! (And “these two political actors are aligned” is arguably an even weaker condition than “these two political actors are allies”.)
At the end of the day, of course, all of these analogies are going to be flawed. AI is genuinely a different beast.
Personally, I like mentally splitting the space into AI safety (emphasis on measurement and control), AI alignment (getting it to align to the operators’ purposes and actually do what the operators desire), and AI value-alignment (getting the AI to understand and care about what people need and want).
Feels like a Venn diagram with a lot of overlap, and yet some distinct non-overlap spaces.
By my framing, Redwood Research and METR are more centrally AI safety. ARC/Paul’s research agenda is more of a mix of AI safety and AI alignment. MIRI’s work to fundamentally understand and shape agents is a mix of AI alignment and AI value-alignment. Obviously success there would have the downstream effect of robustly improving AI safety (reducing the need for careful evals and control), but it is a more difficult approach in general with less immediate applicability.
I think we need all these things!
Couple of thoughts re: NIST as an institution:

FWIW NIST, both at Gaithersburg and Boulder, is very well-regarded in AMO physics and slices of condensed matter. Bill Phillips (https://www.nist.gov/people/william-d-phillips) is one big name there, but AIUI they do a whole bunch of cool stuff. Which doesn’t say much one way or the other about how the AI Safety Institute will go! But NIST has a record of producing, or at least sheltering, very good work in the general vicinity of their remit.
Don’t knock metrology! It’s really, really hard, subtle, and creative, and you all of a sudden have to worry about things nobody has ever thought about before! I’m a condensed matter theory guy, more or less, but I go to metrology talks every once in a while and have my mind blown.
I agree metrology is cool! But I think units are mostly helpful for engineering insofar as they reflect fundamental laws of nature—see e.g. the metric units—and we don’t have those yet for AI. Until we do, I expect attempts to define them will be vague, high-level descriptions more than deep scientific understanding.
(And I think the former approach has a terrible track record, at least when used to define units of risk or controllability—e.g. BSL levels, which have failed so consistently and catastrophically they’ve induced an EA cause area, and which for some reason AI labs are starting to emulate).
In what sense did the BSL levels fail consistently or catastrophically?
Even if you think COVID-19 is a lab leak, the BSL levels themselves would have suggested that the BSL-2 precautions Wuhan used for their coronavirus gain-of-function research were not enough.

There have been frequent and severe biosafety accidents for decades, many of which occurred at labs which were attempting to follow BSL protocol.

That doesn’t seem like “consistently and catastrophically”; it seems like “far too often, but with thankfully fairly limited local consequences.”
BSL levels, which have failed so consistently and catastrophically they’ve induced an EA cause area,
This is confused and wrong, in my view. The EA cause area around biorisk is mostly happy to rely on those levels, and unlike for AI, the (very useful) levels predate EA interest and give us something to build on. The questions are largely instead about whether to allow certain classes of research at all, the risks posed by those who intentionally do things that are forbidden, and how new technology changes the risk.
The EA cause area around biorisk is mostly happy to rely on those levels
I disagree—I think nearly all EA’s focused on biorisk think gain of function research should be banned, since the risk management framework doesn’t work well enough to drive the expected risk below the expected benefit. If our framework for preventing lab accidents worked as well as e.g. our framework for preventing plane accidents, I think few EA’s would worry much about GoF.
(Obviously there are non-accidental sources of biorisk too, for which we can hardly blame the safety measures; but I do think the measures work sufficiently poorly that even accident risk alone would justify a major EA cause area).
[Added April 28th: In case someone reads my comment without this context: David has made a number of worthwhile contributions to discussions of biological existential risks (e.g. 1, 2, 3) as well as worked professionally in this area and his contributions on this topic are quite often well-worth engaging with. Here I just intended to add that in my opinion early on in the covid pandemic he messed up pretty badly in one or two critical discussions around mask effectiveness and censoring criticism of the CDC. Perhaps that’s not saying much because the base rate for relevant experts dealing with Covid is also that they were very off-the-mark. Furthermore David’s June 2020 post-mortem of his mistakes was a good public service even while I don’t agree with his self-assessment in all cases. Overall I think his arguments are often well-worth engaging with.]
I’m not in touch with the ground truth in this case, but for those reading along without knowing the context, I’ll mention that it wouldn’t be the first time that David has misrepresented what people in the Effective Altruism Biorisk professional network believe[1].
(I will mention that David later apologized for handling that situation poorly and wasting people’s time[2], which I think reflects positively on him.)
[1] See Habryka’s response to Davidmanheim’s comment here from March 7th 2020, such as this quote:
Overall, my sense is that you made a prediction that people in biorisk would consider this post an infohazard that had to be prevented from spreading (you also reported this post to the admins, saying that we should “talk to someone who works in biorisk at at FHI, Openphil, etc. to confirm that this is a really bad idea”).
We have now done so, and in this case others did not share your assessment (and I expect most other experts would give broadly the same response).

[2] See David’s own June 25th reply to the same comment.
My guess is more that we were talking past each other than that his intended claim was false/unrepresentative. I do think it’s true that EA’s mostly talk about people doing gain of function research as the problem, rather than about the insufficiency of the safeguards; I just think the latter is why the former is a problem.
The OP claimed a failure of BSL levels was the single thing that induced biorisk as a cause area, and I said that was a confused claim. Feel free to find someone who disagrees with me here, but the proximate causes of EAs worrying about biorisk have nothing to do with BSL lab designations. It’s not BSL levels that failed in allowing things like the Soviet bioweapons program, or led to the underfunded and largely unenforceable BWC, or the way that newer technologies are reducing the barriers to terrorists and others being able to pursue bioweapons.
I think we must still be missing each other somehow. To reiterate, I’m aware that there is non-accidental biorisk, for which one can hardly blame the safety measures. But there is also accident risk, since labs often fail to contain pathogens even when they’re trying to.
Having written extensively about it, I promise you I’m aware. But please, tell me more about how this supports the original claim which I have been disagreeing with: that this class of incidents was or is the primary concern of the EA biosecurity community, the one that led to it being a cause area.
I agree there are other problems the EA biosecurity community focuses on, but surely lab escapes are one of those problems, and part of the reason we need biosecurity measures? In any case, this disagreement seems beside the main point that I took Adam to be making, namely that the track record for defining appropriate units of risk for poorly understood, high-attack-surface domains is quite bad (as with BSL). This still seems true to me.
BSL isn’t the thing that defines “appropriate units of risk”; that’s pathogen risk-group levels, and I agree that those are a problem because they focus on pathogen lists rather than actual risks. I actually think BSLs are good at what they do, and the problem is regulation and oversight, which is patchy, as well as transparency, of which there is far too little. But those are issues with oversight, not with the types of biosecurity measure that are available.
This thread isn’t seeming very productive to me, so I’m going to bow out after this. But yes, it is a primary concern—at least in the case of Open Philanthropy, it’s easy to check what their primary concerns are because they write them up. And accidental release from dual use research is one of them.
If you’re appealing to OpenPhil, it might be useful to ask one of the people who was working with them on this as well.

And you’ve now equivocated between “they’ve induced an EA cause area” and a list of the range of risks covered by biosecurity—not what their primary concerns are—citing this as “one of them.” I certainly agree that biosecurity levels are one of the things biosecurity is about, and that “the possibility of accidental deployment of biological agents” is a key issue, but that’s incredibly far removed from the original claim that the failure of BSL levels induced the cause area!
I did not say that they didn’t want to ban things; I explicitly said “whether to allow certain classes of research at all,” and when I said “happy to rely on those levels,” I meant that the idea that we should have “BSL-5” is the kind of silly thing that novice EAs propose that doesn’t make sense, because there literally isn’t something significantly more restrictive other than just banning it.
I also think that “nearly all EA’s focused on biorisk think gain of function research should be banned” is obviously underspecified, and wrong because of the details. Yes, we all think that there is a class of work that should be banned, but tons of work that would be called gain of function isn’t in that class.
the idea that we should have “BSL-5” is the kind of silly thing that novice EAs propose that doesn’t make sense because there literally isn’t something significantly more restrictive
I mean, I’m sure something more restrictive is possible. But my issue with BSL levels isn’t that they include too few BSL-type restrictions, it’s that “lists of restrictions” are a poor way of managing risk when the attack surface is enormous. I’m sure someday we’ll figure out how to gain this information in a safer way—e.g., by running simulations of GoF experiments instead of literally building the dangerous thing—but at present, the best available safeguards aren’t sufficient.
I also think that “nearly all EA’s focused on biorisk think gain of function research should be banned” is obviously underspecified, and wrong because of the details.
I’m confused why you find this underspecified. I just meant “gain of function” in the standard, common-use sense—e.g., that used in the 2014 ban on federal funding for such research.
I mean, I’m sure something more restrictive is possible.
But what? Should we insist that the entire time someone’s inside a BSL-4 lab, we have a second person who is an expert in biosafety visually monitoring them to ensure they don’t make mistakes? Or should their air supply not use filters and completely safe PAPRs, and instead feed them outside air through a tube that restricts their ability to move around? (Edit to add: These are already both required in BSL-4 labs. When I said I don’t know of anything more restrictive they could do, I was being essentially literal—they do everything, including quite a number of unreasonable things, to prevent human infection, short of just not doing the research.)
Or do you have some new idea that isn’t just a ban with more words?
“lists of restrictions” are a poor way of managing risk when the attack surface is enormous
Sure, list-based approaches are insufficient, but they have relatively little to do with biosafety levels of labs; they have to do with risk groups, which are distinct, but often conflated. (So Ebola or smallpox isn’t a “BSL-4” pathogen, because there is no such thing.)
I just meant “gain of function” in the standard, common-use sense—e.g., that used in the 2014 ban on federal funding for such research.
That ban didn’t go far enough, since it only applied to 3 pathogen types, and it wouldn’t have banned what Wuhan was doing with novel viruses, since that work wasn’t with SARS or MERS but with other species of virus. So sure, we could enforce a broader version of that ban, but getting a good definition that’s both extensive enough to prevent dangerous work and that doesn’t ban obviously useful research is very hard.
Is there another US governmental organization that you think would be better suited? My relatively uninformed sense is that there’s no real USFG organization that would be very well-suited for this—and given that, NIST is probably one of the better choices.
I don’t know the landscape of US government institutions that well, but some guesses:
My sense is that DARPA and sub-institutions like IARPA have often pursued substantially more open-ended research that seems more in line with what I expect AI Alignment research to look like.

The US government has many national laboratories that have housed a lot of great science and research. Many of those seem like decent fits: https://www.usa.gov/agencies/national-laboratories
It’s also not super clear to me that research like this needs to be hosted within a governmental institution. Organizations like RAND or academic institutions seem well-placed to host it, and have existing high-trust relationships with the U.S. government.
Something like the UK task force structure also seems reasonable to me, though I don’t think I have a super deep understanding of that either. Of course, creating a whole new structure for something like this is hard (and I do see people in parallel trying to establish a new specialized institution).
The Romney, Reed, Moran, and King framework, whose summary I happened to read this morning, suggests a few options.
It lists NIST together with the Department of Commerce as one of the options, but all the other options also seem reasonable to me, and I think better by my lights. Though I agree none of these seem ideal (besides the creation of a specialized new agency, though of course that will justifiably encounter a bunch of friction, since creating a new agency should have a pretty high bar for happening).
I think this should be broken down into two questions:
Before the EO, if we were asked to figure out where this kind of evals work should happen, what institution would we pick & why?
After the EO, where does it make sense for evals-focused people to work?
I think the answer to #1 is quite unclear. I personally think that there was a strong case that a natsec-focused USAISI could have been given to DHS or DoE or some interagency thing. In addition to the point about technical expertise, it does seem relatively rare for Commerce/NIST to take on something that is so natsec-focused.
But I think the answer to #2 is pretty clear. The EO clearly tasks NIST with this role, and now I think our collective goal should be to try to make sure NIST can execute as effectively as possible. Perhaps there will be future opportunities to establish new places for evals work, alignment work, risk monitoring and forecasting work, emergency preparedness planning, etc etc. But for now, whether we think it was the best choice or not, NIST/USAISI are clearly the folks who are tasked with taking the lead on evals + standards.