How the AI safety technical landscape has changed in the last year, according to some practitioners
I asked the Constellation Slack channel how the technical AIS landscape has changed since I last spent substantial time in the Bay Area (September 2023), and I figured it would be useful to post this (with the contributors' permission to post either with or without attribution). Curious whether commenters agree or would propose additional changes!
This conversation has been lightly edited to preserve anonymity.
Me: One reason I wanted to spend a few weeks in Constellation was to sort of absorb-through-osmosis how the technical AI safety landscape has evolved since I last spent substantial time here in September 2023, but it seems more productive to just ask here “how has the technical AIS landscape evolved since September 2023?” and then have conversations armed with that knowledge. The flavor of this question is like, what are the technical directions and strategies people are most excited about, do we understand any major strategic considerations differently, etc—interested both in your own updates and your perceptions of how the consensus has changed!
Zach Stein-Perlman: Control is on the rise
Anonymous 1: There are much better “model organisms” of various kinds of misalignment, e.g. the stuff Anthropic has published, some unpublished Redwood work, and many other things
Neel Nanda: Sparse Autoencoders are now a really big deal in mech interp and where a lot of the top teams are focused, and I think are very promising, but have yet to conclusively prove themselves at beating baselines in a fair fight on a real world task
Neel Nanda: Dangerous capability evals are now a major focus of labs, governments and other researchers, and there’s clearer ways that technical work can directly feed into governance
(I think this was happening somewhat pre September, but feels much more prominent now)
Anonymous 2: Lots of people (particularly at labs/AISIs) are working on adversarial robustness against jailbreaks, in part because of RSP commitments/commercial motivations. I think there’s more of this than there was in September.
Anonymous 1: Anthropic and GDM are both making IMO very sincere and reasonable efforts to plan for how they’ll make safety cases for powerful AI.
Anonymous 1: In general, there’s substantially more discussion of safety cases
Anonymous 2: Since September, a bunch of many-author scalable oversight papers have been published, e.g. this, this, this. I haven’t been following this work closely enough to have a sense of what update one should make from this, and I’ve heard rumors of unsuccessful scalable oversight experiments that never saw the light of day, which further muddies things
Anonymous 3: My impression is that infosec-flavoured things are a top ~3 priority area for a few more people in Constellation than last year (maybe twice as many people as last year??).
Building cyberevals and practically securing model weights at frontier labs seem to be the main project areas people are excited about (followed by various kinds of threat modelling and security standards).
It feels to me like one of the biggest changes has been something like “governments seem much more concerned about AI risks than they were last year, and this shift happened somewhat suddenly and unexpectedly”.
A more subjective take is something like “the major labs do not seem to be pushing for policies that would meaningfully curb race dynamics. Instead, they seem to be rallying around voluntary commitments to engage in dangerous capability evaluations & apply safeguards, in which lab leadership determines whether such safeguards are sufficient.” (I think the steelman of this is “but this could help us get binding legislation in which e.g. governments or third-party auditors end up evaluating safety cases”, but in the absence of any public calls for this, my default assumption is that labs would oppose such a scheme.)
It’s somewhat interesting to me that no one mentioned these (though in fairness the sample size is pretty low). I wonder if part of this reflects the fact that Constellation is geographically/culturally in the Bay Area (whereas the major centers of “government governance” are DC and London), and also that Constellation has (I think?) maintained more of a “work with labs, maintain good relationships with labs, and focus on plans that could inform labs” vibe.
I agree that that’s the most important change, and that there’s reason to think people in Constellation/the Bay Area in general might systematically under-attend to policy developments. But I think the most likely explanation for the responses concentrating on other things is that I explicitly asked about technical developments that I missed because I wasn’t in the Bay, and the respondents generally have the additional context that I work in policy and live in DC, so responses that centered on policy change would have been off-target.
I’d be interested in more information about how Anthropic and GDM are thinking about safety cases (or more details from people who are excited or unexcited about this work).
I don’t know the exact dates, but: a) proof-based methods seem to be receiving a lot of attention, b) def/acc is becoming more of a thing, c) there’s more focus on concentration-of-power risk (tbh, while there are real risks here, I suspect most work here is net-negative)
Why do you consider most work on concentration-of-power risk net-negative?
Super terse answer:
Because most people do stuff like try to increase the number of companies in the space.
And even though AI isn’t like nukes yet, at some point it will be.
Just like you wouldn’t want as many companies building nukes as possible—you’d either want a few highly vetted companies or a government effort—you don’t want as many companies building AGI as possible.