AI strategy & governance. ailabwatch.org.
Zach Stein-Perlman (Zachary Stein-Perlman)
Update with two new responses:
I think this is 10 generals, 1 petrov, and one other person (either the other petrov or a citizen, not sure, wasn’t super rigorous)
The post says generals’ names will be published tomorrow.
No. I noticed ~2 more subtle infohazards and I was wishing for nobody to post them and I realized I can decrease that probability by making an infohazard warning.
I ask that you refrain from being the reason that security-by-obscurity fails, if you notice subtle infohazards.
Some true observations are infohazards, making destruction more likely. Please think carefully before posting observations. Even if you feel clever. You can post hashes here instead to later reveal how clever you were, if you need.
LOOSE LIPS SINK SHIPS
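One way to post a hash now and reveal later is a salted commitment. The sketch below is only an illustration of that idea, not anything prescribed by the game: the function names `commit` and `verify`, the choice of SHA-256, and the random nonce are all assumptions.

```python
import hashlib
import secrets


def commit(observation: str) -> tuple[str, str]:
    """Return (digest, nonce). Post the digest; keep the nonce and text private."""
    # Random nonce (an assumption of this sketch) so a short or guessable
    # observation cannot be recovered by hashing candidate strings.
    nonce = secrets.token_hex(16)
    digest = hashlib.sha256(f"{nonce}:{observation}".encode("utf-8")).hexdigest()
    return digest, nonce


def verify(digest: str, nonce: str, observation: str) -> bool:
    """Check a later reveal (nonce plus text) against the digest posted earlier."""
    expected = hashlib.sha256(f"{nonce}:{observation}".encode("utf-8")).hexdigest()
    return secrets.compare_digest(expected, digest)
```

To use it, post only the digest in the thread, then publish the nonce and the original text once the game is over so anyone can run `verify`.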
I think it’s better to be angry at the team that launched the nukes?
No. But I’m skeptical: seems hard to imagine provable safety, much less competitive with the default path to powerful AI, much less how post-hoc evals are relevant.
I ~always want to see the outline when I first open a post and when I’m reading/skimming through it. I wish the outline appeared when-not-hover-over-ing for me.
Figuring out whether an RSP is good is hard.[1] You need to consider what high-level capabilities/risks it applies to, and for each of them determine whether the evals successfully measure what they’re supposed to and whether high-level risk thresholds trigger appropriate responses and whether high-level risk thresholds are successfully operationalized in terms of low-level evals. Getting one of these things wrong can make your assessment totally wrong. Except that’s just for fully-fleshed-out RSPs — in reality the labs haven’t operationalized their high-level thresholds and sometimes don’t really have responses planned. And different labs’ RSPs have different structures/ontologies.
Quantitatively comparing RSPs in a somewhat-legible manner is even harder.
I am not enthusiastic about a recent paper outlining a rubric for evaluating RSPs. Mostly I worry that it crams all of the crucial things—is the response to reaching high capability-levels adequate? are the capability-levels low enough that we’ll reach them before it’s too late?—into a single criterion, “Credibility.” Most of my concern about labs’ RSPs comes from those things just being inadequate; again, if your response is too weak or your thresholds are too high or your evals are bad, it just doesn’t matter how good the rest of your RSP is. (Also, minor: the “Difficulty” indicator punishes a lab for making ambitious commitments; this seems kinda backwards.)
(I gave up on making an RSP rubric myself because it seemed impossible to do decently well unless it’s quite complex and has some way to evaluate hard-to-evaluate things like eval-quality and planned responses.)
(And maybe it’s reasonable for labs to not do so much specific-committing-in-advance.)
- ^
Well, sometimes figuring out that an RSP is bad is easy. So: determining that an RSP is good is hard. (Being good requires lots of factors to all be good; being bad requires just one to be bad.)
Stronger scaffolding could skew evaluation results.
Stronger scaffolding makes evals better.
I think labs should at least demonstrate that their scaffolding is at least as good as some baseline. If there’s no established baseline scaffolding for the eval, they can say “we did XYZ and we got n%; our secret scaffolding does better”; when there is an established baseline scaffolding, they can compare to that (e.g. the best scaffold for SWE-bench Verified is called Agentless; in the o1 system card, OpenAI reported results from running its models in this scaffold).
Footnote to table cells D18 and D19:
My reading of Anthropic’s ARA threshold, which nobody has yet contradicted:
The RSP defines/operationalizes 50% of ARA tasks (10% of the time) as a sufficient condition for ASL-3.
(Sidenote: I think the literal reading is that this is an ASL-3 definition, but I assume it’s supposed to be an operationalization of the safety buffer, 6x below ASL-3.[1])
The May RSP evals report suggests 50% of ARA tasks (10% of the time) is merely a yellow line. (Pages 2 and 6, plus page 4 says “ARA Yellow Lines are clearly defined in the RSP” but the RSP’s ARA threshold was not just a yellow line.)
This is not kosher; Anthropic needs to formally amend the RSP to [raise the threshold / make the old threshold no longer binding].
(It’s totally plausible that the old threshold was too low, but the solution is to raise it officially, not pretend that it’s just a yellow line.)
(The forthcoming RSP update will make old thresholds moot; I’m just concerned that Anthropic ignored the RSP in the past.)
(Anthropic didn’t cross the threshold and so didn’t violate the RSP — it just ignored the RSP by implying that if it crossed the threshold it wouldn’t necessarily respond as required by the RSP.)
- ^
Update: another part of the RSP says this threshold implements the safety buffer.
It’s “o1” or “OpenAI o1,” not “GPT-o1.”
They try to notice and work around spurious failures. Apparently 10 days was not long enough to resolve o1’s spurious failures.
METR could not confidently upper-bound the capabilities of the models during the period they had model access, given the qualitatively strong reasoning and planning capabilities, substantial performance increases from a small amount of iteration on the agent scaffold, and the high rate of potentially fixable failures even after iteration.
(Plus maybe they try to do good elicitation in other ways that require iteration — I don’t know.)
Benchmark scores are here.
This shortform is relevant to e.g. understanding what’s going on and considerations on the value of working on safety at Anthropic, not just pressuring Anthropic.
There’s a selection effect in what gets posted about. Maybe someone should write the “ways Anthropic is better than others” list to combat this.
Edit: there’s also a selection effect in what you see, since negative stuff gets more upvotes…
I tentatively think this is a high-priority ask
If you’re right, I think the upshot is (a) Anthropic should figure out whether to publish stuff rather than let it languish and/or (b) it would be better for lots of Anthropic safety researchers to instead do research that’s safe to share (rather than research that only has value if Anthropic wins the race)
I was recently surprised to notice that Anthropic doesn’t seem to have a commitment to publish its safety research.[1] It has a huge safety team but has only published ~5 papers this year (plus an evals report). So probably it has lots of safety research it’s not publishing. E.g. my impression is that it’s not publishing its scalable oversight and especially adversarial robustness and inference-time misuse prevention research.
Not-publishing-safety-research is consistent with Anthropic prioritizing the “win the race” goal over the “help all labs improve safety” goal, insofar as the research is commercially advantageous. (Insofar as it’s not, not-publishing-safety-research is baffling.)
Maybe it would be better if Anthropic published ~all of its safety research, including product-y stuff like adversarial robustness such that others can copy its practices.
(I think this is not a priority for me to investigate but I’m interested in info and takes.)
[Edit: in some cases you can achieve most of the benefit with little downside except losing commercial advantage by sharing your research/techniques with other labs, nonpublicly.]
- ^
I failed to find good sources saying Anthropic publishes its safety research. I did find:
https://www.anthropic.com/research says “we . . . share what we learn [on safety].”
President Daniela Amodei said “we publish our safety research” on a podcast once.
Cofounder Nick Joseph said this on a podcast recently (seems false but it’s just a podcast so that’s not so bad):
> we publish our safety research, so in some ways we’re making it as easy as we can for [other labs]. We’re like, “Here’s all the safety research we’ve done. Here’s as much detail as we can give about it. Please go reproduce it.”
Edit: also cofounder Chris Olah said “we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk.” But he seems to be saying that safety benefit > social cost is a necessary condition for publishing, not necessarily that the policy is to publish all such research.
OpenAI has never (to my knowledge) made public statements about not using AI to automate AI research
I agree.
OpenAI intends to use Strawberry to perform research. . . .
Among the capabilities OpenAI is aiming Strawberry at is performing long-horizon tasks (LHT), the document says, referring to complex tasks that require a model to plan ahead and perform a series of actions over an extended period of time, the first source explained.
To do so, OpenAI is creating, training and evaluating the models on what the company calls a “deep-research” dataset, according to the OpenAI internal documentation. . . .
OpenAI specifically wants its models to use these capabilities to conduct research by browsing the web autonomously with the assistance of a “CUA,” or a computer-using agent, that can take actions based on its findings, according to the document and one of the sources. OpenAI also plans to test its capabilities on doing the work of software and machine learning engineers.
I think I was thinking:
The war room transcripts will leak publicly
Generals can secretly DM each other, while keeping up appearances in the shared channels
If a general believes that all of their communication with their team will leak, we’re back to a unilateralist’s curse situation: if a general thinks they should nuke, obviously they shouldn’t say that to their team, so maybe they nuke unilaterally
(Not obvious whether this is an infohazard)
[Probably some true arguments about the payoff matrix and game theory increase P(mutual destruction). Also some false arguments about game theory — but maybe an infohazard warning makes those less likely to be posted too.]
(Also, after I became a general I observed that I didn’t know what my “launch code” was; I was hoping the LW team forgot to give everyone launch codes and this decreased P(nukes); saying this would cause everyone to know their launch codes and maybe scare the other team.)
I don’t think this is very relevant to real-world infohazards, because this is a game with explicit rules and because in the real world the low-hanging infohazards have been shared, but it seems relevant to mechanism design.