Zach Stein-Perlman (Zachary Stein-Perlman)

Karma: 7,297

AI strategy & governance. ailabwatch.org.

Zach Stein-Perlman 28 Sep 2024 18:22 UTC
3 points
0
in reply to: Martin Randall’s comment on: [Completed] The 2024 Petrov Day Scenario
I think I was thinking:
1. The war room transcripts will leak publicly
2. Generals can secretly DM each other, while keeping up appearances in the shared channels
3. If a general believes that all of their communication with their team will leak, we’re back to a unilateralist’s curse situation: if a general thinks they should nuke, obviously they shouldn’t say that to their team, so maybe they nuke unilaterally
  1. (Not obvious whether this is an infohazard)
4. [Probably some true arguments about the payoff matrix and game theory increase P(mutual destruction). Also some false arguments about game theory — but maybe an infohazard warning makes those less likely to be posted too.]
(Also after I became a general I observed that I didn’t know what my “launch code” was; I was hoping the LW team forgot to give everyone launch codes and this decreased P(nukes); saying this would cause everyone to know their launch codes and maybe scare the other team.)
I don’t think this is very relevant to real-world infohazards, because this is a game with explicit rules and because in the real world the low-hanging infohazards have been shared, but it seems relevant to mechanism design.

Zach Stein-Perlman 27 Sep 2024 18:30 UTC
3 points
0
in reply to: GeneralBelov’s comment on: [Completed] The 2024 Petrov Day Scenario
Update with two new responses:
I think this is 10 generals, 1 petrov, and one other person (either the other petrov or a citizen, not sure, wasn’t super rigorous)

Zach Stein-Perlman 26 Sep 2024 18:51 UTC
4 points
0
in reply to: Logan Riggs’s comment on: [Completed] The 2024 Petrov Day Scenario
The post says generals’ names will be published tomorrow.

Zach Stein-Perlman 26 Sep 2024 18:50 UTC
7 points
1
in reply to: aphyer’s comment on: [Completed] The 2024 Petrov Day Scenario
No. I noticed ~2 more subtle infohazards and I was wishing for nobody to post them and I realized I can decrease that probability by making an infohazard warning.
I ask that you refrain from being the reason that security-by-obscurity fails, if you notice subtle infohazards.

Zach Stein-Perlman 26 Sep 2024 18:29 UTC
15 points
0
on: [Completed] The 2024 Petrov Day Scenario
Some true observations are infohazards, making destruction more likely. Please think carefully before posting observations. Even if you feel clever. You can post hashes here instead to later reveal how clever you were, if you need.
LOOSE LIPS SINK SHIPS
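
For anyone unsure how the hash approach works, here is a minimal sketch of a commit-then-reveal scheme. It assumes SHA-256 and a random nonce (so short or guessable observations can't be brute-forced from the hash); the function names `commit` and `verify` are hypothetical, not something the post specifies.

```python
import hashlib
import secrets


def commit(message: str) -> tuple[str, str]:
    """Return (digest, nonce). Post the digest now; keep the nonce and message private."""
    # A random nonce stops others from guessing the message by hashing candidates.
    nonce = secrets.token_hex(16)
    digest = hashlib.sha256(f"{nonce}:{message}".encode("utf-8")).hexdigest()
    return digest, nonce


def verify(digest: str, nonce: str, message: str) -> bool:
    """Check that a revealed (nonce, message) pair matches the digest posted earlier."""
    expected = hashlib.sha256(f"{nonce}:{message}".encode("utf-8")).hexdigest()
    return secrets.compare_digest(expected, digest)


if __name__ == "__main__":
    digest, nonce = commit("my clever observation")
    print("post this now:", digest)
    # Later, reveal the nonce and the message so anyone can check the digest.
    print("verifies:", verify(digest, nonce, "my clever observation"))
```

After the game ends, you reveal the nonce and the original text, and anyone can recompute the hash to confirm you had the observation at the time you posted the digest.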

Zach Stein-Perlman 26 Sep 2024 18:06 UTC
4 points
2
in reply to: aphyer’s comment on: [Completed] The 2024 Petrov Day Scenario
I think it’s better to be angry at the team that launched the nukes?

Zach Stein-Perlman 26 Sep 2024 2:45 UTC
19 points
0
on: Mira Murati leaves OpenAI/ OpenAI to remove non-profit control
Two other leaders are also leaving

Zach Stein-Perlman 24 Sep 2024 4:51 UTC
8 points
7
in reply to: rajathsalegame’s comment on: Model evals for dangerous capabilities
No. But I’m skeptical: seems hard to imagine provable safety, much less competitive with the default path to powerful AI, much less how post-hoc evals are relevant.

Zach Stein-Perlman 24 Sep 2024 2:09 UTC
2 points
0
in reply to: habryka’s comment on: Habryka’s Shortform Feed
I ~always want to see the outline when I first open a post and when I’m reading/skimming through it. I wish the outline appeared when-not-hover-over-ing for me.

Zach Stein-Perlman 24 Sep 2024 0:30 UTC
11 points
2
on: Zach Stein-Perlman’s Shortform
Figuring out whether an RSP is good is hard.^[1] You need to consider what high-level capabilities/risks it applies to, and for each of them determine whether the evals successfully measure what they’re supposed to and whether high-level risk thresholds trigger appropriate responses and whether high-level risk thresholds are successfully operationalized in terms of low-level evals. Getting one of these things wrong can make your assessment totally wrong. Except that’s just for fully-fleshed-out RSPs — in reality the labs haven’t operationalized their high-level thresholds and sometimes don’t really have responses planned. And different labs’ RSPs have different structures/ontologies.
Quantitatively comparing RSPs in a somewhat-legible manner is even harder.
I am not enthusiastic about a recent paper outlining a rubric for evaluating RSPs. Mostly I worry that it crams all of the crucial things—is the response to reaching high capability-levels adequate? are the capability-levels low enough that we’ll reach them before it’s too late?—into a single criterion, “Credibility.” Most of my concern about labs’ RSPs comes from those things just being inadequate; again, if your response is too weak or your thresholds are too high or your evals are bad, it just doesn’t matter how good the rest of your RSP is. (Also, minor: the “Difficulty” indicator punishes a lab for making ambitious commitments; this seems kinda backwards.)
(I gave up on making an RSP rubric myself because it seemed impossible to do decently well unless it’s quite complex and has some way to evaluate hard-to-evaluate things like eval-quality and planned responses.)
(And maybe it’s reasonable for labs to not do so much specific-committing-in-advance.)
1. ^
  Well, sometimes figuring out that an RSP is bad is easy. So: determining that an RSP is good is hard. (Being good requires lots of factors to all be good; being bad requires just one to be bad.)

Zach Stein-Perlman 23 Sep 2024 17:33 UTC
6 points
3
in reply to: sepiatone’s comment on: Model evals for dangerous capabilities
> Stronger scaffolding could skew evaluation results.
Stronger scaffolding makes evals better.
I think labs should at least demonstrate that their scaffolding is at least as good as some baseline. If there’s no established baseline scaffolding for the eval, they can say “we did XYZ and we got n%, our secret scaffolding does better”; when there is an established baseline scaffolding, they can compare to that (e.g. the best scaffold for SWE-bench Verified is called Agentless; in the o1 system card, OpenAI reported results from running its models in this scaffold).

Zach Stein-Perlman 23 Sep 2024 11:00 UTC
15 points
0
on: Model evals for dangerous capabilities
Footnote to table cells D18 and D19:
My reading of Anthropic’s ARA threshold, which nobody has yet contradicted:
1. The RSP defines/operationalizes 50% of ARA tasks (10% of the time) as a sufficient condition for ASL-3. ~~(Sidenote: I think the literal reading is that this is an ASL-3 definition, but I assume it’s supposed to be an operationalization of the safety buffer, 6x below ASL-3.~~^[1])
2. The May RSP evals report suggests 50% of ARA tasks (10% of the time) is merely a yellow line. (Pages 2 and 6, plus page 4 says “ARA Yellow Lines are clearly defined in the RSP” but the RSP’s ARA threshold was not just a yellow line.)
3. This is not kosher; Anthropic needs to formally amend the RSP to [raise the threshold / make the old threshold no longer binding].
(It’s totally plausible that the old threshold was too low, but the solution is to raise it officially, not pretend that it’s just a yellow line.)
(The forthcoming RSP update will make old thresholds moot; I’m just concerned that Anthropic ignored the RSP in the past.)
(Anthropic didn’t cross the threshold and so didn’t violate the RSP — it just ignored the RSP by implying that if it crossed the threshold it wouldn’t necessarily respond as required by the RSP.)
1. ^
  Update: another part of the RSP says this threshold implements the safety buffer.
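
To make the “50% of ARA tasks (10% of the time)” phrasing concrete, here is a rough sketch of one reading of it: a task counts as passed if the model succeeds on at least 10% of attempts, and the threshold is crossed if at least 50% of tasks are passed. This reading, the function name, and the data layout are assumptions for illustration, not Anthropic's actual procedure.

```python
def crosses_ara_threshold(
    attempts_by_task: dict[str, list[bool]],
    per_task_rate: float = 0.10,   # assumed: minimum success rate for a task to count as passed
    task_fraction: float = 0.50,   # assumed: minimum fraction of tasks that must be passed
) -> bool:
    """Return True if enough tasks are passed often enough, under the assumed reading."""
    if not attempts_by_task:
        raise ValueError("need at least one task")
    passed = sum(
        1
        for attempts in attempts_by_task.values()
        if attempts and sum(attempts) / len(attempts) >= per_task_rate
    )
    return passed / len(attempts_by_task) >= task_fraction


if __name__ == "__main__":
    # Hypothetical results: task_a succeeds on 1 of 10 attempts, task_b never succeeds.
    results = {
        "task_a": [True] + [False] * 9,
        "task_b": [False] * 10,
    }
    print(crosses_ara_threshold(results))
```

Under this reading, the dispute above is about what follows when the function returns True (required ASL-3 response vs. a yellow line), not about how it is computed.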

Zach Stein-Perlman 16 Sep 2024 20:34 UTC
13 points
7
in reply to: Zvi’s comment on: GPT-4o1
It’s “o1” or “OpenAI o1,” not “GPT-o1.”

Zach Stein-Perlman 12 Sep 2024 23:51 UTC
15 points
6
in reply to: Joseph Miller’s comment on: OpenAI o1
They try to notice and work around spurious failures. Apparently 10 days was not long enough to resolve o1’s spurious failures.
> METR could not confidently upper-bound the capabilities of the models during the period they had model access, given the qualitatively strong reasoning and planning capabilities, substantial performance increases from a small amount of iteration on the agent scaffold, and the high rate of potentially fixable failures even after iteration.
(Plus maybe they try to do good elicitation in other ways that require iteration — I don’t know.)

Zach Stein-Perlman 12 Sep 2024 19:03 UTC
6 points
0
in reply to: exmateriae’s comment on: OpenAI o1
Benchmark scores are here.

Zach Stein-Perlman 7 Sep 2024 23:50 UTC
3 points
−1
in reply to: davekasten’s comment on: Zach Stein-Perlman’s Shortform
This shortform is relevant to e.g. understanding what’s going on and considerations on the value of working on safety at Anthropic, not just pressuring Anthropic.
@Neel Nanda

Zach Stein-Perlman 7 Sep 2024 16:52 UTC
7 points
8
in reply to: Shankar Sivarajan’s comment on: Zach Stein-Perlman’s Shortform
There’s a selection effect in what gets posted about. Maybe someone should write the “ways Anthropic is better than others” list to combat this.
Edit: there’s also a selection effect in what you see, since negative stuff gets more upvotes…

Zach Stein-Perlman 6 Sep 2024 20:11 UTC
4 points
2
in reply to: Raemon’s comment on: Zach Stein-Perlman’s Shortform
1. I tentatively think this is a high-priority ask
2. Capabilities research isn’t a monolith and improving capabilities without increasing spooky black-box reasoning seems pretty fine
3. If you’re right, I think the upshot is (a) Anthropic should figure out whether to publish stuff rather than let it languish and/or (b) it would be better for lots of Anthropic safety researchers to instead do research that’s safe to share (rather than research that only has value if Anthropic wins the race)

Zach Stein-Perlman 6 Sep 2024 19:30 UTC
28 points
12
on: Zach Stein-Perlman’s Shortform
I was recently surprised to notice that Anthropic doesn’t seem to have a commitment to publish its safety research.^[1] It has a huge safety team but has only published ~5 papers this year (plus an evals report). So probably it has lots of safety research it’s not publishing. E.g. my impression is that it’s not publishing its scalable oversight and especially adversarial robustness and inference-time misuse prevention research.
Not-publishing-safety-research is consistent with Anthropic prioritizing the “win the race” goal over the “help all labs improve safety” goal, insofar as the research is commercially advantageous. (Insofar as it’s not, not-publishing-safety-research is baffling.)
Maybe it would be better if Anthropic published ~all of its safety research, including product-y stuff like adversarial robustness such that others can copy its practices.
(I think this is not a priority for me to investigate but I’m interested in info and takes.)
[Edit: in some cases you can achieve most of the benefit with little downside except losing commercial advantage by sharing your research/techniques with other labs, nonpublicly.]
1. ^
  I failed to find good sources saying Anthropic publishes its safety research. I did find:
  https://www.anthropic.com/research says “we . . . share what we learn [on safety].”
  President Daniela Amodei said “we publish our safety research” on a podcast once.
  Cofounder Nick Joseph said this on a podcast recently (seems false but it’s just a podcast so that’s not so bad):
  > we publish our safety research, so in some ways we’re making it as easy as we can for [other labs]. We’re like, “Here’s all the safety research we’ve done. Here’s as much detail as we can give about it. Please go reproduce it.”
  Edit: also cofounder Chris Olah said “we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk.” But he seems to be saying that safety benefit > social cost is a necessary condition for publishing, not necessarily that the policy is to publish all such research.

Zach Stein-Perlman 4 Sep 2024 22:09 UTC
9 points
3
in reply to: nikola’s comment on: nikola’s Shortform
> OpenAI has never (to my knowledge) made public statements about not using AI to automate AI research
I agree.
Another source:
> OpenAI intends to use Strawberry to perform research. . . .
> Among the capabilities OpenAI is aiming Strawberry at is performing long-horizon tasks (LHT), the document says, referring to complex tasks that require a model to plan ahead and perform a series of actions over an extended period of time, the first source explained.
> To do so, OpenAI is creating, training and evaluating the models on what the company calls a “deep-research” dataset, according to the OpenAI internal documentation. . . .
> OpenAI specifically wants its models to use these capabilities to conduct research by browsing the web autonomously with the assistance of a “CUA,” or a computer-using agent, that can take actions based on its findings, according to the document and one of the sources. OpenAI also plans to test its capabilities on doing the work of software and machine learning engineers.