Zach Stein-Perlman (AI strategy & governance; ailabwatch.org, ailabwatch.substack.com)
DeepMind updated its Frontier Safety Framework (blogpost, framework, original framework). It associates “recommended security levels” with capability levels, but the security levels are low. It mentions deceptive alignment and control (both control evals as a safety case and monitoring as a mitigation); that’s nice. The overall structure is roughly we’ll do evals and make a safety case, with some capabilities mapped to recommended security levels in advance. It’s not very commitment-y:
We intend to evaluate our most powerful frontier models regularly
When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan.
These recommended security levels reflect our current thinking and may be adjusted if our empirical understanding of the risks changes.
If we assess that a model has reached a CCL that poses an unmitigated and material risk to overall public safety, we aim to share information with appropriate government authorities where it will facilitate the development of safe AI.
Possibly the “Deployment Mitigations” section is more commitment-y.
I expect many more such policies will come out in the next week; I’ll probably write a post about them all at the end rather than writing about them one by one, unless xAI or OpenAI says something particularly notable.
Meta: Frontier AI Framework
My guess is it’s referring to Anthropic’s position on SB 1047, or Dario’s and Jack Clark’s statements that it’s too early for strong regulation, or how Anthropic’s policy recommendations often exclude RSP-y stuff (and when they do suggest requiring RSPs, they would leave the details up to the company).
o3-mini is out (blogpost, tweet). Performance isn’t super noteworthy (at first glance), in part since we already knew about o3 performance.
Non-fact-checked quick takes on the system card:
the model referred to below as the o3-mini post-mitigation model was the final model checkpoint as of Jan 31, 2025 (unless otherwise specified)
Big if true (and if Preparedness had time to do elicitation and fix spurious failures)
If this is robust to jailbreaks, great, but presumably it’s not, so low post-mitigation performance is far from sufficient for safety-from-misuse; post-mitigation performance isn’t what we care about (we care about approximately what good jailbreakers get on the post-mitigation model).
No mention of METR, Apollo, or even US AISI? (Maybe too early to pay much attention to this, e.g. maybe there’ll be a full-o3 system card soon.) [Edit: also maybe it’s just not much more powerful than o1.]
32% is a lot
The dataset is 1/4 T1 (easier), 1/2 T2, 1/4 T3 (harder); 28% on T3 means that, for o3-mini, there’s not much difference between T1 and T3 (at least for the easiest-for-LMs quarter of T3); see the quick arithmetic below
Probably o3-mini is successfully using heuristics to get the right answer and could solve very few T3 problems in a deep or humanlike way
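Quick sanity check on that claim (a rough sketch, assuming the 32% figure is o3-mini’s overall FrontierMath pass rate across all tiers and the 28% figure is its pass rate on T3 alone):

\[
\tfrac{1}{4}p_1 + \tfrac{1}{2}p_2 + \tfrac{1}{4}p_3 = 0.32, \qquad p_3 = 0.28
\]
\[
\Rightarrow\ \tfrac{1}{4}p_1 + \tfrac{1}{2}p_2 = 0.32 - \tfrac{1}{4}\cdot 0.28 = 0.25
\ \Rightarrow\ \frac{\tfrac{1}{4}p_1 + \tfrac{1}{2}p_2}{\tfrac{3}{4}} \approx 0.33
\]

where \(p_1, p_2, p_3\) are the pass rates on T1, T2, T3. Under those assumptions, the implied average pass rate on T1+T2 is only about 33%, barely above the 28% on T3.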
Dario Amodei: On DeepSeek and Export Controls
Thanks. The tax treatment is terrible. And I would like more clarity on how transformative AI would affect S&P 500 prices (per this comment). But this seems decent (alongside AI-related calls) because 6 years is so long.
I wrote this for someone, but maybe it’s helpful for others:
What labs should do:
I think the most important things for a relatively responsible company are control and security. (For irresponsible companies, I roughly want them to make a great RSP and thus become a responsible company.)
Reading recommendations for people like you (not a control expert, but with enough context to mostly understand the Greenblatt plan):
Control: Redwood blogposts[1] or ask a Redwood human “what’s the threat model” and “what are the most promising control techniques”
Security: not worth trying to understand but there’s A Playbook for Securing AI Model Weights + Securing AI Model Weights
A few more things: What AI companies should do: Some rough ideas
Lots more things + overall plan: A Plan for Technical AI Safety with Current Science (Greenblatt 2023)
More links: Lab governance reading list
What labs are doing:
Evals: it’s complicated; OpenAI, DeepMind, and Anthropic seem close to doing good model evals for dangerous capabilities; see DC evals: labs’ practices plus the links in the top two rows (associated blogpost + model cards)
RSPs: all existing RSPs are super weak and you shouldn’t expect them to matter; maybe see The current state of RSPs
Control: nothing is happening at the labs, except a little research at Anthropic and DeepMind
Security: nobody is prepared; nobody is trying to be prepared
Internal governance: you should basically model all of the companies as doing whatever leadership wants. In particular: (1) the OpenAI nonprofit is probably controlled by Sam Altman, and it will probably lose its control over OpenAI soon, and (2) possibly the Anthropic LTBT will matter, but it doesn’t seem to be working well.
Publishing safety research: DeepMind and Anthropic publish some good stuff but surprisingly little given how many safety researchers they employ; see List of AI safety papers from companies, 2023–2024
Resources:
I think ideally we’d have several versions of a model. The default version would be ignorant about AI risk, AI safety and evaluation techniques, and maybe modern LLMs (in addition to misuse-y dangerous capabilities). When you need a model that’s knowledgeable about that stuff, you use the knowledgeable version.
Somewhat related: https://www.alignmentforum.org/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais
[Perfunctory review to get this post to the final phase]
Solid post. Still good. I think a responsible developer shouldn’t unilaterally pause, but it should talk about the crazy situation it’s in, the costs and benefits of various actions, what it would do in different worlds, and its views on risks. (And none of the labs have done this; in particular, Core Views is not this.)
List of AI safety papers from companies, 2023–2024
One more consideration against (or an important part of “Bureaucracy”): sometimes your lab doesn’t let you publish your research.
Yep, the final phase-in date was in November 2024.
Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don’t think these example lists were intended to be an extensive plan).
See also A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a detailed (but rough, out-of-date, and very high-context) plan.
Yeah. I agree/concede that you can explain why you can’t convince people that their own work is useless. But if you’re positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.
I feel like John’s view entails that he would be able to convince my friends that various-research-agendas-my-friends-like are doomed. (And I’m pretty sure that’s false.) I assume John doesn’t believe that, and I wonder why he doesn’t think his view entails it.
I wonder whether John believes that well-liked research, e.g. Fabien’s list, is actually not valuable, or is a rare exception coming from a small subset of the “alignment research” field.
I do not.
On the contrary, I think ~all of the “alignment researchers” I know claim to be working on the big problem, and I think ~90% of them are indeed doing work that looks good in terms of the big problem. (Researchers I don’t know are likely substantially worse, but not a ton worse.)
In particular I think all of the alignment-orgs-I’m-socially-close-to do work that looks good in terms of the big problem: Redwood, METR, ARC. And I think the other well-known orgs are also good.
This doesn’t feel odd: these people are smart and actually care about the big problem; if their work were in the “even if this succeeds it obviously wouldn’t be helpful” category, they’d want to know (and, given the “obviously,” would figure that out).
Possibly the situation is very different in academia or MATS-land; for now I’m just talking about the people around me.
Yeah, I agree sometimes people decide to work on problems largely because they’re tractable [edit: or because they’re good for safety via getting alignment research or other good work out of early AGIs]. I’m unconvinced by the “flinching away” or “dishonest” characterization.
There also used to be a page for Preparedness: https://web.archive.org/web/20240603125126/https://openai.com/preparedness/. Now it redirects to the safety page above.
(Same for Superalignment but that’s less interesting: https://web.archive.org/web/20240602012439/https://openai.com/superalignment/.)