Zach Stein-Perlman
AI strategy & governance. ailabwatch.org. ailabwatch.substack.com.
Alignment and capabilities are separate.
Welcome to the forum! No worries.
Replying briefly since I was tagged:
It’s not obvious to me what details Anthropic could provide to help us understand what’s going on with the blackmail and “high-agency behavior” stuff (as opposed to capability evals, where there are lots of things Anthropic could clarify)
(Note: Anthropic presumably did not expect that this stuff would get so much attention)
(Note: Anthropic spends 30 pages of the system card on alignment)
This post is about capabilities since that’s what (companies say) is load-bearing; Anthropic says the alignment evals stuff isn’t load-bearing
Sure, censure among people who agree with you is a fine thing for a comment to do. I didn’t read Mikhail’s comment that way because it seemed to be asking Anthropic people to act differently (but without engaging with their views).
It would be easier to argue with you if you proposed a specific alternative to the status quo and argued for it. Maybe “[stop] shipping SOTA tech” is your alternative. If so: surely you’re aware of the basic arguments for why Anthropic should make powerful models; maybe you should try to identify cruxes.
So what?
I don’t know. Maybe dangerous capability evals don’t really matter. The usual story for them is that they inform companies about risks so the companies can respond appropriately. Good evals are better than nothing, but I don’t expect companies’ eval results to affect their safeguards or training/deployment decisions much in practice. (I think companies’ safety frameworks are quite weak or at least vague — a topic for another blogpost.) Maybe evals are helpful for informing other actors, like government, but I don’t really see it.
I don’t have a particular conclusion. I’m making aisafetyclaims.org because evals are a crucial part of companies’ preparations to be safe when models might have very dangerous capabilities, but I noticed that the companies are doing and interpreting evals poorly (and are being misleading about this, and aren’t getting better), and some experts are aware of this but nobody else has written it up yet.
AI companies’ eval reports mostly don’t support their claims
I think the numbers are much better than nothing and much better than any substitute that currently exists, and I’m not aware of a better design or a great way to deemphasize them while preserving their value.
Edit: like, they convey a lot of real info, and more conservative alternatives would fail to do so.
New website analyzing AI companies’ model evals
New scorecard evaluating AI companies on safety
Claude 4
(Clarification: these are EA/AI safety orgs with ~10-15 employees.)
Topic: workplace world-modeling
A friend’s manager tasked them with estimating ~10 parameters for a model. Getting even a single parameter very wrong would presumably make the bottom-line number nonsense. My friend largely didn’t understand the model and what the parameters meant; if you’d asked them “can you confidently determine what each of the parameters means,” presumably they would have noticed the answer was no. (If I understand the situation correctly, it was crazy for the manager to expect my friend to do this task.) They should have told their manager “I can’t do this” or “I’m uncertain about what these four parameters are; here’s my best guess at a precise description of each; please check this carefully and let me know.” Instead I think they just gave their best guess for the parameters! (Edit: also, I think they thought the model was bad, but I don’t think they told their manager that.)
Another friend’s manager tasked them with estimating how many hours it would take to evaluate applications for their org’s hiring round. If my friend had known the details of the application process, they could have estimated the number of applicants at each stage and the time to review each applicant at each stage. But the org hadn’t decided what the stages were yet. They should have told their manager “I can’t do this — but if evaluation-time is a crux for what-the-application-process-should-look-like, I can brainstorm several possibilities (or you can tell me the top possibilities) and estimate evaluation-time for each.” Instead I think they either made up a mainline process and estimated that, or made up several possibilities and estimated each (without checking in with the manager); I’m not sure which.
Both friends are smart, very involved in the EA/rationality community, and working at AI safety orgs.
I’d totally avoid making these mistakes. What’s going on here?
Some hypotheses (not exclusive):
Generally insufficient communication with manager
It’s crazy that they tell me about this stuff and don’t tell their managers? Actually, maybe they didn’t notice the problems until I said “wait what, how are you supposed to do that.” But after that they still didn’t put the tasks on hold and tell their managers!
Generally insufficient [disagreeableness / force of will / willingness to contradict manager]
Thinking you’re being graded on a curve, rather than realizing that when you’re estimating a bottom-line-number-to-inform-decisions, what matters is how accurate it is — sometimes getting 9/10 answers right is no better than 0/10; if you’re in such a situation and you probably won’t get 10/10, you have to say “I can’t do this” rather than just do your best
Lack of heroic responsibility; thinking that what your manager wants is for you to do the task they assigned, rather than to do it if you straightforwardly can without wasting time and otherwise check in
Anyway this shortform was prompted by world-modeling curiosity but there are some upshots:
Managees, check in with your managers when (1) you think you shouldn’t be doing this task and your manager doesn’t understand why, or (2) you’re going to spend time doing stuff and you’re not sure what your manager wants and a quick DM would tell you
Managers, cause your managees to check in with you! Some of them aren’t doing so enough! Even though you’ve already nudged them to do so more and they agreed! You’re leaving lots of value on the table!
But that’s not all of it; there’s also the skill of noticing when there’s an issue.
I’d also be interested in utterances I can utter at friends in such situations to fix the problem, but I doubt I can do much better than “it seems like there’s an issue, X; if you agree you should obviously check in with your manager.”
Update: they want “to build virtual work environments for automating software engineering—and then the rest of the economy.” Software engineering seems like one of the few things I really think shouldn’t accelerate :(.
What, no, Oli says OP would do a fine job and make grants in rationality community-building, AI welfare, right-wing policy stuff, invertebrate welfare, etc. but it’s constrained by GV.
[Disagreeing since this is currently the top comment and people might read it rather than listen to the podcast.]
I agree people often aren’t careful about this.
Anthropic says
During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments . . . . This undesirable special-casing behavior emerged as a result of “reward hacking” during reinforcement learning training.
Similarly OpenAI suggests that cheating behavior is due to RL.
Rant on “deceiving” AIs
tl;dr: Keep your promises to AIs; it’s fine to do stuff like teaching them false facts or misleading them about their situation during testing and deployment; but if you wanna do cheap stuff to cause them to know that they might [have been taught false facts / be being tested / be being misled during deployment], sure, go for it.
Disclaimer: maybe more like explaining my position than justifying my position.
Sometimes we really want to deceive AIs (or at least make them uncertain about their situation). E.g.:
1. Training them to believe false things about dangerous topics like CBRN stuff
2. Training them to believe false things about their situation, to mislead them about how to subvert safety measures
3. Testing them with honeypots which they can’t distinguish from real opportunities during deployment
4. Testing untrusted monitors by showing them attacks which they can’t reliably distinguish from real attacks during deployment (to measure effectiveness and to catch them if they frequently say “lgtm” when shown attacks)
5. Training, prompting, or controlling-information-during-deployment such that they’re confused about their situation (e.g. confused about geopolitics and the AI race and what year it is)
(Similarly sometimes we want to give the AIs reflexive urges to avoid doing extremely bad things [and to do extremely good things, like whistleblow on themselves if they do something suspicious].)
(3-4 don’t require deception, just making the AIs uncertain about their situation; it would work similarly well if we added “this might be a test” to all prompts, during training and deployment. 1-2 and 5 require narrow deception, but it would still work if you added “we might have told or trained you to believe false things and this might be a test” to all prompts.)
Often people suggest that we should avoid deceiving AIs, because (A) having a reputation of honesty could enable some kinds of trade with misaligned AIs and/or (B) it’s directly unethical.
On (A), I want to distinguish breaking actual promises from just testing the AIs or misleading them for safety. If the AIs think you will break promises of the form “we’ll give you 1% of our share of the lightcone if you [do useful work / point out vulnerabilities rather than exploiting them / etc.],” that’s bad. If the AIs believe you might have taught them false facts or might be testing them, that seems fine, doesn’t interfere with making deals at all. Just clarify that you never mislead them about actual promises.
On (B), in cases like 1-5, when I imagine myself in the AI’s position I think I wouldn’t care about whether the message was added to the prompt. But if adding “we might have told or trained you to believe false things and this might be a test” to all prompts makes you feel better, or the AI asks for it when you explain the situation, sure, it’s low-stakes. (Or not literally adding it to the prompt, especially if we can’t ensure it would stay added to the prompt in rogue deployments, but training the AI so it is aware of this.[1]) (And fwiw, I think in the AI’s position: 3-4 I basically wouldn’t mind; 1-2 and 5 I might be slightly sad about but would totally get and not be mad about; teaching AIs false facts in the mid-late 2020s seems super reasonable from behind the veil of ignorance given my/humanity’s epistemic state.)
Recent context: discussion on “Modifying LLM Beliefs with Synthetic Document Finetuning.”
[1] My guess is that this training is fine/cheap and preserves almost all of the safety benefits — we’re counting on the AI not knowing which of its beliefs are false, not on it being unaware that it’s been taught false facts. Adding stuff to prompts might be worse because not seeing that text would signal a successful rogue deployment.
My weakly-held cached take is: I agree on CBRN/bio (and of course alignment) and I think Anthropic is pretty similar to OpenAI/DeepMind on cyber and AI R&D (and scheming capabilities), at least if you consider stuff outside the model card (evals papers + open-sourcing the evals).