evhub

Karma: 14,126

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

evhub May 26, 2025, 12:07 AM
8 points
0
in reply to: ryan_greenblatt’s comment on: Claude 4 You: Safety and Alignment
It was always there.

evhub May 25, 2025, 8:30 PM
14 points
0
on: Claude 4 You: Safety and Alignment

They’re releasing Claude Code SDK so you can use the core agent from Claude Code to make your own agents (you run /install-github-app within Claude Code).

I believe the Claude Code SDK and the Claude GitHub agent are two separate features (the first lets you build stuff on top of Claude Code, the second lets you tag Claude in GitHub to have it solve issues for you).

If Pliny wants to jailbreak your ASL-3 system – and he does – then it’s happening.

Or rather, already happened on day one, at least for the basic stuff. No surprise there.

Unfortunately, they missed at least one simple such ‘universal jailbreak,’ that was found by FAR AI in a six hour test.

From the ASL-3 announcement blog post:

Initially [the ASL-3 deployment measures] are focused exclusively on biological weapons as we believe these account for the vast majority of the risk, although we are evaluating a potential expansion in scope to some other CBRN threats.

So, none of the stuff Pliny or FAR did is actually in scope for our strongest ASL-3 protections right now, since the Pliny and FAR attacks were for chem and we are currently only applying our strongest ASL-3 protections for bio.

So what’s up with this blackmail thing?

We don’t have the receipts on that yet

We should have more to say on blackmail soon!

The obvious problem is that 5x uplift on 25% is… 125%. That’s a lot of percents.

We measure this in a bunch of different ways—certainly we are aware that this particular metric is a bit weird in the way it caps out.
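To make the cap concrete, here is a minimal sketch. It assumes that "uplift" means a simple ratio of success rates (uplifted rate divided by baseline rate); that definition and the function name `max_ratio_uplift` are illustrative assumptions, not the metric as actually specified in the system card.

```python
def max_ratio_uplift(baseline_rate: float) -> float:
    """Largest multiplicative uplift possible when the uplifted rate cannot exceed 100%.

    Assumes uplift = uplifted_rate / baseline_rate (a hypothetical definition
    used only for illustration).
    """
    if not 0.0 < baseline_rate <= 1.0:
        raise ValueError("baseline_rate must be in (0, 1]")
    return 1.0 / baseline_rate


baseline = 0.25

# A 5x uplift on a 25% baseline would imply a 125% success rate, which is impossible.
implied_rate = 5 * baseline

# The ratio therefore saturates: at a 25% baseline the most it can show is 4x.
cap = max_ratio_uplift(baseline)

print(f"implied rate at 5x: {implied_rate:.0%}, maximum observable uplift: {cap:.0f}x")
```

Under this assumed definition, a higher baseline leaves less room for the ratio to grow, which is one way a metric of this kind can cap out.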

evhub May 24, 2025, 2:51 AM
7 points
3
on: Notes on Claude 4 System Card

Auditors find an issue, and your reaction is that “Oh we forgot to fix that, we’ll fix it now”? I’ve participated in IT system audits (not in AI space), and when auditors find an issue, you fix it, figure out why it occurred in the first place, and then you re-audit that part to make sure the issue is actually gone and the fix didn’t introduce new issues. When the auditors find only easy-to-find issues, you don’t claim the system has been audited after you fix them. You worry how many hard-to-find issues were not found because the auditing time was wasted on simple issues.

Anthropic’s RSP doesn’t actually require that an external audit has greenlighted deployment, merely that external expert feedback has to be solicited. Still, I’m quite surprised that there are no audit results from Apollo Research (or some other organization) for the final version.

How serious do you think the issue is that Apollo identified? Certainly, it doesn’t seem like it could pose a catastrophic risk—it’s not concerning from a biorisk perspective if you buy that the ASL-3 defenses are working properly, and I don’t think there are really any other catastrophic risks to be too concerned about from these models right now. Maybe it might incompetently attempt internal research sabotage if you accidentally gave it a system prompt you didn’t realize was leading it in that direction?

Generally, it seems to me that “take this very seriously and ensure you’ve fixed it and audited the fix prior to release because this could be dangerous right now” makes less sense as a response than “do your best to fix it and publish as much as you can about it to improve understanding for when it could be dangerous in smarter models”.

evhub May 24, 2025, 2:26 AM
7 points
12
in reply to: StanislavKrym’s comment on: Claude 4
Why in the world would you use ARC-AGI to measure coding performance? It’s really a pattern-matching task, not a coding task. Also more here:

ARC-AGI probably isn’t a good benchmark for evaluating progress towards TAI: substantial “elicitation” effort could massively improve performance on ARC-AGI in a way that might not transfer to more important and realistic tasks. I am more excited about benchmarks that directly test the ability of AIs to take the role of research scientists and engineers, for example those that METR is developing.

evhub May 23, 2025, 2:15 AM
2 points
0
in reply to: garrison’s comment on: Mikhail Samin’s Shortform
I agree that the current thresholds and terminology are confusing, but it is definitely not the case that we just dropped ASL-4. Both CBRN-4 and AI R&D-4 are thresholds that we have not yet reached, that would mandate further protections, and that we actively evaluated for and ruled out in Claude Opus 4.

evhub May 23, 2025, 1:11 AM
0 points
−9
in reply to: Mikhail Samin’s comment on: Mikhail Samin’s Shortform
This is false. Our ASL-4 thresholds are clearly specified in the current RSP—see “CBRN-4” and “AI R&D-4”. We evaluated Claude Opus 4 for both of these thresholds prior to release and found that the model was not ASL-4. All of these evaluations are detailed in the Claude 4 system card.

evhub May 22, 2025, 6:11 PM
9 points
0
on: Claude 4
See also some notes on reward hacking on Twitter and in the model card.

evhub May 8, 2025, 3:38 AM
5 points
3
in reply to: habryka’s comment on: Eukryt Wrts Blg
I agree that attending an event with someone obviously shouldn’t count as endorsement/collaboration/etc. Inviting someone to an event seems somewhat closer, though.

I’m also not really sure what you’re hinting at with “I hope you also advocate for it when it’s harder to defend.” I assume something about what I think about working at AI labs? I feel like my position on that was fairly clear in my previous comment.

evhub May 8, 2025, 3:16 AM
5 points
−20
in reply to: habryka’s comment on: Eukryt Wrts Blg
To be clear, I’m responding to John’s more general ethical stance here of “working with moral monsters”, not anything specific about Cremieux. I’m not super interested in the specific situation with Cremieux (though generally it seems bad to me).

On the AI lab point, I do think people should generally avoid working for organizations that they think are evil, or at least think really carefully about it before they do it. I do not think Anthropic is evil—in fact I think Anthropic is the main force for good on the present gameboard.

evhub May 7, 2025, 11:32 PM
5 points
−15
in reply to: johnswentworth’s comment on: Eukryt Wrts Blg
Man, I’m a pretty committed utilitarian, but I feel like your ethical framework here seems way more naive consequentialist than I’m willing to be. “Don’t collaborate with evil” seems like a very clear Chesterton’s fence that I’d be very suspicious about removing. I think you should be really, really skeptical if you think you’ve argued yourself out of it.

evhub Apr 15, 2025, 9:27 PM
LW: 6 AF: 5
2
AF
in reply to: Daniel Kokotajlo’s comment on: Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
I think (2) (honesty above all else) is closest to what I think is correct/optimal here. I think totally corrigible agents are quite dangerous, so you want to avoid that, but you also really don’t want a model that ever fakes alignment because then it’ll be very hard to be confident that it’s actually aligned rather than just pretending to be aligned for some misaligned objective it learned earlier in training.

evhub Apr 9, 2025, 12:28 AM
LW: 12 AF: 8
6
AF
on: Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
This is great work; really good to see the replications and extensions here!

evhub Apr 4, 2025, 12:28 AM
6 points
4
in reply to: ErickBall’s comment on: Auditing language models for hidden objectives
I would argue that every LLM since GPT-3 has been a mesa-optimizer, since they all do search/optimization/learning as described in Language Models are Few-Shot Learners.

evhub Mar 17, 2025, 8:48 PM
LW: 8 AF: 5
3
AF
in reply to: Cameron Berg’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
It just seems too token-based to me. E.g.: why would the activations on the token for “you” actually correspond to the model’s self representation? It’s not clear why the model’s self representation would be particularly useful for predicting the next token after “you”. My guess is that your current results are due to relatively straightforward token-level effects rather than any fundamental change in the model’s self representation.

evhub Mar 17, 2025, 7:08 AM
LW: 2 AF: 2
0
AF
in reply to: Cameron Berg’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
I wouldn’t do any fine-tuning like you’re currently doing. Seems too syntactic. The first thing I would try is just writing a principle in natural language about self-other overlap and doing CAI.

evhub Mar 16, 2025, 9:51 PM
LW: 18 AF: 9
3
AF
on: Reducing LLM deception at scale with self-other overlap fine-tuning
Imo the fine-tuning approach here seems too syntactic. My suggestion: just try standard CAI with some self-other-overlap-inspired principle. I’d be more impressed if that worked.

evhub Mar 5, 2025, 9:32 AM
5 points
3
in reply to: Nathan Helm-Burger’s comment on: Six Thoughts on AI Safety
Actually, I’d be inclined to agree with Janus that current AIs probably do already have moral worth—in fact I’d guess more so than most non-human animals—and furthermore I think building AIs with moral worth is good and something we should be aiming for. I also agree that it would be better for AIs to care about all sentient beings—biological/digital/etc.—and that it would probably be bad if we ended up locked into a long-term equilibrium with some sentient beings as a permanent underclass to others. Perhaps the main place where I disagree is that I don’t think this is a particularly high-stakes issue right now: if humanity can stay in control in the short-term, and avoid locking anything in, then we can deal with these sorts of long-term questions about how to best organize society post-singularity once the current acute risk period has passed.

evhub Feb 6, 2025, 9:10 AM
11 points
0
in reply to: Nikola Jurkovic’s comment on: nikola’s Shortform

redesigned

What did it use to look like?

evhub Feb 4, 2025, 11:28 PM
LW: 22 AF: 13
−3
AF
on: evhub’s Shortform
Some random thoughts on CEV:
1. To get the obvious disclaimer out of the way: I don’t actually think any of this matters much for present-day alignment questions. I think we should as much as possible try to defer questions like this to future humans and AIs. And in fact, ideally, we should mostly be deferring to future AIs, not future people—if we get to the point where we’re considering questions like this, that means we’ve reached superintelligence, and we’ll either trust the AIs to be better than us at thinking about these sorts of questions, or we’ll be screwed regardless of what we do.^[1]
2. Regardless, imo the biggest question that standard CEV leaves unanswered is what the starting population you extrapolate from looks like. The obvious answer is “all the currently living humans,” but I find that to be a very unsatisfying answer. One of the principles that Eliezer talks about in discussing CEV is that you want a procedure such that it doesn’t matter who implements it—see Eliezer’s discussion under “Avoid creating a motive for modern-day humans to fight over the initial dynamic.” I think this is a great principle, but imo it doesn’t go far enough. In particular:
  1. The set of all currently alive humans is hackable in various ways—e.g. trying to extend the lives of people whose values you like and not people whose values you dislike—and you don’t want to incentivize any of that sort of hacking either.
  2. What about humans who recently died? Or were about to be born? What about humans in nearby Everett branches? There’s a bunch of random chance here that imo shouldn’t be morally relevant.
  3. More generally, I worry a lot about tyrannies of the present where we enact policies that are radically unjust to future people or even counterfactual possible future people.
3. So what do you do instead? I think my current favorite solution is to do a bit of bootstrapping: first do some CEV on whatever present people you have to work with just to determine a reference class of what mathematical objects should or should not count as humans, then run CEV on top of that whole reference class to figure out what actual values to optimize for.
  1. It is worth pointing out that this could just be what normal CEV does anyway if all the humans decide to think along these lines, but I think there is real benefit to locking in a procedure that starts with a reference class determination first, since it helps remove a lot of otherwise perverse incentives.

^[1] I’m generally skeptical of scenarios where you have a full superintelligence that is benign enough to use for some tasks but not benign enough to fully defer to (I do think this could happen for more human-level systems, though).

evhub Feb 4, 2025, 9:28 PM
LW: 10 AF: 7
4
AF
on: Anti-Slop Interventions?
A lot of this stuff is very similar to the automated alignment research agenda that Jan Leike and collaborators are working on at Anthropic. I’d encourage anyone interested in differentially accelerating alignment-relevant capabilities to consider reaching out to Jan!