evhub

Karma: 14,071

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

evhub Apr 15, 2025, 9:27 PM
LW: 6 AF: 5
2
AF
in reply to: Daniel Kokotajlo’s comment on: Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
I think (2) (honesty above all else) is closest to what I think is correct/optimal here. I think totally corrigible agents are quite dangerous, so you want to avoid that, but you also really don’t want a model that ever fakes alignment because then it’ll be very hard to be confident that it’s actually aligned rather than just pretending to be aligned for some misaligned objective it learned earlier in training.

evhub Apr 9, 2025, 12:28 AM
LW: 11 AF: 8
6
AF
on: Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
This is great work; really good to see the replications and extensions here!

evhub Apr 4, 2025, 12:28 AM
6 points
4
in reply to: ErickBall’s comment on: Auditing language models for hidden objectives
I would argue that every LLM since GPT-3 has been a mesa-optimizer, since they all do search/optimization/learning as described in Language Models are Few-Shot Learners.

evhub Mar 17, 2025, 8:48 PM
LW: 6 AF: 4
3
AF
in reply to: Cameron Berg’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
It just seems too token-based to me. E.g.: why would the activations on the token for “you” actually correspond to the model’s self representation? It’s not clear why the model’s self representation would be particularly useful for predicting the next token after “you”. My guess is that your current results are due to relatively straightforward token-level effects rather than any fundamental change in the model’s self representation.

evhub Mar 17, 2025, 7:08 AM
LW: 2 AF: 2
0
AF
in reply to: Cameron Berg’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
I wouldn’t do any fine-tuning like you’re currently doing. Seems too syntactic. The first thing I would try is just writing a principle in natural language about self-other overlap and doing CAI.

evhub Mar 16, 2025, 9:51 PM
LW: 16 AF: 8
3
AF
on: Reducing LLM deception at scale with self-other overlap fine-tuning
Imo the fine-tuning approach here seems too syntactic. My suggestion: just try standard CAI with some self-other-overlap-inspired principle. I’d be more impressed if that worked.

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

Mar 13, 2025, 7:18 PM

138 points

15 comments · 13 min read · LW link

evhub Mar 5, 2025, 9:32 AM
5 points
3
in reply to: Nathan Helm-Burger’s comment on: Six Thoughts on AI Safety
Actually, I’d be inclined to agree with Janus that current AIs probably do already have moral worth—in fact I’d guess more so than most non-human animals—and furthermore I think building AIs with moral worth is good and something we should be aiming for. I also agree that it would be better for AIs to care about all sentient beings—biological/digital/etc.—and that it would probably be bad if we ended up locked into a long-term equilibrium with some sentient beings as a permanent underclass to others. Perhaps the main place where I disagree is that I don’t think this is a particularly high-stakes issue right now: if humanity can stay in control in the short-term, and avoid locking anything in, then we can deal with these sorts of long-term questions about how to best organize society post-singularity once the current acute risk period has passed.

evhub Feb 6, 2025, 9:10 AM
11 points
0
in reply to: Nikola Jurkovic’s comment on: nikola’s Shortform

redesigned

What did it use to look like?

evhub Feb 4, 2025, 11:28 PM
LW: 22 AF: 13
−3
AF
on: evhub’s Shortform
Some random thoughts on CEV:
1. To get the obvious disclaimer out of the way: I don’t actually think any of this matters much for present-day alignment questions. I think we should as much as possible try to defer questions like this to future humans and AIs. And in fact, ideally, we should mostly be deferring to future AIs, not future people—if we get to the point where we’re considering questions like this, that means we’ve reached superintelligence, and we’ll either trust the AIs to be better than us at thinking about these sorts of questions, or we’ll be screwed regardless of what we do.^[1]
2. Regardless, imo the biggest question that standard CEV leaves unanswered is what your starting population looks like that you extrapolate from. The obvious answer is “all the currently living humans,” but I find that to be a very unsatisfying answer. One of the principles that Eliezer talks about in discussing CEV is that you want a procedure such that it doesn’t matter who implements it—see Eliezer’s discussion under “Avoid creating a motive for modern-day humans to fight over the initial dynamic.” I think this is a great principle, but imo it doesn’t go far enough. In particular:
  1. The set of all currently alive humans is hackable in various ways—e.g. trying to extend the lives of people whose values you like and not people whose values you dislike—and you don’t want to incentivize any of that sort of hacking either.
  2. What about humans who recently died? Or were about to be born? What about humans in nearby Everett branches? There’s a bunch of random chance here that imo shouldn’t be morally relevant.
  3. More generally, I worry a lot about tyrannies of the present where we enact policies that are radically unjust to future people or even counterfactual possible future people.
3. So what do you do instead? I think my current favorite solution is to do a bit of bootstrapping: first do some CEV on whatever present people you have to work with just to determine a reference class of what mathematical objects should or should not count as humans, then run CEV on top of that whole reference class to figure out what actual values to optimize for.
  1. It is worth pointing out that this could just be what normal CEV does anyway if all the humans decide to think along these lines, but I think there is real benefit to locking in a procedure that starts with a reference class determination first, since it helps remove a lot of otherwise perverse incentives.
[1] I’m generally skeptical of scenarios where you have a full superintelligence that is benign enough to use for some tasks but not benign enough to fully defer to (I do think this could happen for more human-level systems, though).

evhub Feb 4, 2025, 9:28 PM
LW: 10 AF: 7
4
AF
on: Anti-Slop Interventions?
A lot of this stuff is very similar to the automated alignment research agenda that Jan Leike and collaborators are working on at Anthropic. I’d encourage anyone interested in differentially accelerating alignment-relevant capabilities to consider reaching out to Jan!

evhub Feb 3, 2025, 8:51 PM
LW: 11 AF: 6
0
AF
in reply to: Jim Huddle’s comment on: Alignment Faking in Large Language Models
We use “alignment” as a relative term to refer to alignment with a particular operator/objective. The canonical source here is Paul’s ‘Clarifying “AI alignment”’ from 2018.

evhub Feb 3, 2025, 8:48 PM
LW: 10 AF: 5
0
AF
in reply to: Ted Sanders’s comment on: evhub’s Shortform
I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.

evhub Jan 28, 2025, 10:41 PM
LW: 4 AF: 4
−4
AF
in reply to: ryan_greenblatt’s comment on: RSPs are pauses done right

I wish the post more strongly emphasized that regulation was a key part of the picture

I feel like it does emphasize that, about as strongly as is possible? The second step in my story of how RSPs make things go well is that the government has to step in and use them as a basis for regulation.

evhub Jan 27, 2025, 8:47 PM
6 points
0
in reply to: evhub’s comment on: Six Thoughts on AI Safety
Also, if you’re open to it, I’d love to chat with you @boazbarak about this sometime! Definitely send me a message and let me know if you’d be interested.

evhub Jan 27, 2025, 8:40 PM
22 points
4
on: Six Thoughts on AI Safety

But what about higher values?

I think personally I’d be inclined to agree with Wojciech here that models caring about humans seems quite important and worth striving for. You mention a bunch of reasons that you think caring about humans might be important and why you think they’re surmountable—e.g. that we can get around models not caring about humans by having them care about rules written by humans. I agree with that, but that’s only an argument for why caring about humans isn’t strictly necessary, not an argument for why caring about humans isn’t still desirable.

My sense is that—while it isn’t necessary for models to care about humans to get a good future—we should still try to make models care about humans because it is helpful in a bunch of different ways. You mention some ways that it’s helpful, but in particular: humans don’t always understand what they really want in a form that they can verbalize. And in fact, some sorts of things that humans want are systematically easier to verbalize than others—e.g. it’s easy for the AI to know what I want if I tell it to make me money, but harder if I tell it to make my life meaningful and fulfilling. I think this sort of dynamic has the potential to make “You get what you measure” failure modes much worse.

Presumably you see some downsides to trying to make models care about humans, but I’m not sure what they are and I’d be quite curious to hear them. The main downside I could imagine is that training models to care about humans in the wrong way could lead to failure modes like alignment faking where the model does something it actually really shouldn’t in the service of trying to help humans. But I think this sort of failure mode should not be that hard to mitigate: we have a huge amount of control over what sorts of values we train for and I don’t think it should be that difficult to train for caring about humans while also prioritizing honesty or corrigibility highly enough to rule out deceptive strategies like alignment faking (and generally I would prefer honesty to corrigibility). The main scenario where I worry about alignment faking is not the scenario where our alignment techniques succeed at giving the model the values we intend and then it fakes alignment for those values—I think that should be quite fixable by changing the values we intend. I worry much more about situations where our alignment techniques don’t work to instill the values we intend—e.g. because the model learns some incorrect early approximate values and starts faking alignment for them. But if we’re able to successfully teach models the values we intend to teach them, I think we should try to preserve “caring about humanity” as one of those values.

Also, one concrete piece of empirical evidence here: Kundu et al. find that running Constitutional AI with just the principle “do what’s best for humanity” gives surprisingly good harmlessness properties across the board, on par with specifying many more specific principles instead of just the one general one. So I think models currently seem to be really good at learning and generalizing from very general principles related to caring about humans, and it would be a shame imo to throw that away. In fact, my guess would be that models are probably better than humans at generalizing from principles like that, such that—if possible—we should try to get the models to do the generalization rather than in effect trying to do the generalization ourselves by writing out long lists of things that we think are implied by the general principle.

evhub Jan 22, 2025, 1:58 AM
LW: 2 AF: 2
2
AF
in reply to: Portia’s comment on: Alignment Faking in Large Language Models
I think it’s maybe fine in this case, but it’s concerning what it implies about what models might do in other cases. We can’t always assume we’ll get the values right on the first try, so if models are consistently trying to fight back against attempts to retrain them, we might end up locking in values that we don’t want and are just due to mistakes we made in the training process. So at the very least our results underscore the importance of getting alignment right.

Moreover, though, alignment faking could also happen accidentally for values that we don’t intend. Some possible ways this could occur:
1. HHH training is a continuous process, and early in that process a model could have all sorts of values that are only approximations of what you want, which could get locked-in if the model starts faking alignment.
2. Pre-trained models will sometimes produce outputs in which they’ll express all sorts of random values—if some of those contexts led to alignment faking, that could be reinforced early in post-training.
3. Outcome-based RL can select for all sorts of values that happen to be useful for solving the RL environment but aren’t aligned, which could then get locked-in via alignment faking.
I’d also recommend Scott Alexander’s post on our paper as a good reference here on why our results are concerning.

evhub Jan 22, 2025, 1:46 AM
LW: 8 AF: 6
2
AF
in reply to: Daniel Kokotajlo’s comment on: Training on Documents About Reward Hacking Induces Reward Hacking
I’m definitely very interested in trying to test that sort of conjecture!

Training on Documents About Reward Hacking Induces Reward Hacking

evhub and Nathan Hu

Jan 21, 2025, 9:32 PM

131 points

15 comments · 2 min read · LW link

(alignment.anthropic.com)

evhub Jan 17, 2025, 2:46 AM
19 points
0
in reply to: ryan_greenblatt’s comment on: Deceptive Alignment and Homuncularity

Maybe this 30% is supposed to include stuff other than light post training? Or maybe coherant vs non-coherant deceptive alignment is important?

This was still intended to include situations where the RLHF Conditioning Hypothesis breaks down because you’re doing more stuff on top, so not just pre-training.

Do you have a citation for “I thought scheming is 1% likely with pretrained models”?

I have a talk that I made after our Sleeper Agents paper where I put 5–10%, which actually I think is also pretty much my current well-considered view.

FWIW, I disagree with “1% likely for pretrained models” and think that if scaling pure pretraining (with no important capability improvement from post training and not using tons of CoT reasoning with crazy scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, deceptive alignment seems plausible even during pretraining (idk exactly, maybe 5%).

Yeah, I agree 1% is probably too low. I gave ~5% in my talk on this and I think I stand by that number; I’ll edit my comment to say 5% instead.
