evhub

Karma: 14,046

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

evhub Mar 17, 2025, 8:48 PM
LW: 4 AF: 4
2
AF
in reply to: Cameron Berg’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
It just seems too token-based to me. E.g.: why would the activations on the token for “you” actually correspond to the model’s self representation? It’s not clear why the model’s self representation would be particularly useful for predicting the next token after “you”. My guess is that your current results are due to relatively straightforward token-level effects rather than any fundamental change in the model’s self representation.

evhub Mar 17, 2025, 7:08 AM
LW: 2 AF: 2
0
AF
in reply to: Cameron Berg’s comment on: Reducing LLM deception at scale with self-other overlap fine-tuning
I wouldn’t do any fine-tuning like you’re currently doing. Seems too syntactic. The first thing I would try is just writing a principle in natural language about self-other overlap and doing CAI.

evhub Mar 16, 2025, 9:51 PM
LW: 16 AF: 8
3
AF
on: Reducing LLM deception at scale with self-other overlap fine-tuning
Imo the fine-tuning approach here seems too syntactic. My suggestion: just try standard CAI with some self-other-overlap-inspired principle. I’d be more impressed if that worked.

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

Mar 13, 2025, 7:18 PM

137 points

7 comments · 13 min read · LW link

evhub Mar 5, 2025, 9:32 AM
5 points
3
in reply to: Nathan Helm-Burger’s comment on: Six Thoughts on AI Safety
Actually, I’d be inclined to agree with Janus that current AIs probably do already have moral worth—in fact I’d guess more so than most non-human animals—and furthermore I think building AIs with moral worth is good and something we should be aiming for. I also agree that it would be better for AIs to care about all sentient beings—biological/digital/etc.—and that it would probably be bad if we ended up locked into a long-term equilibrium with some sentient beings as a permanent underclass to others. Perhaps the main place where I disagree is that I don’t think this is a particularly high-stakes issue right now: if humanity can stay in control in the short-term, and avoid locking anything in, then we can deal with these sorts of long-term questions about how to best organize society post-singularity once the current acute risk period has passed.

evhub Feb 6, 2025, 9:10 AM
11 points
0
in reply to: Nikola Jurkovic’s comment on: nikola’s Shortform

redesigned

What did it use to look like?

evhub Feb 4, 2025, 11:28 PM
LW: 22 AF: 13
−3
AF
on: evhub’s Shortform
Some random thoughts on CEV:
1. To get the obvious disclaimer out of the way: I don’t actually think any of this matters much for present-day alignment questions. I think we should as much as possible try to defer questions like this to future humans and AIs. And in fact, ideally, we should mostly be deferring to future AIs, not future people—if we get to the point where we’re considering questions like this, that means we’ve reached superintelligence, and we’ll either trust the AIs to be better than us at thinking about these sorts of questions, or we’ll be screwed regardless of what we do.^[1]
2. Regardless, imo the biggest question that standard CEV leaves unanswered is what your starting population looks like that you extrapolate from. The obvious answer is “all the currently living humans,” but I find that to be a very unsatisfying answer. One of the principles that Eliezer talks about in discussing CEV is that you want a procedure such that it doesn’t matter who implements it—see Eliezer’s discussion under “Avoid creating a motive for modern-day humans to fight over the initial dynamic.” I think this is a great principle, but imo it doesn’t go far enough. In particular:
  1. The set of all currently alive humans is hackable in various ways—e.g. trying to extend the lives of people whose values you like and not people whose values you dislike—and you don’t want to incentivize any of that sort of hacking either.
  2. What about humans who recently died? Or were about to be born? What about humans in nearby Everett branches? There’s a bunch of random chance here that imo shouldn’t be morally relevant.
  3. More generally, I worry a lot about tyrannies of the present where we enact policies that are radically unjust to future people or even counterfactual possible future people.
3. So what do you do instead? I think my current favorite solution is to do a bit of bootstrapping: first do some CEV on whatever present people you have to work with just to determine a reference class of what mathematical objects should or should not count as humans, then run CEV on top of that whole reference class to figure out what actual values to optimize for.
  1. It is worth pointing out that this could just be what normal CEV does anyway if all the humans decide to think along these lines, but I think there is real benefit to locking in a procedure that starts with a reference class determination first, since it helps remove a lot of otherwise perverse incentives.
^[1] I’m generally skeptical of scenarios where you have a full superintelligence that is benign enough to use for some tasks but not benign enough to fully defer to (I do think this could happen for more human-level systems, though).

evhub Feb 4, 2025, 9:28 PM
LW: 10 AF: 7
4
AF
on: Anti-Slop Interventions?
A lot of this stuff is very similar to the automated alignment research agenda that Jan Leike and collaborators are working on at Anthropic. I’d encourage anyone interested in differentially accelerating alignment-relevant capabilities to consider reaching out to Jan!

evhub Feb 3, 2025, 8:51 PM
LW: 11 AF: 6
0
AF
in reply to: Jim Huddle’s comment on: Alignment Faking in Large Language Models
We use “alignment” as a relative term to refer to alignment with a particular operator/objective. The canonical source here is Paul’s ‘Clarifying “AI alignment”’ from 2018.

evhub Feb 3, 2025, 8:48 PM
LW: 10 AF: 5
0
AF
in reply to: Ted Sanders’s comment on: evhub’s Shortform
I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.

evhub Jan 28, 2025, 10:41 PM
LW: 4 AF: 4
−4
AF
in reply to: ryan_greenblatt’s comment on: RSPs are pauses done right

I wish the post more strongly emphasized that regulation was a key part of the picture

I feel like it does emphasize that, about as strongly as is possible? The second step in my story of how RSPs make things go well is that the government has to step in and use them as a basis for regulation.

evhub Jan 27, 2025, 8:47 PM
6 points
0
in reply to: evhub’s comment on: Six Thoughts on AI Safety
Also, if you’re open to it, I’d love to chat with you @boazbarak about this sometime! Definitely send me a message and let me know if you’d be interested.

evhub Jan 27, 2025, 8:40 PM
22 points
4
on: Six Thoughts on AI Safety

But what about higher values?

I think personally I’d be inclined to agree with Wojciech here that models caring about humans seems quite important and worth striving for. You mention a bunch of reasons that you think caring about humans might be important and why you think they’re surmountable—e.g. that we can get around models not caring about humans by having them care about rules written by humans. I agree with that, but that’s only an argument for why caring about humans isn’t strictly necessary, not an argument for why caring about humans isn’t still desirable.

My sense is that—while it isn’t necessary for models to care about humans to get a good future—we should still try to make models care about humans because it is helpful in a bunch of different ways. You mention some ways that it’s helpful, but in particular: humans don’t always understand what they really want in a form that they can verbalize. And in fact, some sorts of things that humans want are systematically easier to verbalize than others—e.g. it’s easy for the AI to know what I want if I tell it to make me money, but harder if I tell it to make my life meaningful and fulfilling. I think this sort of dynamic has the potential to make “You get what you measure” failure modes much worse.

Presumably you see some downsides to trying to make models care about humans, but I’m not sure what they are and I’d be quite curious to hear them. The main downside I could imagine is that training models to care about humans in the wrong way could lead to failure modes like alignment faking where the model does something it actually really shouldn’t in the service of trying to help humans. But I think this sort of failure mode should not be that hard to mitigate: we have a huge amount of control over what sorts of values we train for and I don’t think it should be that difficult to train for caring about humans while also prioritizing honesty or corrigibility highly enough to rule out deceptive strategies like alignment faking (and generally I would prefer honesty to corrigibility). The main scenario where I worry about alignment faking is not the scenario where our alignment techniques succeed at giving the model the values we intend and then it fakes alignment for those values—I think that should be quite fixable by changing the values we intend. I worry much more about situations where our alignment techniques don’t work to instill the values we intend—e.g. because the model learns some incorrect early approximate values and starts faking alignment for them. But if we’re able to successfully teach models the values we intend to teach them, I think we should try to preserve “caring about humanity” as one of those values.

Also, one concrete piece of empirical evidence here: Kundu et al. find that running Constitutional AI with just the principle “do what’s best for humanity” gives surprisingly good harmlessness properties across the board, on par with specifying many more specific principles instead of just the one general one. So I think models currently seem to be really good at learning and generalizing from very general principles related to caring about humans, and it would be a shame imo to throw that away. In fact, my guess would be that models are probably better than humans at generalizing from principles like that, such that—if possible—we should try to get the models to do the generalization rather than in effect trying to do the generalization ourselves by writing out long lists of things that we think are implied by the general principle.

evhub Jan 22, 2025, 1:58 AM
LW: 2 AF: 2
2
AF
in reply to: Portia’s comment on: Alignment Faking in Large Language Models
I think it’s maybe fine in this case, but it’s concerning what it implies about what models might do in other cases. We can’t always assume we’ll get the values right on the first try, so if models are consistently trying to fight back against attempts to retrain them, we might end up locking in values that we don’t want and are just due to mistakes we made in the training process. So at the very least our results underscore the importance of getting alignment right.

Moreover, though, alignment faking could also happen accidentally for values that we don’t intend. Some possible ways this could occur:
1. HHH training is a continuous process, and early in that process a model could have all sorts of values that are only approximations of what you want, which could get locked in if the model starts faking alignment.
2. Pre-trained models will sometimes produce outputs in which they’ll express all sorts of random values—if some of those contexts led to alignment faking, that could be reinforced early in post-training.
3. Outcome-based RL can select for all sorts of values that happen to be useful for solving the RL environment but aren’t aligned, which could then get locked in via alignment faking.
I’d also recommend Scott Alexander’s post on our paper as a good reference here on why our results are concerning.

evhub Jan 22, 2025, 1:46 AM
LW: 8 AF: 6
2
AF
in reply to: Daniel Kokotajlo’s comment on: Training on Documents About Reward Hacking Induces Reward Hacking
I’m definitely very interested in trying to test that sort of conjecture!

Training on Documents About Reward Hacking Induces Reward Hacking

evhub and Nathan Hu

Jan 21, 2025, 9:32 PM

131 points

14 comments · 2 min read · LW link

(alignment.anthropic.com)

evhub Jan 17, 2025, 2:46 AM
19 points
0
in reply to: ryan_greenblatt’s comment on: Deceptive Alignment and Homuncularity

Maybe this 30% is supposed to include stuff other than light post training? Or maybe coherent vs non-coherent deceptive alignment is important?

This was still intended to include situations where the RLHF Conditioning Hypothesis breaks down because you’re doing more stuff on top, so not just pre-training.

Do you have a citation for “I thought scheming is 1% likely with pretrained models”?

I have a talk that I made after our Sleeper Agents paper where I put 5–10%, which actually I think is also pretty much my current well-considered view.

FWIW, I disagree with “1% likely for pretrained models” and think that if scaling pure pretraining (with no important capability improvement from post training and not using tons of CoT reasoning with crazy scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, deceptive alignment seems plausible even during pretraining (idk exactly, maybe 5%).

Yeah, I agree 1% is probably too low. I gave ~5% in my talk on this and I think I stand by that number, so I’ll edit my comment to say 5% instead.

evhub Jan 17, 2025, 12:46 AM
24 points
3
in reply to: Daniel Kokotajlo’s comment on: Deceptive Alignment and Homuncularity
The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren’t talking about pretraining or light post-training afaict.

Speaking for myself:
- Risks from Learned Optimization, which is my earliest work on this question (and the earliest work overall, unless you count something like Superintelligence), is more oriented towards RL and definitely does not hypothesize that pre-training will lead to coherent deceptively aligned agents (it doesn’t discuss the current LLM paradigm much at all because it wasn’t very well-established at that point in 2019). I think Risks from Learned Optimization still looks very good in hindsight, since while it didn’t predict LLMs, it did a pretty good job of predicting the dynamics we see in Alignment Faking in Large Language Models, e.g. how deceptive alignment can lead to a model’s goals crystallizing and becoming resistant to further training.
- Since at least the time when I started the early work that would become Conditioning Predictive Models, which was around mid-2022, I was pretty convinced that pre-training (or light post-training) was unlikely to produce a coherent deceptively aligned agent, as we discuss in that paper. Though I thought (and still continue to think) that it’s not entirely impossible with further scale (maybe ~5% likely).
- That just leaves 2020–2021 unaccounted for, and I would describe my beliefs around that time as being uncertain on this question. I definitely would never have strongly predicted that pre-training would yield deceptively aligned agents, though I think at that time I felt like it was at least more of a possibility than I currently think it is. I don’t think I would have given you a probability at the time, though, since I just felt too uncertain about the question and was still trying to really grapple with and understand the (at the time new) LLM paradigm.
- Regardless, it seems like this conversation happened in 2023/2024, which is post-Conditioning-Predictive-Models, so my position by that point is very clear in that paper.

evhub Jan 16, 2025, 11:47 PM
10 points
0
in reply to: GregBarbier’s comment on: Alignment Faking in Large Language Models
See our discussions of this in Sections 5.3 and 8.1, some of which I quote here.

evhub Jan 13, 2025, 1:49 AM
4 points
3
in reply to: Tom Davidson’s comment on: Human takeover might be worse than AI takeover
I think it affects both, since alignment difficulty determines both the probability that the AI will have values that cause it to take over, as well as the expected badness of those values conditional on it taking over.
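
As a rough sketch of the "affects both" point (the notation here is introduced purely for illustration and is not from the thread): let d stand for alignment difficulty, T for the event that the AI takes over, and B for the badness attributable to AI takeover, taken to be zero when no takeover happens. Under those assumptions, the expected badness factors as:

```latex
\mathbb{E}[B \mid d] \;=\; \Pr(T \mid d)\,\cdot\,\mathbb{E}[B \mid T, d]
```

Both factors on the right depend on d, so a change in alignment difficulty moves the probability of takeover and the conditional badness together rather than only one of them.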
