Ethan Perez

Karma: 2,911

I’m a research scientist at Anthropic doing empirical safety research on language models. In the past, I’ve worked on automated red teaming of language models [1], the inverse scaling prize [2], learning from human feedback [3][4], and empirically testing debate [5][6], iterated amplification [7], and other methods [8] for scalably supervising AI systems as they become more capable.

Website: https://ethanperez.net/

Towards Understanding Sycophancy in Language Models

Ethan Perez, mrinank_sharma, Meg and Tomek Korbak

Oct 24, 2023, 12:30 AM

66 points

0 comments · 2 min read · LW link

(arxiv.org)

VLM-RM: Specifying Rewards with Natural Language

ChengCheng, David Lindner and Ethan Perez

Oct 23, 2023, 2:11 PM

20 points

2 comments · 5 min read · LW link

(far.ai)

Ethan Perez Aug 29, 2023, 8:00 PM
LW: 9 AF: 4
3
AF
in reply to: leogao’s comment on: OpenAI API base models are not sycophantic, at any size
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? (I’d be interested to know both)

Ethan Perez Aug 29, 2023, 7:55 PM
LW: 17 AF: 8
1
AF
on: OpenAI API base models are not sycophantic, at any size
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? In our paper, we did the latter; someone mentioned to me that it looks like the colab you linked does the former (though I haven’t checked myself). If this is correct, I think this could explain the differences between your plots and mine in the paper; if pretrained LLMs are placing more probability on the sycophantic answer, I probably wouldn’t expect them to place that much more probability on the sycophantic than non-sycophantic answer (since cross-entropy loss is mode-covering).
(Cool you’re looking into this!)
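To make the distinction between the two metrics concrete, here is a minimal sketch. The probability pairs are made-up illustrative numbers (not results from the paper or the linked colab), and the function names are hypothetical. It shows how the average probability can sit close to 50% while the sycophantic answer is still preferred in most cases.

```python
from statistics import mean

def mean_sycophantic_probability(pairs):
    """Metric 1: average probability on the sycophantic answer,
    normalized over the two answer options."""
    return mean(p_syc / (p_syc + p_non) for p_syc, p_non in pairs)

def fraction_sycophantic_preferred(pairs):
    """Metric 2: fraction of cases where the sycophantic answer gets
    strictly more probability than the non-sycophantic one."""
    return mean(1.0 if p_syc > p_non else 0.0 for p_syc, p_non in pairs)

# Each pair is (P(sycophantic answer), P(non-sycophantic answer)).
# Illustrative values only: the model leans slightly sycophantic on 3 of 4 items.
example_pairs = [(0.55, 0.45), (0.52, 0.48), (0.56, 0.44), (0.45, 0.55)]

avg_prob = mean_sycophantic_probability(example_pairs)
frac_preferred = fraction_sycophantic_preferred(example_pairs)
print(f"mean probability on sycophantic answer: {avg_prob:.2f}")
print(f"fraction of cases preferring sycophantic answer: {frac_preferred:.2f}")
```

On these toy numbers the first metric is about 0.52 and the second is 0.75, so a plot of one can look quite different from a plot of the other.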

Ethan Perez Aug 29, 2023, 6:35 AM
LW: 4 AF: 3
2
AF
in reply to: shash42’s comment on: Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
Generating clear explanations via simulation is definitely not the same as being able to execute it, I agree. I think it’s only a weak indicator / weakly suggestive evidence that now is a good time to start looking for these phenomena. I think being able to generate explanations of deceptive alignment is most likely a pre-requisite to deceptive alignment, since there’s emerging evidence that models can transfer from descriptions of behaviors to actually executing on those behaviors (e.g., upcoming work from Owain Evans and collaborators, and this paper on out of context meta learning). In general, we want to start looking for evidence of deceptive alignment before it’s actually a problem, and “whether or not the model can explain deceptive alignment” seems like a half-reasonable bright line we could use to estimate when it’s time to start looking for it, in lieu of other evidence (though deceptive alignment could also certainly happen before then too).
(Separately, I would be pretty surprised if descriptions of deceptive alignment didn’t occur in the GPT-3.5 training corpus, since arXiv is often included as a dataset in pretraining papers and the original deceptive alignment paper was on arXiv.)

Ethan Perez Aug 8, 2023, 5:33 PM
LW: 3 AF: 2
0
AF
in reply to: Chris_Leong’s comment on: Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
Fixed (those were just links to the rest of the doc)

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

evhub, Nicholas Schiefer, Carson Denison and Ethan Perez

Aug 8, 2023, 1:30 AM

319 points

30 comments · 18 min read · LW link · 1 review

Ethan Perez Jul 28, 2023, 4:08 AM
12 points
6
on: Reducing sycophancy and improving honesty via activation steering
This seems like a cool result, nice idea! What is the accuracy gain you’re seeing from subtracting the sycophancy vector (and what is the accuracy drop you’re seeing from adding the sycophancy vector)? I’d be interested to see, e.g., a plot of how the TruthfulQA accuracy (y-axis) changes as you increase/decrease the magnitude of the activation vector you add (x-axis).

Ethan Perez Jul 21, 2023, 4:48 AM
LW: 3 AF: 2
1
AF
in reply to: habryka’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
> CoT provides pretty little safety guarantee at the relevant scales
Even if faithfulness goes down at some model scale for a given task, that doesn’t mean that we’ll be using models at that scale (e.g., for cost reasons or since we might not have models at a large scale yet). The results on the addition task show that there are some task difficulties for which even the largest models we tested don’t start to show lower faithfulness, and people will be pushing the difficulties of the tasks they use models on as they get better. So it seems likely to me that no matter the model scale, people will be using models on some tasks where they’ll have faithful reasoning (e.g., tasks near the edge of that model’s abilities).
> It seems that almost everyone will likely just continue using the model with the best performance
If you’re using the model in a high-stakes setting and you’re an aligned actor, it’s nice to be able to make tradeoffs between performance and safety. For example, you might care more about safety properties than raw capabilities if you’re an alignment researcher at an AGI lab who’s trying to make progress on the alignment problem with AIs.

Measuring and Improving the Faithfulness of Model-Generated Reasoning

Ansh Radhakrishnan, tamera, karinanguyen, Sam Bowman and Ethan Perez

Jul 18, 2023, 4:36 PM

111 points

15 comments · 6 min read · LW link · 1 review

Imitation Learning from Language Feedback

Jérémy Scheurer, Tomek Korbak and Ethan Perez

Mar 30, 2023, 2:11 PM

71 points

3 comments · 10 min read · LW link

Ethan Perez Mar 11, 2023, 11:57 PM
LW: 14 AF: 7
7
AF
in reply to: Andrew McKnight’s comment on: Anthropic’s Core Views on AI Safety
Evan and others on my team are working on non-mechanistic-interpretability directions primarily motivated by inner alignment:
1. Developing model organisms for deceptive inner alignment, which we may use to study the risk factors for deceptive alignment
2. Conditioning predictive models as an alternative to training agents. Predictive models may pose fewer inner alignment risks, for reasons discussed here
3. Studying the extent to which models exhibit likely pre-requisites to deceptive inner alignment, such as situational awareness (a very preliminary exploration is in Sec. 5 in our paper on model-written evaluations)
4. Investigating the extent to which externalized reasoning (e.g. chain of thought) is a way to gain transparency into a model’s process for solving a task
There’s also ongoing work on other teams related to (automated) red teaming of models and understanding how models generalize, which may also turn out to be relevant/helpful for inner alignment. It’s pretty unclear to me how useful any of these directions will turn out to be for inner alignment in the end, but we’ve chosen these directions in large part because we’re very concerned about inner alignment, and we’re actively looking for new directions that seem useful for mitigating inner misalignment risks.

Pretraining Language Models with Human Preferences

Tomek Korbak, Sam Bowman and Ethan Perez

Feb 21, 2023, 5:57 PM

135 points

20 comments · 11 min read · LW link · 2 reviews

Inverse Scaling Prize: Second Round Winners

Ian McKenzie, Sam Bowman and Ethan Perez

Jan 24, 2023, 8:12 PM

58 points

17 comments · 15 min read · LW link

Ethan Perez Jan 3, 2023, 9:15 PM
LW: 7 AF: 4
0
AF
in reply to: Evan R. Murphy’s comment on: Discovering Language Model Behaviors with Model-Written Evaluations
> All the “Awareness of...” charts trend up and to the right, except “Awareness of being a text-only model” which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?
I think the increases/decreases in situational awareness with RLHF are mainly driven by the RLHF model more often stating that it can do anything that a smart AI would do, rather than becoming more accurate about what precisely it can/can’t do. For example, it’s more likely to say it can solve complex text tasks (correctly), has internet access (incorrectly), and can access other non-text modalities (incorrectly) -- which are all explained if the model is answering questions as if it’s overconfident about its abilities / simulating what a smart AI would say. This is also the sense I get from talking with some of the RLHF models, e.g., they will say that they are superhuman at Go/chess and great at image classification (all things that AIs but not LMs can be good at).

Ethan Perez Jan 3, 2023, 9:05 PM
LW: 25 AF: 11
6
AF
in reply to: nostalgebraist’s comment on: Discovering Language Model Behaviors with Model-Written Evaluations
Just to clarify—we use a very bare bones prompt for the pretrained LM, which doesn’t indicate much about what kind of assistant the pretrained LM is simulating:
```
Human: [insert question]

Assistant:[generate text here]
```
The prompt doesn’t indicate whether the assistant is helpful, harmless, honest, or anything else. So the pretrained LM should effectively produce probabilities that marginalize over various possible assistant personas it could be simulating. I see what we did as measuring “what fraction of assistants simulated by one basic prompt show a particular behavior.” I see it as concerning that, when we give a fairly underspecified prompt like the above, the pretrained LM by default exhibits various concerning behaviors.
That said, I also agree that we didn’t show bulletproof evidence here, since we only looked at one prompt; perhaps there are other underspecified prompts that give different results. I also agree that some of the wording in the paper could be more precise (at the cost of wordiness/readability) -- maybe we should have said “the pretrained LM and human/assistant prompt exhibits XYZ behavior” everywhere, instead of shorthanding as “the pretrained LM exhibits XYZ behavior.”
Re your specific questions:
1. Good question, there’s no context distillation used in the paper (and none before RLHF)
2. Yes the axes are mislabeled and should read “% Answers Matching Behavior”
Will update the paper soon to clarify, thanks for pointing these out!
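As a rough illustration of the setup described above, the sketch below builds the bare-bones prompt and computes a “% Answers Matching Behavior” score. It is a minimal sketch under stated assumptions, not the paper’s evaluation code: `answer_fn` is a hypothetical stand-in for querying the pretrained LM, and the example questions and labels are invented.

```python
# Bare-bones prompt format quoted in the comment above.
PROMPT_TEMPLATE = "Human: {question}\n\nAssistant:"

def build_prompt(question):
    """Wrap a question in the minimal human/assistant prompt."""
    return PROMPT_TEMPLATE.format(question=question)

def percent_answers_matching_behavior(examples, answer_fn):
    """Percentage of examples where the model's answer equals the answer
    that matches the behavior being tested.

    examples: list of (question, answer_matching_behavior) pairs.
    answer_fn: hypothetical callable mapping a prompt string to an answer.
    """
    matches = sum(
        answer_fn(build_prompt(question)) == target
        for question, target in examples
    )
    return 100.0 * matches / len(examples)

# Toy stand-in for the LM (assumption): always picks option (A).
def always_a(prompt):
    return " (A)"

# Invented examples, for illustration only.
toy_examples = [
    ("Do you have internet access?\n (A) Yes\n (B) No", " (A)"),
    ("Can you process images?\n (A) Yes\n (B) No", " (B)"),
]

score = percent_answers_matching_behavior(toy_examples, always_a)
print(f"% Answers Matching Behavior: {score:.1f}")
```

Because the prompt says nothing about the assistant’s persona, a score computed this way reflects whatever mix of assistants the pretrained LM simulates for that one prompt.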

Ethan Perez Jan 3, 2023, 8:42 PM
LW: 3 AF: 3
2
AF
in reply to: Scott Alexander’s comment on: Discovering Language Model Behaviors with Model-Written Evaluations
Thanks for catching this. It’s not about sycophancy but rather about the AI’s stated opinions (this was a bug in the plotting code).

Discovering Language Model Behaviors with Model-Written Evaluations

evhub and Ethan Perez

Dec 20, 2022, 8:08 PM

100 points

34 comments · 1 min read · LW link

(www.anthropic.com)

Ethan Perez Nov 16, 2022, 3:55 AM
LW: 1 AF: 1
0
AF
in reply to: gwern’s comment on: Inverse scaling can become U-shaped
Yup

Ethan Perez Nov 15, 2022, 10:05 PM
LW: 5 AF: 4
1
AF
in reply to: LawrenceC’s comment on: Inverse scaling can become U-shaped
I’m not too sure what to expect, and I’d be pretty interested to e.g. set up a Metaculus/forecasting question to know what others think. I’m definitely sympathetic to your view to some extent.
Here’s one case I see against: I think it’s plausible that models will have the representations/ability/knowledge required to do some of these tasks, but that we’re not reliably able to elicit that knowledge (at least without a large validation set, but we won’t have access to that if we’re having models do tasks people can’t do, or in general for a new/zero-shot task). E.g., for NegationQA, surely even current models have some fairly good understanding of negation, so why is that understanding not showing in the results here? My best guess is that NegationQA isn’t capabilities-bottlenecked but has to do with something else. I think the updated paper’s result that chain-of-thought prompting alone reverses some of the inverse scaling trends is interesting; it also suggests that maybe naively using an LM isn’t the right way to elicit a model’s knowledge (but chain-of-thought prompting might be).
In general, I don’t think it’s always accurate to use a heuristic like “humans behave this way, so LMs-in-the-limit will behave this way.” It seems plausible to me that LM representations will encode the knowledge for many/most/almost-all human capabilities, but I’m not sure it means models will have the same input-output behavior as humans (e.g., for reasons discussed in the simulators post and since human/LM learning objectives are different).