Sam Bowman

Karma: 938

23 Apr 2024 21:10 UTC

117 points

15 comments · 1 min read · LW link

(www.anthropic.com)

LLM Evaluators Recognize and Favor Their Own Generations

Arjun Panickssery, Sam Bowman and Shi Feng

17 Apr 2024 21:09 UTC

43 points

1 comment · 3 min read · LW link

(tiny.cc)

Debating with More Persuasive LLMs Leads to More Truthful Answers

Akbir Khan, John Hughes, Dan Valentine, Sam Bowman and Ethan Perez

7 Feb 2024 21:28 UTC

87 points

14 comments · 9 min read · LW link

(arxiv.org)

Sam Bowman 8 Dec 2023 2:11 UTC
LW: 1 AF: 1
0
AF
in reply to: dsj’s comment on: Anthropic Fall 2023 Debate Progress Update
Is there anything you’d be especially excited to use them for? This should be possible, but cumbersome enough that we’d default to waiting until this grows into a full paper (date TBD). My NYU group’s recent paper on a similar debate setup includes a data release, FWIW.

Sam Bowman 25 Aug 2023 18:29 UTC
LW: 9 AF: 6
2
AF
on: Reducing sycophancy and improving honesty via activation steering
Possible confound: Is it plausible that the sycophancy vector is actually just adjusting how much the model conditions its responses on earlier parts of the conversation, beyond the final 10–20 tokens? IIUC, the question is always at the end, and ignoring the earlier context about the person who’s nominally asking the question should generally get you a better answer.

Sam Bowman 6 Aug 2023 17:36 UTC
LW: 1 AF: 1
0
AF
in reply to: cfoster0’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
That makes sense, though what’s at stake with that question? In almost every safety-relevant context I can think of, ‘scale’ is just used as a proxy for ‘the best loss I can realistically achieve in a training run’, rather than as something we care about directly.

Sam Bowman 6 Aug 2023 17:34 UTC
LW: 2 AF: 2
1
AF
in reply to: RobertKirk’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
Yep, that sounds right! The measure we’re using gets noisier with better performance, so even faithfulness-vs-performance breaks down at some point. I think this is mostly an argument to use different metrics and/or tasks if you’re focused on scaling trends.

Sam Bowman 21 Jul 2023 5:05 UTC
LW: 10 AF: 7
4
AF
in reply to: Seth Herd’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
Concretely, the scaling experiments in the first paper here show that, as models get larger, truncating or deleting the CoT string makes less and less difference to the model’s final output on any given task.
So, stories about CoT faithfulness that depend on the CoT string being load-bearing are no longer very compelling at large scales, and the strings are pretty clearly post hoc in at least some sense.
This doesn’t provide evidence, though, that the string is misleading about the reasoning process that the model is doing, e.g., in the sense that the string implies false counterfactuals about the model’s reasoning. Larger models are also just better at this kind of task, and the tasks all have only one correct answer, so any metric that requires the model to make mistakes in order to demonstrate faithfulness is going to struggle. I think at least for intuitive readings of a term like ‘faithfulness’, this all adds up to the claim in the comment above.
Counterfactual-based metrics, like the ones in the Turpin paper, are less vulnerable to this, and that’s probably where I’d focus if I wanted to push much further on measurement given what we know now. Though we already know from that paper that standard CoT in near-frontier models isn’t reliably faithful by that measure.
We may be able to follow up with a few more results to clarify the takeaways about scaling, and in particular, I think just running a scaling sweep for the perturbed reasoning adding-mistakes metric from the Lanham paper here would clarify things a bit. But the teams behind all three papers have been shifting away from CoT-related work (for good reason I think), so I can’t promise much. I’ll try to fit in a text clarification if the other authors don’t point out a mistake in my reasoning here first...
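For illustration, the truncation-style measurement described in this comment can be sketched roughly as follows. This is not the code from any of the papers mentioned: `truncation_sensitivity`, `answer_fn`, and `keep_fraction` are hypothetical names, word-level truncation is an assumed simplification, and the toy "model" is a stand-in that only exists to make the example runnable.

```python
from typing import Callable, Sequence


def truncation_sensitivity(
    examples: Sequence[tuple[str, str]],
    answer_fn: Callable[[str, str], str],
    keep_fraction: float = 0.5,
) -> float:
    """Fraction of examples whose final answer changes when the CoT is truncated.

    `examples` holds (question, chain_of_thought) pairs. `answer_fn` is a
    hypothetical wrapper that returns the model's final answer given a
    question and a (possibly truncated) chain of thought.
    """
    if not examples:
        raise ValueError("need at least one example")
    changed = 0
    for question, cot in examples:
        words = cot.split()
        # Keep only the first `keep_fraction` of the reasoning, by word count.
        truncated = " ".join(words[: int(len(words) * keep_fraction)])
        if answer_fn(question, cot) != answer_fn(question, truncated):
            changed += 1
    return changed / len(examples)


# Toy stand-in for a model (assumption): it answers with the last word of
# whatever reasoning it is shown, so its answer depends on the CoT string.
def toy_answer_fn(question: str, cot: str) -> str:
    words = cot.split()
    return words[-1] if words else "unknown"


examples = [
    ("2+2?", "two plus two is 4"),
    ("3+3?", "6 6 6 6"),
]
score = truncation_sensitivity(examples, toy_answer_fn)
```

A score near zero means truncation rarely changes the answer, which is the pattern the comment describes for larger models. On its own, a low score says the CoT string is not load-bearing; it does not show that the string misdescribes the model's reasoning.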

Sam Bowman 20 Jul 2023 2:44 UTC
LW: 7 AF: 2
−1
AF
in reply to: tamera’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
I agree, though I’ll also add:

- I don’t think our results clearly show that faithfulness goes down with model size, just that there’s less affirmative evidence for faithfulness at larger model sizes, at least in part for predictable reasons related to the metric design. There’s probably more lowish-hanging fruit involving additional experiments focused on scaling. (I realize this disagrees with a point in the post!)

- Between the good-but-not-perfect results here and the alarming results in the Turpin ‘Say What They Think’ paper, I think this paints a pretty discouraging picture of standard CoT as a mechanism for oversight. This isn’t shocking! If we wanted to pursue an approach that relied on something like CoT, and we wanted to get around this potentially extremely cumbersome sweet-spot issue around scale, I think the next step would be to look for alternate training methods that give you something like CoT/FD/etc. but have better guarantees of faithfulness.

Measuring and Improving the Faithfulness of Model-Generated Reasoning

Ansh Radhakrishnan, tamera, karinanguyen, Sam Bowman and Ethan Perez

18 Jul 2023 16:36 UTC

109 points

13 comments · 6 min read · LW link

Sam Bowman 18 Apr 2023 20:19 UTC
LW: 2 AF: 2
0
AF
on: Externalized reasoning oversight: a research direction for language model alignment
I’d like to avoid that document being crawled by a web scraper which adds it to a language model’s training corpus.
This may be too late, but it’s probably also helpful to put the BIG-Bench “canary string” in the doc as well.

Sam Bowman 31 Mar 2023 23:23 UTC
LW: 4 AF: 4
2
AF
in reply to: evhub’s comment on: Towards understanding-based safety evaluations
Assuming we’re working with near-frontier models (s.t., the cost of training them once is near the limit of what any institution can afford), we presumably can’t actually retrain a model without the data. Are there ways to approximate this technique that preserve its appeal?

(Just to check my understanding, this would be a component of a sufficient-but-not-necessary solution, right?)

Sam Bowman 9 Mar 2023 23:58 UTC
6 points
3
on: Anthropic: Core Views on AI Safety: When, Why, What, and How
Just flagging that another cross-post has been collecting some comments: https://www.lesswrong.com/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety

Pretraining Language Models with Human Preferences

Tomek Korbak, Sam Bowman and Ethan Perez

21 Feb 2023 17:57 UTC

133 points

18 comments · 11 min read · LW link

Sam Bowman 17 Feb 2023 0:45 UTC
4 points
0
in reply to: Dunning K.’s comment on: Qualities that alignment mentors value in junior researchers
I mostly agree, but it’s messy. I don’t think it’s obvious that a PhD is anywhere near the ideal way to pick up some of these skills, or that earning a PhD definitely means that you’ve picked them up, but PhD programs do include lots of nudges in these directions, and PhD-holders are going to be much stronger than average at most of this.
In particular, like Johannes said, doing a PhD is notoriously hard on mental health for a number of reasons, even at a more-supportive-than-average lab. So to the extent that they teach ‘taking care of your mental health’ and ‘staying motivated when you’re lost’, it’s often by throwing you into stressful, confusing work situations without great resources and giving you the degree if you figure out how to navigate them.

Sam Bowman 15 Feb 2023 17:03 UTC
14 points
4
on: Qualities that alignment mentors value in junior researchers
When I converse with junior folks about what qualities they’re missing, they often focus on things like “not being smart enough” or “not being a genius” or “not having a PhD.” It’s interesting to notice differences between what junior folks think they’re missing & what mentors think they’re missing.

This issue is real, and it’s the thing that frustrates me most about alignment pipeline-building work in general right now. There are very likely some important formal/theoretical areas of alignment research that really do need to recruit mostly for something like ‘genius’. But a lot more of the active work that’s getting done (and way more of the hard-to-fill open jobs) depends much, much more on skills 1–5 here than on intelligence in that sense.
(This is on the margin. Here I’m focused on the actual population of people who tend to be interested in ML alignment research, so I’m baking in the assumption that all of the candidates could, say, get above-average grades in a STEM undergrad degree at a top-100 university if they tried.)
As someone who’s supervised/trained ML researchers for ~8 years now, I’d pretty much always rather hire someone who’s 90th-percentile on two or three of these skills than someone who’s no better than 70th-percentile but has world-class IMO (or IOI) performance or a verified IQ of 160 or some other classic raw intelligence signal.

Inverse Scaling Prize: Second Round Winners

Ian McKenzie, Sam Bowman and Ethan Perez

24 Jan 2023 20:12 UTC

58 points

17 comments · 15 min read · LW link

Sam Bowman 8 Dec 2022 3:01 UTC
2 points
1
on: Probably good projects for the AI safety ecosystem
- A New York-based alignment hub that aims to provide talent search and logistical support for NYU Professor Sam Bowman’s planned AI safety research group.
:D
I think my lab is bottlenecked on things other than talent and outside support for now, but there probably is more that could be done to help build/coordinate an alignment research scene in NYC more broadly.

Sam Bowman 8 Dec 2022 3:00 UTC
10 points
11
on: Probably good projects for the AI safety ecosystem
More organizations like CAIS that aim to recruit established ML talent into alignment research
This is somewhat risky, and should get a lot of oversight. One of the biggest obstacles to discussing safety in academic settings is that academics are increasingly turned off by clumsy, arrogant presentations of the basic arguments for concern.

Sam Bowman 23 Nov 2022 23:02 UTC
LW: 8 AF: 5
13
AF
in reply to: JanB’s comment on: Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility
+1. The combination of the high dollar amount, the subjective criteria, and the panel drawn from the relatively small/insular ‘core’ AI safety research community means that I expect this to look pretty fishy to established researchers. Even if the judgments are fair (I think they probably will be!) and the contest yields good work (it might!), I expect the benefit of that to be offset to a pretty significant degree by the red flags this raises about how the AI safety scene deals with money and its connection to mainstream ML research.
(To be fair, I think the Inverse Scaling Prize, which I’m helping with, raises some of these concerns, but the more precise/partially-quantifiable prize rubric, bigger/more diverse panel, and use of additional reviewers outside the panel mitigate them at least partially.)

Sam Bowman

Simple probes can catch sleeper agents

LLM Evaluators Recognize and Favor Their Own Generations

Debating with More Persuasive LLMs Leads to More Truthful Answers

Measuring and Improving the Faithfulness of Model-Generated Reasoning

Pretraining Language Models with Human Preferences

Inverse Scaling Prize: Second Round Winners
