Thanks for this post.
There are two senses in which control or alignment could be a number-go-up science:
1: Right now, we have a metric that we can optimize to direct our research.
2: At the point where we have systems in front of us that actively pose misalignment risk, we will have a metric that we can optimize.
I think you’re mostly talking about sense 1 here. The property of control that drew us to it is that it satisfies sense 2 much more than alignment research does; its advantage in sense 1 is smaller.
In order for 2 to go as well as possible, our research in the present should fill some combination of two roles:
1. We try to improve our techniques, using a proxy for the methodology we’ll have in the future.
2. We try to improve our methodology, so that we’ll be able to use it well at the point where we can directly assess risk.