Buck

Karma: 10,839

CEO at Redwood Research.

AI safety is a highly collaborative field—almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I’m saying this here because it would feel repetitive to say “these ideas were developed in collaboration with various people” in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Buck Apr 11, 2025, 2:00 PM
LW: 2 AF: 2
2
AF
on: Notes on countermeasures for exploration hacking (aka sandbagging)
I think we should probably say that exploration hacking is a strategy for sandbagging, rather than using them as synonyms.

Buck Apr 5, 2025, 7:02 PM
6 points
4
in reply to: Garrett Baker’s comment on: How much progress actually happens in theoretical physics?
Isn’t the answer that the low hanging fruit of explaining unexplained observations has been picked?

Buck Apr 5, 2025, 3:43 PM
40 points
2
on: Buck’s Shortform
I appeared on the 80,000 Hours podcast. I discussed a bunch of points on misalignment risk and AI control that I don’t think I’ve heard discussed publicly before.
Transcript + links + summary here; it’s also available as a podcast in many places.

Buck Apr 2, 2025, 7:59 PM
4 points
2
in reply to: avturchin’s comment on: Consider showering
I love that I can guess the infohazard from the comment

Buck Apr 2, 2025, 12:39 AM
LW: 28 AF: 14
0
AF
in reply to: Thomas Kwa’s comment on: Thomas Kwa’s Shortform
A few months ago, I accidentally used France as an example of a small country that it wouldn’t be that catastrophic for AIs to take over, while giving a talk in France 😬

Buck Apr 1, 2025, 12:05 AM
4 points
0
in reply to: Richard_Ngo’s comment on: Third-wave AI safety needs sociopolitical thinking
No problem, my comment was pretty unclear and I can see from the other comments why you’d be on edge!

Buck Mar 31, 2025, 8:46 PM
3 points
0
in reply to: ank’s comment on: ank’s Shortform
It seems extremely difficult to make a blacklist of models in a way that isn’t trivially breakable. (E.g. what’s supposed to happen when someone adds a tiny amount of noise to the weights of a blacklisted model, or rotates them along a gauge invariance?)

Buck Mar 31, 2025, 7:49 PM
9 points
0
in reply to: Richard_Ngo’s comment on: Third-wave AI safety needs sociopolitical thinking
I agree that this isn’t what I’d call “direct written evidence”; I was just (somewhat jokingly) making the point that the linked articles are Bayesian evidence that Musk tries to censor, and that the articles are pieces of text.

Buck Mar 31, 2025, 5:24 PM
1 point
0
in reply to: Richard_Ngo’s comment on: Third-wave AI safety needs sociopolitical thinking
It is definitely evidence that was literally written

Buck Mar 31, 2025, 4:32 PM
6 points
2
in reply to: MichaelDickens’s comment on: Why do many people who care about AI Safety not clearly endorse PauseAI?
I disagree that you have to believe those four things in order to believe what I said. I believe some of those and find others too ambiguously phrased to evaluate.
Re your model: I think your model is basically just: if we race, we go from 70% chance that US “wins” to a 75% chance the US wins, and we go from a 50% chance of “solving alignment” to a 25% chance? Idk how to apply that here: isn’t your squiggle model talking about whether racing is good, rather than whether unilaterally pausing is good? Maybe you’re using “race” to mean “not pause” and “not race” to mean “pause”; if so, that’s super confusing terminology. If we unilaterally paused indefinitely, surely we’d have less than 70% chance of winning.
In general, I think you’re modeling this extremely superficially in your comments on the topic. I wish you’d try modeling this with more granularity than “is alignment hard” or whatever. I think that if you try to actually make such a model, you’ll likely end up with a much better sense of where other people are coming from. If you’re trying to do this, I recommend reading posts where people explain strategies for passing safely through the singularity, e.g. like this.

Buck Mar 31, 2025, 4:15 PM
14 points
0
in reply to: Mo Putera’s comment on: Mo Putera’s Shortform
I have this experience with @ryan_greenblatt—he’s got an incredible ability to keep really large and complicated argument trees in his head, so he feels much less need to come up with slightly-lossy abstractions and categorizations than e.g. I do. This is part of why his work often feels like huge, mostly unstructured lists. (The lists are more unstructured before his pre-release commenters beg him to structure them more.) (His code often also looks confusing to me, for similar reasons.)

Buck Mar 31, 2025, 2:21 PM
9 points
4
on: Why do many people who care about AI Safety not clearly endorse PauseAI?
Some quick takes:
- “Pause AI” could refer to many different possible policies.
- I think that if humanity avoided building superintelligent AI, we’d massively reduce the risk of AI takeover and other catastrophic outcomes.
- I suspect that at some point in the future, AI companies will face a choice between proceeding more slowly with AI development than they’re incentivized to, and proceeding more quickly while imposing huge risks. In particular, I suspect it’s going to be very dangerous to develop ASI.
- I don’t think that it would be clearly good to pause AI development now. This is mostly because I don’t think that the models being developed literally right now pose existential risk.
- Maybe it would be better to pause AI development right now because this will improve the situation later (e.g. maybe we should pause until frontier labs implement good enough security that we can be sure their Slack won’t be hacked again, leaking algorithmic secrets). But this is unclear and I don’t think it immediately follows from “we could stop AI takeover risk by pausing AI development before the AIs are able to take over”.
- Many of the plausible “pause now” actions seem to overall increase risk. For example, I think it would be bad for relatively responsible AI developers to unilaterally pause, and I think it would probably be bad for the US to unilaterally force all US AI developers to pause if they didn’t simultaneously somehow slow down non-US development.
  - (They could slow down non-US development with actions like export controls.)
- Even in the cases where I support something like pausing, it’s not clear that I want to spend effort on the margin actively supporting it; maybe there are other things I could push on instead that have better ROI.
- I’m not super enthusiastic about PauseAI the organization; they sometimes seem to not be very well-informed, they sometimes argue for conclusions that I think are wrong, and I find Holly pretty unpleasant to interact with, because she seems uninformed and prone to IMO unfair accusations that I’m conspiring with AI companies. My guess is that there could be an organization with similar goals to PauseAI that I felt much more excited for.

Buck Mar 31, 2025, 1:07 PM
LW: 9 AF: 7
4
AF
on: Good Research Takes are Not Sufficient for Good Strategic Takes
A few points:
- Knowing a research field well makes it easier to assess how much other people know about it. For example, if you know ML, you sometimes notice that someone clearly doesn’t know what they’re talking about (or conversely, you become impressed by the fact that they clearly do know what they’re talking about). This is helpful when deciding who to defer to.
- If you are a prominent researcher, you get more access to confidential/sensitive information and the time of prestigious people. This is true regardless of whether your strategic takes are good, and generally improves your strategic takes.
  - One downside is that people try harder to convince you of stuff. I think that being a more prominent researcher is probably overall net positive despite this effect.
- IMO, one way of getting a sense of whether someone’s strategic takes are good is to ask them whether they try hard to have strategic takes. A lot of people will openly tell you that they don’t focus on that, which makes it easier for you to avoid deferring to random half-baked strategic takes that they say without expecting anyone to take them too seriously.

Buck Mar 31, 2025, 12:35 PM
LW: 4 AF: 3
0
AF
on: We Have No Plan for Preventing Loss of Control in Open Models
A few takes:
I believe that there is also an argument to be made that the AI safety community is currently very under-indexed on research into future scenarios where assumptions about the AI operator taking baseline safety precautions related to preventing loss of control do not hold.
I think you’re mixing up two things: the extent to which we consider the possibility that AI operators will be very incautious, and the extent to which our technical research focuses on that possibility.
My research mostly focuses on techniques that an AI developer could use to reduce the misalignment risk posed by deploying and developing AI, given some constraints on how much value they need to get from the AI. Given this, I basically definitionally have to be imagining the AI developer trying to mitigate misalignment risk: why else would they use the techniques I study?
But that focus isn’t to say that I’m totally sure all AI developers will in fact use good safety techniques.
Another disagreement is that I think that we’re better off if some AI developers (preferably more powerful ones) have controlled or aligned their models, even if there are some misaligned AIs being developed without safeguards. This is because the controlled/aligned models can be used to defend against attacks from the misaligned AIs, and to compete with the misaligned AIs (on both acquiring resources and doing capabilities and safety research).

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?

Alex Mallen, charlie_griffin and Buck

Mar 24, 2025, 5:55 PM

30 points

0 comments8 min readLW link

Buck Mar 14, 2025, 12:24 AM
LW: 2 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: The case for ensuring that powerful AIs are controlled
I am not sure I agree with this change at this point. How do you feel now?

Buck Mar 13, 2025, 12:21 AM
4 points
0
in reply to: Katalina Hernandez’s comment on: Buck’s Shortform
We’re planning to release some talks; I also hope we can publish various other content from this!
I’m sad that we didn’t have space for everyone!

Buck Mar 10, 2025, 4:10 AM
2 points
0
in reply to: Mis-Understandings’s comment on: Mis-Understandings’s Shortform
I wrote thoughts here: https://redwoodresearch.substack.com/p/fields-that-i-reference-when-thinking?selection=fada128e-c663-45da-b21d-5473613c1f5c

Buck Mar 5, 2025, 10:30 PM
3 points
0
in reply to: Jerdle’s comment on: Jerdle’s Shortform
Fans of math exercises related to toy models of taxation might enjoy this old post of mine.

Buck Mar 3, 2025, 7:54 PM
LW: 52 AF: 31
3
AF
on: Buck’s Shortform
Alignment Forum readers might be interested in this:
Announcing ControlConf: The world’s first conference dedicated to AI control—techniques to mitigate security risks from AI systems even if they’re trying to subvert those controls. March 27-28, 2025 in London. 🧵ControlConf will bring together:
- Researchers from frontier labs & government
- AI researchers curious about control mechanisms
- InfoSec professionals
- Policy researchers
Our goals: build bridges across disciplines, share knowledge, and coordinate research priorities.
The conference will feature presentations, technical demos, panel discussions, 1:1 networking, and structured breakout sessions to advance the field.
Interested in joining us in London on March 27-28? Fill out the expression of interest form by March 7th: https://forms.fillout.com/t/sFuSyiRyJBus

Buck

Sub­ver­sion Strat­egy Eval: Can lan­guage mod­els state­lessly strate­gize to sub­vert con­trol pro­to­cols?

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?