Buck

Karma: 11,379

CEO at Redwood Research.

AI safety is a highly collaborative field—almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I’m saying this here because it would feel repetitive to say “these ideas were developed in collaboration with various people” in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Buck Feb 19, 2025, 11:38 PM
LW: 34 AF: 6
15
AF
in reply to: Eliezer Yudkowsky’s comment on: How might we safely pass the buck to AI?
the OpenPhil doctrine of “AGI in 2050”
(Obviously I’m biased here by being friends with Ajeya.) This is only tangentially related to the main point of the post, but I think you’re really overstating how many Bayes points you get against Ajeya’s timelines report. Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she’d have said 10% between 2025 and 2036.
I don’t think you’ve ever made concrete predictions publicly (which makes me think it’s worse behavior for you to criticize people for their predictions), but I don’t think there are that many groups who would have put wildly higher probability on AGI in this particular time period. (I think some of the short-timelines people at the time put substantial mass on AGI arriving by now, which reduces their performance.) Maybe some of them would have said 40%? If we assume AGI by then, that’s a couple bits of better performance, but I don’t think it’s massive outperformance. (And I still think it’s plausible that AGI isn’t developed by 2036!)
In general, I think that disagreements on AI timelines often seem more extreme when you summarize people’s timelines by median timeline rather than by their probability on AGI by a particular time.

Buck Feb 11, 2025, 10:03 PM
14 points
2
in reply to: davekasten’s comment on: davekasten’s Shortform
@ryan_greenblatt is working on a list of alignment research applications. For control applications, you might enjoy the long list of control techniques in our original post.

Buck Feb 8, 2025, 4:47 PM
LW: 7 AF: 6
3
AF
in reply to: Rohin Shah’s comment on: Research directions Open Phil wants to fund in technical AI safety
- Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)
I don’t know what this means, do you have any examples?

Buck Feb 8, 2025, 4:46 PM
LW: 20 AF: 12
9
AF
in reply to: Rohin Shah’s comment on: Research directions Open Phil wants to fund in technical AI safety
I think we should just all give up on the word “scalable oversight”; it is used in many conflicting ways, sadly. I mostly talk about “recursive techniques for reward generation”.

Buck Feb 3, 2025, 8:55 PM
LW: 6 AF: 5
3
AF
on: Ten people on the inside
Some reasons why the “ten people on the inside” might have massive trouble doing even cheap things:
- Siloing. Perhaps the company will prevent info flowing between different parts of the company. I hear that this already happens to some extent already. If this happens, it’s way harder to have a safety team interact with other parts of the company (e.g. instead of doing auditing themselves, they’d have to get someone from all the different teams that are doing risky stuff to do the auditing).
- Getting cancelled. Perhaps the company will learn “people who are concerned about misalignment risk constantly cause problems for us, we should avoid ever hiring them”. I think this is plausible.
- Company-inside-the-company. Perhaps AI automation allows the company to work with just a tiny number of core people, and so the company ends up mostly just doing a secret ASI project with the knowledge of just a small trusted group. This might be sensible if the leadership is worried about leaks, or if they want to do an extremely aggressive power grab.

Buck Feb 2, 2025, 2:05 AM
2 points
0
in reply to: Olli Järviniemi’s comment on: Loppukilpailija’s Shortform
How well did this workshop/exercise set go?

Some articles in “International Security” that I enjoyed

BuckJan 31, 2025, 4:23 PM

130 points

10 comments4 min readLW link

A sketch of an AI control safety case

Tomek Korbak, joshc, Benjamin Hilton, Buck and Geoffrey Irving

Jan 30, 2025, 5:28 PM

57 points

0 comments5 min readLW link

Buck Jan 30, 2025, 4:53 PM
LW: 9 AF: 8
3
AF
in reply to: Rohin Shah’s comment on: Ten people on the inside
Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people.
What laws are you imagine making it harder to deploy stuff? Notably I’m imagining these people mostly doing stuff with internal deployments.
I think you’re overfixating on the experience of Google, which has more complicated production systems than most.

Buck Jan 29, 2025, 4:33 PM
LW: 31 AF: 14
5
AF
in reply to: Scott Alexander’s comment on: Ten people on the inside
I’ve talked to a lot of people who have left leading AI companies for reasons related to thinking that their company was being insufficiently cautious. I wouldn’t usually say that they’d left “in protest”; for example, most of them haven’t directly criticized the companies after leaving.
In my experience, the main reason that most of these people left was that they found it very unpleasant to working there and thought their research would be better elsewhere, not that they wanted to protest poor safety policies per se. I usually advise such people against leaving if the company has very few safety staff, but it depends on their skillset.
(These arguments don’t apply to Anthropic: there are many people there who I think will try to implement reasonable safety techniques, so on the current margin, the benefits to safety technique implementation of marginal people seems way lower. It might still make sense to work at Anthropic, especially if you think it’s a good place to do safety research that can be exported.)
Incidentally, I’m happy to talk to people who are considering leaving AI companies and give them much more specific advice.
What links here?
- ryan_greenblatt's comment on Ten people on the inside by Buck (Jan 29, 2025, 4:55 PM; 5 points)

Buck Jan 29, 2025, 4:28 PM
LW: 4 AF: 3
1
AF
in reply to: Zac Hatfield-Dodds’s comment on: Ten people on the inside
Many more than two safety-concerned people have left AI companies for reasons related to thinking that those companies are reckless.

Buck Jan 28, 2025, 6:09 PM
LW: 37 AF: 17
14
AF
on: AI companies are unlikely to make high-assurance safety cases if timelines are short
Some tweets I wrote that are relevant to this post:
In general, AI safety researchers focus way too much on scenarios where there’s enough political will to adopt safety techniques that are seriously costly and inconvenient. There’s a couple reasons for this.
Firstly, AI company staff are disincentivized from making their companies look reckless, and if they give accurate descriptions of the amount of delay that the companies will tolerate, it will sound like they’re saying the company is reckless.
Secondly, safety-concerned people outside AI companies feel weird about openly discussing the possibility of AI companies only adopting cheap risk mitigations, because they’re scared of moving the Overton window and they’re hoping for a door-in-the-face dynamic.
So people focus way more on safety cases and other high-assurance safety strategies than is deserved given how likely they seem. I think that these dynamics have skewed discourse enough that a lot of “AI safety people” (broadly interpreted) have pretty bad models here.

Ten people on the inside

BuckJan 28, 2025, 4:41 PM

139 points

28 comments4 min readLW link

Buck Jan 27, 2025, 7:30 PM
12 points
8
in reply to: Ryan Kidd’s comment on: Ryan Kidd’s Shortform
I appreciate the spirit of this type of calculation, but think that it’s a bit too wacky to be that informative. I think that it’s a bit of a stretch to string these numbers together. E.g. I think Ryan and Tom’s predictions are inconsistent, and I think that it’s weird to identify 100%-AI as the point where we need to have “solved the alignment problem”, and I think that it’s weird to use the Apollo/Manhattan program as an estimate of work required. (I also don’t know what your Manhattan project numbers mean: I thought there were more like 2.5k scientists/engineers at Los Alamos, and most of the people elsewhere were purifying nuclear material)

Buck Jan 27, 2025, 6:31 PM
9 points
4
in reply to: Maxwell Peterson’s comment on: What did you learn from leaked documents?
I also loved this doc.

Buck Jan 27, 2025, 5:33 PM
4 points
0
on: Is the output of the softmax in a single transformer attention head usually winner-takes-all?
IIRC, for most attention heads the max attention is way less than 90%, so my answer is “no”. It should be very easy to get someone to make a basic graph of this for you.

Buck Jan 24, 2025, 4:41 PM
7 points
0
in reply to: johnswentworth’s comment on: The Case Against AI Control Research
Yeah I strongly disagree that this failure mode is very likely to be very hard to mostly resolve. I’m doing some research right now that will hopefully shed some light on this!

Buck Jan 23, 2025, 8:09 PM
9 points
0
in reply to: Raemon’s comment on: The Case Against AI Control Research
My sense is that the control techniques we’ve talked about (e.g. these, and some other stuff that we’ve thought of since then) would indeed drastically reduce the risk posed by those Von Neumanns, especially if you run control evaluations with the Von Neumanns.
For example, consider the problem “I want to ask the AI a question that has a numerical answer and that would take a human 8 hours to answer, and I’m worried that occasionally on a subdomain of its choice it will blatantly lie to me”. If you do no countermeasures, it seems easy for the AI to cause this problem. But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it seems way intuitively hard for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that). So I think that adopting well-tuned variations of basic techniques seems really promising.

Buck Jan 23, 2025, 7:56 PM
6 points
0
in reply to: johnswentworth’s comment on: The Case Against AI Control Research
In your post, you emphasize the slop problem. I think that the slop problem is probably much harder to solve if those AIs are scheming. I guess you’re saying that it’s just unlikely that the AIs are scheming at the point where you’re worried about the slop problem?

Buck Jan 23, 2025, 7:34 PM
2 points
0
in reply to: Raemon’s comment on: The Case Against AI Control Research
Yeah, John’s position seems to require “it doesn’t matter whether huge numbers of Von Neumann level AGIs are scheming against you”, which seems crazy to me.

Buck

Some ar­ti­cles in “In­ter­na­tional Se­cu­rity” that I enjoyed

A sketch of an AI con­trol safety case

Ten peo­ple on the inside

Some articles in “International Security” that I enjoyed

A sketch of an AI control safety case

Ten people on the inside