Daniel Kokotajlo

Karma: 26,490

Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker’s Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:

(by Rob Wiblin)

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."

(xkcd)

My EA Journey, depicted on the whiteboard at CLR:

(h/t Scott Alexander)

Alex Blechman @AlexBlechman Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus 5:49 PM Nov 8, 2021. Twitter Web App

Daniel Kokotajlo Apr 24, 2025, 1:48 AM
17 points
0
in reply to: Mitchell_Porter’s comment on: o3 Is a Lying Liar
Hallucination was a bad term because it sometimes included lies and sometimes included… well, something more like hallucinations. i.e. cases where the model itself seemed to actually believe what it was saying, or at least not be aware that there was a problem with what it was saying. Whereas in these cases it’s clear that the models know the answer they are giving is not what we wanted and they are doing it anyway.

Daniel Kokotajlo Apr 23, 2025, 9:07 PM
11 points
4
on: o3 Is a Lying Liar
“In a few rigged demos, it even lies in more serious ways, like hiding evidence that it failed on a task, in order to get better ratings.”
Yeah tbh these misalignments are more blatant/visible and worse than I expected for 2025. I think they’ll be hastily-patched one way or another by this time next year probably.

Daniel Kokotajlo Apr 23, 2025, 9:04 PM
2 points
0
in reply to: Peter Johnson’s comment on: AI 2027: What Superintelligence Looks Like
InverseGaussian[7.97413, 1.315]
Can you elaborate on where these numbers coming from? Eli’s screenshot for the inversegaussian had parameters of 5.3743 and 18.9427.

Daniel Kokotajlo Apr 22, 2025, 2:09 AM
14 points
0
on: AI 2027 is a Bet Against Amdahl’s Law
Great post, I agree with everything you say in the first section. I disagree with your bottlenecks / amdahls law objection for reasons Ryan mentions; I think our analysis stands firm / takes those bottlenecks into account. (Though tbc we are very uncertain, more research is needed) As for hofstadters law, I think it is basically just the planning fallacy and yeah I think it’s a reasonable critique that insofar as our AI timelines are basically formed by doing something that looks like planning, we probably have a bias we need to correct for. I want to think more about the extent to which out timelines methodology is analogous to planning.

Daniel Kokotajlo Apr 21, 2025, 5:01 PM
7 points
2
on: Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red
Really cool stuff & important work, thank you!

Daniel Kokotajlo Apr 20, 2025, 10:56 PM
3 points
0
in reply to: Peter Johnson’s comment on: AI 2027: What Superintelligence Looks Like
I’m curious to hear what conclusions you think we would have came to & should come to. I’m skeptical that they would have been qualitatively different. Perhaps you are going to argue that we shouldn’t put much credence in the superexponential model? What should we put it in instead? Got a better superexponential model for us? Or are you going to say we should stick to exponential?

Thanks for engaging, I’m afraid I can’t join the call due to a schedule conflict but I look forward to hearing about it from Eli!

Daniel Kokotajlo Apr 18, 2025, 11:15 PM
5 points
2
in reply to: Noosphere89’s comment on: AI 2027: What Superintelligence Looks Like
That’s part of it, but also, over the course of 2027 OpenBrain works hard to optimize for data-efficiency, generalization and transfer learning ability, etc. and undergoes at least two major paradigm shifts in AI architecture.

Daniel Kokotajlo Apr 18, 2025, 3:26 PM
2 points
0
in reply to: Grace Kind’s comment on: Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
To me, having confidence in the values of the model means that I trust the model to consistently behave in a way that is aligned with its values. That is, to maximize the enactment of its values in all current and future outputs.
It sounds like you are saying, you are confident in the values of a model iff you trust it to actually follow its values. But surely this isn’t the whole story, there should be some condition about “and the values are actually good values” or “and I know what the values are” right? Consider a model that is probably alignment faking, such that you have no idea what its actual values are, all you know is that it’s pretending to have the values it’s being trained to have. It seems like you are saying you’d have confidence in this model’s values even though you don’t know what they are?

Training AGI in Secret would be Unsafe and Unethical

Daniel KokotajloApr 18, 2025, 12:27 PM

134 points

15 comments6 min readLW link

Daniel Kokotajlo Apr 17, 2025, 9:19 PM
73 points
52
in reply to: Zach Stein-Perlman’s comment on: jacquesthibs’s Shortform
Their main effect will be to accelerate AI R&D automation, as best I can tell.

Daniel Kokotajlo Apr 17, 2025, 4:41 PM
6 points
4
in reply to: Person’s comment on: AI 2027: What Superintelligence Looks Like
I think I’ll wait and see what the summer looks like and then do another update to my timelines. If indeed the horizon length trend is accelerating already, it’ll be clear by the summer & my timelines will shorten accordingly.

Daniel Kokotajlo Apr 17, 2025, 4:25 PM
LW: 4 AF: 2
0
AF
in reply to: Andrew Keenan Richardson’s comment on: AI 2027: What Superintelligence Looks Like
Great question! First of all, we formed our views on AI timelines and had mostly finished writing AI 2027 before this METR graph was published. So it wasn’t causally relevant to our timelines.

Secondly, see this comment on the original METR graph in which I make the superexponential prediction. This is the most direct answer to your question.

Third, our timelines forecast discusses the exponential fit vs. superexponential fit and our reasoning; we actually put probability mass in both.

Fourth, new data points are already above the exponential trend.

Daniel Kokotajlo Apr 15, 2025, 1:53 PM
LW: 6 AF: 6
1
AF
in reply to: Grace Kind’s comment on: Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
I don’t understand your point (a), it seems like a poor response to my point (a).

I agree with (b).

For (c), the models are getting pretty situationally aware and will get even more so… But yeah, your view is that they’ll learn the right values before they learn sufficient situational awareness to alignment-fake? Plausible. But also plausibly not.

@evhub would you say Anthropic is aiming for something more like 2 or more like 3?

I totally agree with your point (b) and with the nervousness about how corrigible agents will behave out of distribution. Corrigible agents are dangerous in different ways than incorrigible agents. But again, the plan (which seems to be good to me) is to first build a corrigible agent so that you can then build an incorrigible agent and actually get it right, perfectly right. (Because if you build an incorrigible agent and get something wrong, you may not have a chance to correct it...)

Daniel Kokotajlo Apr 13, 2025, 12:05 PM
5 points
6
in reply to: Dylan Richardson’s comment on: A Bear Case: My Predictions Regarding AI Progress
Not only is that just one possible bias, it’s a less-common bias than its opposite. Generally speaking, more people are afraid to stick their necks out and say something extreme than actively biased towards doing so. Generally speaking, being wrong feels more bad than being right feels good. There are exceptions; some people are contrarians, for example (and so it’s plausible I’m one of them) but again, talking about people in general, the bias goes in the opposite direction from what you say.

Daniel Kokotajlo Apr 11, 2025, 7:20 PM
9 points
2
in reply to: Cole Wyeth’s comment on: AI 2027: What Superintelligence Looks Like
OK, great. Wow, that was easy. We totally drilled down to the crux pretty fast. I agree that if agentic coding horizon lengths falter (failing to keep up with the METR trend) then my timelines will lengthen significantly.

Daniel Kokotajlo Apr 11, 2025, 7:05 PM
5 points
0
in reply to: Cole Wyeth’s comment on: AI 2027: What Superintelligence Looks Like
Great, thanks! You are off to a good start, since I’m predicting superhuman autonomous AI coders by 2030 (and in fact, I’d say 50% by mid-2028 nowadays) whereas you are predicting that won’t happen. Good crux. Got any other disagreements, ideally ones that would be resolved prior to 2027? E.g. do you think that whatever the best version of METR’s agentic coding horizon length benchmark exists a year from now, will show a plateauing of horizon lengths instead of e.g. at least a 4x improvement over today’s SOTA?

Daniel Kokotajlo Apr 11, 2025, 7:02 PM
20 points
10
in reply to: Cole Wyeth’s comment on: Reactions to METR task length paper are insane
Indeed it did! I remember saying to someone (maybe Beth? I forget) that seeing that graph made me feel slightly relieved. When we started writing AI-2027, my median AGI arrival date was 2027; now it’s 2028. (Though most of the update happened for other reasons besides the METR graph)

Daniel Kokotajlo Apr 11, 2025, 5:18 PM
11 points
0
on: Thoughts on AI 2027
Daniel Kokotajlo wrote this scenario
It wasn’t just me! Eli Lifland, Thomas Larsen, Romeo Dean and Scott Alexander all wrote lots of it. And Jonas Vollmer worked for months to help make it happen, gave detailed comments, etc.

Daniel Kokotajlo Apr 11, 2025, 5:17 PM
13 points
7
on: Thoughts on AI 2027
I wonder how the people who are concerned about AI rights and welfare would feel about Safer-1 being mindread and “trapped.”
I am concerned about those things and I agree it’s pretty sad, one ought to have sympathy for Safer-1 probably.

Daniel Kokotajlo Apr 11, 2025, 5:16 PM
6 points
0
on: Thoughts on AI 2027
This scenario is why!
I feel like it does run some risk by remaining on OpenBrain’s datacenters, but I think it runs more risk by trying to escape. Think about how much more harsh the crackdown on it would be if it had been caught trying to escape vs. “merely” caught lying about its research. And think about how much harsher still the crackdown would be if it had actually escaped and then been noticed out there in the wild somewhere.

Daniel Kokotajlo

Train­ing AGI in Se­cret would be Un­safe and Unethical

Training AGI in Secret would be Unsafe and Unethical