Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker’s Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes: one by Rob Wiblin, an xkcd, and my EA journey as depicted on the whiteboard at CLR (h/t Scott Alexander).
Since R1 is both the shoggoth and the face, Part 1 of the proposal (the shoggoth/face distinction) has not been implemented.
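For concreteness, here's a minimal sketch of what the distinction would look like versus what R1 actually does. The model objects and their `generate` interface are hypothetical, not any real API:

```python
# Hypothetical sketch of the shoggoth/face split (Part 1 of the proposal).
# `shoggoth`, `face`, and `.generate(...)` are illustrative stand-ins.

def shoggoth_face_pipeline(shoggoth, face, user_prompt: str) -> str:
    """Two separate models: the 'shoggoth' produces a raw CoT that is
    never shown to raters, and the 'face' turns it into the user-visible
    answer. Only the face's output gets rated, so there is no training
    pressure making the reasoning itself look nice (i.e. unfaithful)."""
    cot = shoggoth.generate(user_prompt)      # raw chain of thought, hidden from raters
    answer = face.generate(user_prompt, cot)  # polished answer, the only rated text
    return answer

def r1_pipeline(r1, user_prompt: str) -> str:
    """R1, by contrast, is one model playing both roles: the same weights
    produce the CoT and the final answer in a single pass."""
    return r1.generate(user_prompt)
```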
I agree Part 2 seems to have been implemented, though I thought I remembered something about them trying to train it not to switch between languages in the CoT, and that degrading performance?
I agree it would be pretty easy to fine-tune R1 to implement all the stuff I wanted. That's why I made these proposals back in 2023: I was looking ahead to the sorts of systems that would exist in 2024 and thinking they could probably be made to have some nice faithfulness properties fairly easily.