I’m a Postdoctoral Research Fellow at Oxford University’s Global Priorities Institute.
Previously, I was a Philosophy Fellow at the Center for AI Safety.
So far, my work has mostly been about the moral importance of future generations. Going forward, it will mostly be about AI.
You can email me at elliott.thornley@philosophy.ox.ac.uk.
Thanks! We think that advanced POST-agents won’t deliberately try to get shut down, for the reasons we give in footnote 5 (relevant part pasted below). In brief:
- Advanced agents will be choosing between lotteries (probability distributions over trajectories).
- We have theoretical reasons to expect that agents satisfying POST (when choosing between trajectories) will be 'neutral' (when choosing between lotteries): they won't spend resources to shift probability mass between different-length trajectories.
So (we think) neutral agents won’t deliberately try to get shut down if doing so costs resources.
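To make the neutrality point concrete, here's a toy sketch (my own illustration, not from the paper: the lotteries, the made-up numbers, and the length-by-length comparison rule are all simplifying assumptions):

```python
# Toy sketch (made-up numbers): each lottery is a list of
# (probability, trajectory_length, resources_left) outcomes.
status_quo = [(0.9, "long", 10), (0.1, "short", 10)]

# Deliberately seeking shutdown: spend 2 units of resources to shift
# probability mass from long trajectories to short (shut-down) ones.
seek_shutdown = [(0.1, "long", 8), (0.9, "short", 8)]

def conditional_resources(lottery, length):
    """Expected resources conditional on trajectories of the given length."""
    mass = sum(p for p, ln, _ in lottery if ln == length)
    value = sum(p * r for p, ln, r in lottery if ln == length)
    return value / mass if mass else None

# A neutral agent doesn't care where probability mass sits between
# different trajectory-lengths, so it compares the lotteries
# length-by-length rather than paying to move mass between lengths:
for length in ("long", "short"):
    print(length,
          conditional_resources(status_quo, length),
          conditional_resources(seek_shutdown, length))
# Output:
#   long 10.0 8.0
#   short 10.0 8.0
```

Conditional on either trajectory-length, seeking shutdown just leaves the agent with fewer resources, so a neutral agent won't pay the cost.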