I’m coming to this two weeks late, but here are my thoughts.
The question of interest is:
Will sufficiently-advanced artificial agents be representable as maximizing expected utility?
Rephrased:
Will sufficiently-advanced artificial agents satisfy the VNM axioms (Completeness, Transitivity, Independence, and Continuity)?
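For reference, here is one common textbook statement of the four axioms, where ⪰ is the agent's preference relation over lotteries and pA + (1−p)C is the lottery that yields A with probability p and C otherwise (my formulation, not drawn from the post):

```latex
% One standard formulation of the VNM axioms (a textbook statement, not the post's).
\begin{align*}
\textbf{Completeness:} \quad & \forall A, B:\; A \succeq B \;\lor\; B \succeq A \\
\textbf{Transitivity:} \quad & \forall A, B, C:\; (A \succeq B \;\land\; B \succeq C) \Rightarrow A \succeq C \\
\textbf{Independence:} \quad & \forall A, B, C,\; \forall p \in (0,1]:\; A \succeq B \;\Leftrightarrow\; pA + (1-p)C \succeq pB + (1-p)C \\
\textbf{Continuity:} \quad & \forall A, B, C:\; A \succeq B \succeq C \Rightarrow \exists p \in [0,1]:\; pA + (1-p)C \sim B
\end{align*}
```

The VNM theorem then says that an agent satisfies all four axioms if and only if there is some utility function u such that the agent prefers one lottery to another exactly when the first has higher expected u.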
Coherence arguments purport to establish that the answer is yes. These arguments go like this:
1. There exist theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.
2. Sufficiently-advanced artificial agents will not pursue dominated strategies.
3. So, sufficiently-advanced artificial agents will be representable as maximizing expected utility.
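To make "pursue dominated strategies" concrete, here is a minimal sketch (my toy example, not from the post): an agent with cyclic strict preferences that always pays a small fee to trade up to a strictly preferred option ends up holding its original option and is poorer by the fees, so the always-trade policy is dominated by simply keeping the original option.

```python
FEE = 0.01

# Cyclic strict preferences: (X, Y) means the agent strictly prefers X to Y.
strictly_prefers = {("A", "B"), ("B", "C"), ("C", "A")}

def accepts_trade(current, offered):
    """Toy choice rule: trade whenever the offered option is strictly preferred."""
    return (offered, current) in strictly_prefers

holding, money = "A", 0.0
for offer in ["C", "B", "A"]:       # a cycle of pairwise offers, each for a small fee
    if accepts_trade(holding, offer):
        holding, money = offer, money - FEE

print(holding, round(money, 2))     # prints: A -0.03  (same option, less money)
```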
These arguments don’t work, because premise 1 is false: there are no theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies. In the year since I published my post, no one has disputed that.
Now to address two prominent responses:
‘I define ‘coherence theorems’ differently.’
In the post, I used the term ‘coherence theorems’ to refer to ‘theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.’ I took that to be the usual definition on LessWrong (see the Appendix for why), but some people replied that they meant something different by ‘coherence theorems’: e.g. ‘theorems that are relevant to the question of agent coherence.’
All well and good. If you use that definition, then there are coherence theorems. But if you use that definition, then coherence theorems can’t play the role that they’re supposed to play in coherence arguments. Premise 1 of the coherence argument is still false. That’s the important point.
‘The mistake is benign.’
This is a crude summary of Rohin’s response. Rohin and I agree that the Complete Class Theorem implies the following: ‘If an agent has complete and transitive preferences, then unless the agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.’ So the mistake is neglecting to say ‘If an agent has complete and transitive preferences…’ Rohin thinks this mistake is benign.
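Schematically, the difference can be put like this (my paraphrase of the point above, not a formal statement of the theorem):

```latex
% What the Complete Class Theorem supports, per the discussion above:
(\text{Completeness} \land \text{Transitivity}) \;\Rightarrow\;
  \bigl(\neg\,\text{EU-representable} \Rightarrow \text{liable to pursue dominated strategies}\bigr)

% What premise 1 requires:
\neg\,\text{EU-representable} \;\Rightarrow\; \text{liable to pursue dominated strategies}
```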
I don’t think the mistake is benign. As my rephrasing of the question of interest above makes clear, Completeness and Transitivity are a major part of what coherence arguments aim to establish! So it’s crucial to note that the Complete Class Theorem gives us no reason to think that sufficiently-advanced artificial agents will have complete or transitive preferences, especially since:
Completeness doesn’t come for free.
Money-pump arguments for Completeness (applied to artificial agents) aren’t convincing (see the sketch after this list).
Money-pump arguments for Transitivity assume Completeness.
Training agents to violate Completeness might keep them shutdownable.
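Here is a minimal sketch of the kind of point at issue in the second item above (my toy example, not the post's argument): an agent with incomplete preferences that simply declines to trade at preferential gaps is not thereby led into a dominated outcome by a sequence of pairwise offers.

```python
# Toy sketch (my illustration, not the post's argument).
# Options: "A+" is strictly preferred to "A"; "B" is incomparable to both
# (a preferential gap, not indifference).

strictly_prefers = {("A+", "A")}

def accepts_trade(current, offered):
    """Toy choice rule: trade only when the offered option is strictly preferred."""
    return (offered, current) in strictly_prefers

# Attempted pump: offer to swap A+ for B (a gap), then to swap for A.
# An agent that traded at every gap would end with A, which it disprefers
# to the A+ it started with.
holding = "A+"
for offer in ["B", "A"]:
    if accepts_trade(holding, offer):
        holding = offer

print(holding)  # prints: A+  (the agent declines to trade at the gap, so it is not pumped)
```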
Two important points
Here are two important points, which I make to preclude misreadings of the post:
Future artificial agents—trained in a standard way—might still be representable as maximizing expected utility.
Coherence arguments don’t work, but there might well be other reasons to think that future artificial agents—trained in a standard way—will be representable as maximizing expected utility.
Artificial agents not representable as maximizing expected utility can still be dangerous.
So why does the post matter?
The post matters because ‘train artificial agents to have incomplete preferences’ looks promising as a way of ensuring that these agents allow us to shut them down.
AI safety researchers haven’t previously considered incomplete preferences as a solution, plausibly because these researchers accepted coherence arguments and so thought that agents with incomplete preferences were a non-starter.[1] But coherence arguments don’t work, so training agents to have incomplete preferences is back on the table as a strategy for reducing risks from AI. And (I think) it looks like a pretty good strategy. I make the case for it in this post, and my coauthors and I will soon be posting some experimental results suggesting that the strategy is promising.
As I wrote elsewhere: