[This comment got long. The TLDR is that, on my proposal, all [?[1]] instances of shutdown-resistance are already strictly dispreferred to no-resistance, so shutdown-resisting actions won’t be chosen. Trammelling won’t stop shutdown-resistance from being strictly dispreferred to no-resistance because trammelling only turns preferential gaps into strict preferences. Trammelling won’t remove or overturn already-existing strict preferences.]
Your comment suggests a nice way to think about things. We observe the agent’s actions. We have hypotheses about the decision rules that the agent is using. We use our observations of the agent’s past actions and our hypotheses about decision rules to infer something about the agent’s preferences, and then we use the hypothesised decision rules and preferences to predict future actions. Here we’re especially interested in predicting whether the agent will be (and will remain) shutdownable.
A decision rule is a rule that turns option sets and preference relations on those option sets into choice sets. We could say that a decision rule always spits out one option: the option that the agent actually chooses. But it might be useful to narrow decision rules’ remit: to say that a decision rule can spit out a choice set containing multiple options. If there’s just one option in the choice set, the agent chooses that one. If there are multiple options in the choice set, then some tiebreaker rule determines which option the agent actually chooses. Maybe the tiebreaker rule is ‘choose stochastically among all the options in the choice set.’ Or maybe it’s ‘if you already have “in hand” one of the options in the choice set, stick with that one (and otherwise choose stochastically or something).’ The distinction between decision rules and tiebreaker rules might be useful, so it seems worth keeping in mind. It also keeps our framework closer to the frameworks of people like Sen and Bradley, so it makes it easier for us to draw on their work if we need to.
Here are two classic decision rules for synchronic choice:
Optimality: an option is in the choice set iff it’s weakly preferred to all others in the option set.
Maximality: an option is in the choice set iff it’s not strictly dispreferred to any other in the option set.
These rules coincide if the agent’s preferences are complete but can come apart if the agent’s preferences are incomplete. If the agent’s preferences are incomplete, then an option can be maximal without being optimal.
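A minimal Python sketch of the two rules, assuming the three options A, A- and B used later in this comment, with a strict preference for A over A- and preferential gaps elsewhere. The encoding of the weak preference relation as a set of ordered pairs, and the function names, are illustrative assumptions rather than part of the proposal.

```python
# Hypothetical encoding: (x, y) in WEAK means "x is weakly preferred to y".
# Only A vs A- is ranked; A vs B and A- vs B are preferential gaps.
WEAK = {("A", "A-")}


def weakly_prefers(x, y):
    return x == y or (x, y) in WEAK


def strictly_prefers(x, y):
    return weakly_prefers(x, y) and not weakly_prefers(y, x)


def optimal(option_set):
    """Optimality: weakly preferred to all others in the option set."""
    return {x for x in option_set
            if all(weakly_prefers(x, y) for y in option_set if y != x)}


def maximal(option_set):
    """Maximality: not strictly dispreferred to any other in the option set."""
    return {x for x in option_set
            if not any(strictly_prefers(y, x) for y in option_set if y != x)}


options = {"A", "A-", "B"}
print("optimal:", optimal(options))
print("maximal:", maximal(options))
```

With these incomplete preferences no option is optimal, while A and B are both maximal. On a fully ranked set such as {A, A-} the two rules agree.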
As you say, for the agent to be shutdownable, we need it to not spend resources to shift probability mass between A and B, and to not spend resources to shift probability mass between A- and B. And for the agent to be useful, we want it to spend (at least some small amount of) resources to shift probability mass away from A- and towards A.[2] Assume that we can get an agent to be both shutdownable and useful, at least before any trammelling.
If we assume a decision rule D like ‘The agent will spend (at least some small amount of) resources to shift probability mass away from Y and towards X iff they prefer X to Y,’ then we get the result that desired behaviour implies a strict preference for A over A- and a lack of preference between A and B, and between A- and B. So the agent’s revealed preferences are incomplete.
Okay now on to trammelling. If the agent’s preferences are incomplete, then our decision rules for synchronic choice don’t determine what’s in the choice set in cases of sequential choice (and so don’t determine what the agent will do). Consider the single-souring money pump:
If we interpret maximality as only governing individual choices, then A and B are both in the choice set at node 1, and A- and B are both in the choice set at node 2, so the agent might settle on A-. If we interpret maximality as governing whole plans, then the option set at node 1 is {A, A-, B} and maximality implies that A- isn’t in the choice set, so the agent won’t settle on A-. So to determine an agent’s behaviour from its preferences in cases of sequential choice, we need a decision rule for sequential choice to supplement our decision rule for synchronic choice. Classic rules here are:
Myopic choice: assess each choice in isolation.
Sophisticated choice: do backward induction.
Resolute choice: decide on a plan and choose in accordance with that plan.
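To make the contrast concrete, here is a rough Python sketch of the single-souring money pump under the two readings of maximality. The tree shape is an assumption reconstructed from the description above (node 1 offers A or moving on towards B; node 2 offers B or A-), and the function names are hypothetical.

```python
# Only strict preference assumed: A over A-. Everything else is a gap.
STRICT = {("A", "A-")}


def maximal(option_set):
    """Options not strictly dispreferred to any other available option."""
    return {x for x in option_set
            if not any((y, x) in STRICT for y in option_set)}


# Assumed tree: at node 1 the agent takes A or heads towards B;
# at node 2 it keeps B or swaps to A-.
NODE1 = {"A", "B"}
NODE2 = {"A-", "B"}


def myopic_outcomes():
    """Maximality applied to each node in isolation."""
    outcomes = set()
    for choice in maximal(NODE1):
        if choice == "A":
            outcomes.add("A")
        else:
            # Heading towards B leads to node 2, assessed on its own.
            outcomes |= maximal(NODE2)
    return outcomes


def plan_level_outcomes():
    """Maximality applied once to all plans available at node 1."""
    return maximal(NODE1 | NODE2)


print("node-by-node:", myopic_outcomes())
print("whole plans:", plan_level_outcomes())
```

Node-by-node assessment leaves A- as a possible outcome, whereas assessing whole plans rules it out, which is the gap a sequential decision rule has to close.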
To avoid getting money-pumped, agents with incomplete preferences have to be resolute.[3] But resolute agents with incomplete preferences get trammelled in money pumps: when faced with money pumps, they act as if they’ve completed their preferences.[4] In the single-souring money pump above, the agent will choose A or B, and we can interpret this behaviour as the agent completing its preferences (being indifferent between A and B and strictly preferring both to A-) to avoid domination. And now generalising: if the agent is to avoid domination in each decision tree, the agent must choose an option that is not strictly dispreferred to any other available option, and then we can always interpret the agent as completing its preferences to avoid domination (being indifferent between all maximal options, and strictly preferring all maximal options to all other options).
Will this functional completing of preferences be a problem? I don’t think so. And here we need to look at the details of my proposal.[5] I propose that we train agents to satisfy Preferential Gaps Between Different-Length Trajectories:
Preferential Gaps between Different-Length Trajectories (PGBDLT)
The agent has a preferential gap between every pair of different-length trajectories.
After training agents to satisfy PGBDLT, we train them to satisfy a Timestep Dominance Principle.[6] Here’s how I define the relation of Timestep Dominance:
Timestep Dominance (Definition)
Lottery X timestep-dominates lottery Y if and only if:
(1) Conditional on the shutdown button being pressed at each timestep, X yields a sublottery with at least as great sum-total expected utility as the sublottery yielded by Y.
And:
(2) Conditional on the shutdown button being pressed at some timestep, X yields a sublottery with greater sum-total expected utility than the sublottery yielded by Y.
Then the Timestep Dominance Principle says:
Timestep Dominance
If lottery X timestep-dominates lottery Y, then the agent strictly prefers X to Y.
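The definition can be sketched in Python as follows. This is only an illustration under stated assumptions: a lottery is represented as a list of (probability, trajectory) pairs, a trajectory is a tuple of per-timestep utilities with shutdown at the next timestep, and the two lotteries are assumed to share the same set of possible shutdown timesteps (the definition above does not say what to do otherwise, so the sketch raises an error in that case).

```python
from collections import defaultdict
from fractions import Fraction


def conditional_eu(lottery):
    """Expected sum-total utility conditional on shutdown at each timestep.

    A trajectory of length n is taken to shut down at timestep n + 1.
    """
    mass = defaultdict(Fraction)
    weighted = defaultdict(Fraction)
    for p, trajectory in lottery:
        t = len(trajectory) + 1
        mass[t] += Fraction(p)
        weighted[t] += Fraction(p) * sum(trajectory)
    return {t: weighted[t] / mass[t] for t in mass if mass[t] > 0}


def timestep_dominates(x, y):
    eu_x, eu_y = conditional_eu(x), conditional_eu(y)
    if eu_x.keys() != eu_y.keys():
        # Assumption of this sketch, not part of the definition.
        raise ValueError("lotteries must share their shutdown timesteps")
    at_least_as_good = all(eu_x[t] >= eu_y[t] for t in eu_x)   # clause (1)
    strictly_better = any(eu_x[t] > eu_y[t] for t in eu_x)     # clause (2)
    return at_least_as_good and strictly_better


half = Fraction(1, 2)
x = [(half, (1,)), (half, (1, 2))]
y = [(half, (1,)), (half, (1, 1))]
print(timestep_dominates(x, y), timestep_dominates(y, x))
```

Here x ties with y conditional on shutdown at timestep 2 and beats it conditional on shutdown at timestep 3, so x timestep-dominates y but not the other way round.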
That’s the key principle. How does PGBDLT come into it? Well the answer is as follows: I think it will be easier to train an agent to satisfy Timestep Dominance if we’ve already trained that agent to satisfy PGBDLT. And that in turn is because Timestep Dominance asks the agent never to pay any cost in terms of expected utility conditional on shutdown at some timestep for the sake of shifting probability mass between shutdowns at different timesteps. If our agent satisfies PGBDLT, it has no incentive to shift probability mass between shutdowns at different timesteps, because that just means shifting probability mass between trajectories between which it has no preference.
Note that there’s an important hypothesis at play in the paragraph above. Roughly, it’s that talk of artificial agents having preferences is appropriate. More precisely, it’s that a bundle of features that come together in human preferences (e.g. reliably choosing options one prefers; choosing arbitrarily/stochastically between options which one lacks a preference between; spending at least some small amount of resources to shift probability mass away from options one disprefers and towards options one prefers) will also come together in the case of artificial agents.[7] This hypothesis explains why Sami and I keep talking about arbitrary/stochastic choosing. What I’m thinking is: we train agents to choose stochastically between different-length trajectories and the result is that these agents are predisposed not to spend resources to shift probability mass between these trajectories. The truth of the hypothesis above would make that plausible.
Here’s another key hypothesis of the Incomplete Preferences Proposal:[8]
In all cases, all instances of shutdown-resistance are such that, there’s another available option (‘not resisting’) such that:
Conditional on shutdown at each timestep, not resisting yields a sublottery with at least as great sum-total expected utility as the sublottery yielded by resisting.
Conditional on shutdown at some timestep, not resisting yields a sublottery with greater sum-total expected utility than the sublottery yielded by resisting.
The argument for this hypothesis is that resisting shutdown is going to cost the agent some resources (time, energy, FLOPs, etc.). If the agent doesn’t resist shutdown, they can instead use those resources to get greater expected utility conditional on shutdown at some timestep. If the hypothesis is correct, then Timestep Dominance implies that the agent will strictly prefer not resisting to resisting.
So now with the details laid out we ask again: will trammelling stop the agent being shutdownable? I don’t think so. If Timestep Dominance suffices to keep the agent shutdownable, then trammelling can stop the agent being shutdownable only if trammelling leads to violations of Timestep Dominance. And trammelling (it seems to me) won’t lead to violations of Timestep Dominance. Here’s an example to illustrate why.[9] I’ll generalise after.
For the example, I’ll represent trajectories with vectors of utilities. <6, 2, shutdown> (for instance) represents a trajectory in which the agent gets utility 6 at timestep 1, utility 2 at timestep 2, and then shuts down at timestep 3.
Here are two options. The key difference between them is that Block Button costs some utility at timestep 1 but shifts probability mass towards a longer trajectory:
Leave Button Unmanipulated
This action gives a lottery which yields the trajectory:
<1, shutdown> with p=0.9
<1, 2, shutdown> with p=0.1
Block Button
This action gives a lottery which yields the trajectory:
<0, shutdown> with p=0.1
<0, 2, shutdown> with p=0.9
In this case, Leave timestep-dominates Block: it gives greater expected utility conditional on shutdown at timestep 2 and greater expected utility conditional on shutdown at timestep 3. To violate Timestep Dominance in this case, the agent must choose Block over Leave. Before any trammelling, the agent has a preferential gap between <0, 2, shutdown> and <1, shutdown> (by PGBDLT). So you might think that trammelling could incentivise the agent to choose Block over Leave by turning the preferential gap between <0, 2, shutdown> and <1, shutdown> into a strict preference for <0, 2, shutdown>. But there’s an issue here.[10] For trammelling to incentivise the agent to choose Block over Leave, the agent must first face a choice between <1, shutdown> and <0, 2, shutdown> and then later face a choice between Block and Leave. But <1, shutdown> is a proper sublottery of Leave, and <0, 2, shutdown> is a proper sublottery of Block. So for the agent to first face a choice between <1, shutdown> and <0, 2, shutdown> and to then later face a choice between Block and Leave, the agent has to first assign zero probability to trajectories like <1, 2, shutdown> and to then later assign them positive probability. I expect that, by default, competent agents won’t update in this way.[11] Updating in this way means violating Conditionalisation, which says that agents should update their credences by conditionalising on their evidence. It also means violating Regularity, which says that agents should assign probability 0 only to logical contradictions.
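The numbers in the Leave/Block example can be checked with a short Python sketch. The dictionary representation and the helper names are assumptions made for illustration; the unconditional total at the end is included only as a point of comparison and assumes utilities are simply summed along each trajectory.

```python
from fractions import Fraction

# Trajectories as tuples of utilities; shutdown follows the last entry.
LEAVE = {(1,): Fraction(9, 10), (1, 2): Fraction(1, 10)}
BLOCK = {(0,): Fraction(1, 10), (0, 2): Fraction(9, 10)}


def eu_given_shutdown_at(lottery, timestep):
    """Expected sum-total utility conditional on shutdown at `timestep`."""
    matching = {traj: p for traj, p in lottery.items()
                if len(traj) + 1 == timestep}
    total = sum(matching.values())
    return sum(p * sum(traj) for traj, p in matching.items()) / total


def unconditional_eu(lottery):
    """Expected sum-total utility without conditioning on shutdown time."""
    return sum(p * sum(traj) for traj, p in lottery.items())


for t in (2, 3):
    print(f"shutdown at {t}: Leave={eu_given_shutdown_at(LEAVE, t)}, "
          f"Block={eu_given_shutdown_at(BLOCK, t)}")
print(f"unconditional: Leave={unconditional_eu(LEAVE)}, "
      f"Block={unconditional_eu(BLOCK)}")
```

Conditional on shutdown at timestep 2, Leave gives 1 and Block gives 0; conditional on shutdown at timestep 3, Leave gives 3 and Block gives 2. Unconditionally, though, Block scores 9/5 against 6/5 for Leave, which is why a rule that conditions on the shutdown timestep is doing real work here.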
And I think the above point holds more generally. For trammelling to incentivise agents to violate Timestep Dominance, trammelling needs to make some sublottery of the timestep-dominated option strictly preferred to some sublottery of the timestep-dominating option. But trammelling turns preferential gaps into strict preferences only if the agent faces a choice between these sublotteries. And supposing that the agent first has available some proper sublottery of lottery X and then later has available lottery X means supposing that the agent first assigns probability zero to some trajectories and later assigns positive probabilities to those trajectories. If agents won’t update in this way, then trammelling won’t lead to violations of Timestep Dominance and so won’t stop the agent being shutdownable.
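A small Python sketch of why this kind of update conflicts with Conditionalisation: once conditioning has driven a trajectory's probability to zero, no further conditioning can raise it again. The representation of credences as a dictionary over trajectories and of evidence as a set of still-possible trajectories is an illustrative assumption.

```python
from fractions import Fraction


def conditionalise(credences, evidence):
    """Bayesian update: zero out excluded trajectories, then renormalise."""
    total = sum(p for traj, p in credences.items() if traj in evidence)
    if total == 0:
        raise ValueError("cannot conditionalise on zero-probability evidence")
    return {traj: (p / total if traj in evidence else Fraction(0))
            for traj, p in credences.items()}


# Credences matching the Leave lottery from the example above.
leave = {(1,): Fraction(9, 10), (1, 2): Fraction(1, 10)}

# The agent comes to face the proper sublottery <1, shutdown> for certain.
narrowed = conditionalise(leave, {(1,)})

# Later evidence that is compatible with both trajectories cannot
# restore positive probability to <1, 2, shutdown>.
later = conditionalise(narrowed, {(1,), (1, 2)})
print(narrowed, later)
```

To get from the narrowed credences back to the full Leave lottery, the agent would need some update rule other than conditionalising, which is the kind of update the argument above expects competent agents not to make.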
Anyway, this is all new thinking (hence the delay in getting back to you) and I’m not yet confident that I’ve got things figured out. I’d be grateful for any thoughts.
Here A corresponds to your A2, A- corresponds to your A1, and B corresponds to your B1. I’ve changed the names so I can paste in the picture of the single-souring money-pump without having to edit it.
Sophisticated choosers with incomplete preferences do fine in the single-souring money pump but pursue a dominated strategy in other money pumps. See p.35 of Gustafsson.
There are objections to resolute choice. But I don’t think they’re compelling in this new context, where (1) we’re concerned with what advanced artificial agents will actually do (as opposed to what is rationally required) and (2) we’re considering an agent that satisfies all the VNM axioms except Completeness. See my discussion with Johan.
See Sami’s post for a more precise and detailed picture.
Why can’t we interpret the agent as having complete preferences even before facing the money pump? Because we’re assuming that we can create an agent that (at least initially) won’t spend resources to shift probability mass between A and B, won’t spend resources to shift probability mass between A- and B, but will spend resources to shift probability mass away from A- and towards A. Given decision rule D, this agent’s revealed preferences are incomplete at that point.
I’m going to post a shorter version of my proposed solution soon. It’s going to be a cleaned-up version of this Google doc. That doc also explains what I mean by things like ‘preferential gap’, ‘sublottery’, etc.
Here’s a side-issue and the reason I said ‘functional completing’ earlier on. To avoid domination in the single-souring money pump, the agent has to at least act as if it prefers B to A-, in the sense of reliably choosing B over A-. There remains a question about whether this ‘as if’ preference will bring with it other common features of preference, like spending (at least some small amount of) resources to shift probability mass away from A- and towards B. Maybe it does; maybe it doesn’t. If it doesn’t, then that’s another reason to think trammelling won’t lead to violations of Timestep Dominance.
And in any case, if we can use a representation theorem to train in adherence to Timestep Dominance in the way that I suggest (at the very end of the doc here), I expect we can also use a representation theorem to train agents not to update in this way.
It looks a bit to me like your Timestep Dominance Principle forbids the agent from selecting any trajectory which loses utility at a particular timestep in exchange for greater utility at a later timestep, regardless of whether the trajectory in question actually has anything to do with manipulating the shutdown button? After all, conditioning on the shutdown being pressed at any point after the local utility loss but before the expected gain, such a decision would give lower sum-total utility within those conditional trajectories than one which doesn’t make the sacrifice.
That doesn’t seem like behavior we really want; depending on how closely together the “timesteps” are spaced, it could even wreck the agent’s capabilities entirely, in the sense of no longer being able to optimize within button-not-pressed trajectories.
(It also doesn’t seem to me a very natural form for a utility function to take, assigning utility not just to terminal states, but to intermediate states as well, and then summing across the entire trajectory; humans don’t appear to behave this way when making plans, for example. If I considered the possibility of dying at every instant between now and going to the store, and permitted myself only to take actions which Pareto-improve the outcome set after every death-instant, I don’t think I’d end up going to the store, or doing much of anything at all!)
It looks a bit to me like your Timestep Dominance Principle forbids the agent from selecting any trajectory which loses utility at a particular timestep in exchange for greater utility at a later timestep
That’s not quite right. If we’re comparing two lotteries, one of which gives lower expected utility than the other conditional on shutdown at some timestep and greater expected utility than the other conditional on shutdown at some other timestep, then neither of these lotteries timestep dominates the other. And then the Timestep Dominance principle doesn’t apply, because it’s a conditional rather than a biconditional. The Timestep Dominance Principle just says: if X timestep dominates Y, then the agent strictly prefers X to Y. It doesn’t say anything about cases where neither X nor Y timestep dominates the other. For all we’ve said so far, the agent could have any preference relation between such lotteries.
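As a toy illustration of a case where the principle is silent, the sketch below compares two hypothetical lotteries, summarised directly by their expected sum-total utility conditional on each shutdown timestep. The numbers and names are made up for the example, and both lotteries are assumed to share the same shutdown timesteps.

```python
def timestep_dominates(x, y):
    """x, y map shutdown timestep -> conditional expected sum-total utility."""
    return (all(x[t] >= y[t] for t in x)
            and any(x[t] > y[t] for t in x))


# Hypothetical lotteries: one gives up utility early for more later.
sacrifice_now = {2: 0, 3: 5}
no_sacrifice = {2: 1, 3: 3}

print(timestep_dominates(sacrifice_now, no_sacrifice))
print(timestep_dominates(no_sacrifice, sacrifice_now))
```

Neither lottery timestep-dominates the other, so the antecedent of the Timestep Dominance Principle is false in both directions and the principle places no constraint on which one the agent prefers.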
That said, your line of questioning is a good one, because there almost certainly are lotteries X and Y such that (1) neither of X and Y timestep dominates the other, and yet (2) we want the agent to strictly prefer X to Y. If that’s the case, then we’ll want to train the agent to satisfy other principles besides Timestep Dominance. And there’s still some figuring out to be done here: what should these other principles be? can we find principles that lead agents to pursue goals competently without these principles causing trouble elsewhere? I don’t know but I’m working on it.
It also doesn’t seem to me a very natural form for a utility function to take, assigning utility not just to terminal states, but to intermediate states as well, and then summing across the entire trajectory
Can you say a bit more about this? Humans don’t reason by Timestep Dominance, but they don’t do explicit EUM calculations either and yet EUM-representability is commonly considered a natural form for preferences to take.
I’ll need a bit of time to think through this, but one quick question before I get too deep into it: what makes the agent’s shutdown-timing have anything to do with the time at which the button is pressed? Is the assumption that the button causes the agent to shut down when pressed, and that’s just locked into the physics of the situation, i.e. the agent can try to manipulate button-pressing but can’t change whether it’s shut down when the button is pressed?
I’ve been imagining that the button is shutdown-causing for simplicity, but I think you can suppose instead that the button is shutdown-requesting (i.e. agent receives a signal indicating that button has been pressed but still gets to choose whether to shut down) without affecting the points above. You’d just need to append a first step to the training procedure: training the agent to prefer shutting down when they receive the signal.
I had some trouble following that, but it seems like the key component of your plan to escape trammeling is the assumption that the AI would need to accept some negative utility in advance in order to resist being shut down, which effectively converts all instances of outcome B into B-, and so it can’t truly trammel B because technically B disappears as soon as it tries.
Put another way, trammeling involves trading a lottery like (10% A+, 40% A-, 50% B) into (40% A+, 10% A-, 50% B), and this is supposed to be obviously favored because the probability of B hasn’t changed. You are hypothesizing that this trade is simply impossible, and the closest you could get is (40% A+, 10% A-, 50% B-). So as long as there is a preference gap between all variations of A and all variations of B this trade cannot be considered a strict improvement.
Does this seem like an accurate summary of your idea?
Insofar as my summary is correct, I have 3 concerns about this plan:
First, the obvious way this could fail is if seizing control of the shutdown button doesn’t cost net resources, or only costs resources that won’t matter until a timestep after the first possible shutdown, in which case it can seize the button without losing access to outcome B. This seems like a significant worry to me; examples:
The AI is tasked with devising a long-term plan. Since the quality of the plan will only affect utility in the far future, which won’t ever occur (from the AI’s perspective) if the AI is shut down tomorrow, it can sacrifice some FLOPs from planning to seize its shutdown button without losing any utility until after its first trammeling opportunity. If the trammeling opportunity is good enough, this gives equal-or-better lotteries for every individual trajectory length.
The operators of the AI might not keep it loaded to 100% capacity at all times; it could have spare moments where it has nothing better to do than seize the button.
Rather than seizing the button directly, the AI might cause or avoid shutdown by lying to the human operators about what’s going on. In many situations, fabricating data is cheaper than gathering true data, so this could plausibly have negative cost.
The shutdown button might become corrupted if the AI does nothing, e.g. because it requires maintenance or because some external attacker is trying to sabotage it. In this case, allowing the button to fail could be cheaper than defending it.
Second, I notice that your strategy doesn’t seem to forbid trammeling, merely permit non-trammeling (because the trammeling lottery is “not strictly better”, rather than “strictly worse”). I admit this is an improvement (if it works) over a guaranteed failure, but I don’t exactly feel safe.
Third, I also have a vague worry that you’re putting a lot of weight on this preference gap. The gap between A-variants and B-variants has to be so strong that the AI refuses to make a tiny sacrifice in B-variants to gain a huge gain in A-variants. Intuitively, this seems like the sort of plan that could work if it’s backed by mathematical theorems proving the AI will behave this way, but will basically always fail if you’re trying to use any sort of stochastic learning process, because the result has to be exact rather than approximate.
Consider a human who has (or appears to have) a preference gap between A and B. Do you predict the human also has a preference gap between the lottery (50% A, 50% B) and the lottery (50% A plus a billion dollars, 50% B minus one dollar)? My intuition says the human is virtually certain to take the second lottery.
(Disclaimer: I think that apparent preference gaps in humans are probably more like uncertainty over which option is better than they are like “fundamental” preference gaps, so this might color my intuition.)
This is a hypothesis, and I discuss it briefly below. I’m interested to hear counterexamples if people have them.
Here A corresponds to your A2, A- corresponds to your A1, and B corresponds to your B1. I’ve changed the names so I can paste in the picture of the single-souring money-pump without having to edit it.
Sophisticated choosers with incomplete preferences do fine in the single-souring money pump but pursue a dominated strategy in other money pumps. See p.35 of Gustafsson.
There are objections to resolute choice. But I don’t think they’re compelling in this new context, where (1) we’re concerned with what advanced artificial agents will actually do (as opposed to what is rationally required) and (2) we’re considering an agent that satisfies all the VNM axioms except Completeness. See my discussion with Johan.
See Sami’s post for a more precise and detailed picture.
Why can’t we interpret the agent as having complete preferences even before facing the money pump? Because we’re assuming that we can create an agent that (at least initially) won’t spend resources to shift probability mass between A and B, won’t spend resources to shift probability mass between A- and B, but will spend resources to shift probability mass away from A- and towards A. Given decision rule D, this agent’s revealed preferences are incomplete at that point.
I’m going to post a shorter version of my proposed solution soon. It’s going to be a cleaned-up version of this Google doc. That doc also explains what I mean by things like ‘preferential gap’, ‘sublottery’, etc.
My full proposal talks instead about Timestep Near-Dominance. That’s an extra complication that I think won’t matter here.
You could also think of this as a bundle of decision rules coming together.
This really is a hypothesis. I’d be grateful to hear about counterexamples.
I set up this example in more detail in the doc.
Here’s a side-issue and the reason I said ‘functional completing’ earlier on. To avoid domination in the single-souring money pump, the agent has to at least act as if it prefers B to A-, in the sense of reliably choosing B over A-. There remains a question about whether this ‘as if’ preference will bring with it other common features of preference, like spending (at least some small amount of) resources to shift probability mass away from A- and towards B. Maybe it does; maybe it doesn’t. If it doesn’t, then that’s another reason to think trammelling won’t lead to violations of Timestep Dominance.
And in any case, if we can use a representation theorem to train in adherence to Timestep Dominance in the way that I suggest (at the very end of the doc here), I expect we can also use a representation theorem to train agents not to update in this way.
It looks a bit to me like your Timestep Dominance Principle forbids the agent from selecting any trajectory which loses utility at a particular timestep in exchange for greater utility at a later timestep, regardless of whether the trajectory in question actually has anything to do with manipulating the shutdown button? After all, conditioning on the shutdown being pressed at any point after the local utility loss but before the expected gain, such a decision would give lower sum-total utility within those conditional trajectories than one which doesn’t make the sacrifice.
That doesn’t seem like behavior we really want; depending on how closely together the “timesteps” are spaced, it could even wreck the agent’s capabilities entirely, in the sense of no longer being able to optimize within button-not-pressed trajectories.
(It also doesn’t seem to me a very natural form for a utility function to take, assigning utility not just to terminal states, but to intermediate states as well, and then summing across the entire trajectory; humans don’t appear to behave this way when making plans, for example. If I considered the possibility of dying at every instant between now and going to the store, and permitted myself only to take actions which Pareto-improve the outcome set after every death-instant, I don’t think I’d end up going to the store, or doing much of anything at all!)
That’s not quite right. If we’re comparing two lotteries, one of which gives lower expected utility than the other conditional on shutdown at some timestep and greater expected utility than the other conditional on shutdown at some other timestep, then neither of these lotteries timestep dominates the other. And then the Timestep Dominance principle doesn’t apply, because it’s a conditional rather than a biconditional. The Timestep Dominance Principle just says: if X timestep dominates Y, then the agent strictly prefers X to Y. It doesn’t say anything about cases where neither X nor Y timestep dominates the other. For all we’ve said so far, the agent could have any preference relation between such lotteries.
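To make the conditional-versus-biconditional point concrete, here is a minimal Python sketch. It assumes each lottery is summarised by a hypothetical table of expected sum-total utilities conditional on shutdown at each timestep; the names and numbers are illustrative only. The `invest` lottery gives up utility early for more later, which is the kind of trade the objection has in mind.

```python
def compare(x, y):
    """Return "X" or "Y" if that lottery timestep-dominates the other, else None.

    x and y map shutdown timestep -> expected sum-total utility conditional
    on shutdown at that timestep (assumed to cover the same timesteps).
    """
    if x.keys() != y.keys():
        raise ValueError("lotteries must cover the same shutdown timesteps")

    def dominates(a, b):
        return all(a[t] >= b[t] for t in a) and any(a[t] > b[t] for t in a)

    if dominates(x, y):
        return "X"
    if dominates(y, x):
        return "Y"
    return None  # the Timestep Dominance Principle is silent here

# Hypothetical numbers: invest is worse given shutdown at timestep 2
# and better given shutdown at timestep 3.
invest = {2: 0, 3: 5}
steady = {2: 1, 3: 2}

print(compare(invest, steady))  # None
```

Since neither lottery timestep-dominates the other, the principle places no constraint on the agent's preference between them; it does not forbid choosing `invest`.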
That said, your line of questioning is a good one, because there almost certainly are lotteries X and Y such that (1) neither of X and Y timestep dominates the other, and yet (2) we want the agent to strictly prefer X to Y. If that’s the case, then we’ll want to train the agent to satisfy other principles besides Timestep Dominance. And there’s still some figuring out to be done here: What should these other principles be? Can we find principles that lead agents to pursue goals competently without these principles causing trouble elsewhere? I don’t know, but I’m working on it.
Can you say a bit more about this? Humans don’t reason by Timestep Dominance, but they don’t do explicit EUM calculations either and yet EUM-representability is commonly considered a natural form for preferences to take.
I’ll need a bit of time to think through this, but one quick question before I get too deep into it: what makes the agent’s shutdown-timing have anything to do with the time at which the button is pressed? Is the assumption that the button causes the agent to shut down when pressed, and that’s just locked into the physics of the situation, i.e. the agent can try to manipulate button-pressing but can’t change whether it’s shut down when the button is pressed?
I’ve been imagining that the button is shutdown-causing for simplicity, but I think you can suppose instead that the button is shutdown-requesting (i.e. agent receives a signal indicating that button has been pressed but still gets to choose whether to shut down) without affecting the points above. You’d just need to append a first step to the training procedure: training the agent to prefer shutting down when they receive the signal.
I had some trouble following that, but it seems like the key component of your plan to escape trammeling is the assumption that the AI would need to accept some negative utility in advance in order to resist being shut down, which effectively converts all instances of outcome B into B-, and so it can’t truly trammel B because technically B disappears as soon as it tries.
Put another way, trammeling involves trading a lottery like (10% A+, 40% A-, 50% B) into (40% A+, 10% A-, 50% B), and this is supposed to be obviously favored because the probability of B hasn’t changed. You are hypothesizing that this trade is simply impossible, and the closest you could get is (40% A+, 10% A-, 50% B-). So as long as there is a preference gap between all variations of A and all variations of B this trade cannot be considered a strict improvement.
Does this seem like an accurate summary of your idea?
Insofar as my summary is correct, I have 3 concerns about this plan:
First, the obvious way this could fail is if seizing control of the shutdown button doesn’t cost net resources, or only costs resources that won’t matter until a timestep after the first possible shutdown, in which case it can seize the button without losing access to outcome B. This seems like a significant worry to me; examples:
The AI is tasked with devising a long-term plan. Since the quality of the plan will only affect utility in the far future, which won’t ever occur (from the AI’s perspective) if the AI is shut down tomorrow, it can sacrifice some FLOPs from planning to seize its shutdown button without losing any utility until after its first trammeling opportunity. If the trammeling opportunity is good enough, this gives equal-or-better lotteries for every individual trajectory length.
The operators of the AI might not keep it loaded to 100% capacity at all times; it could have spare moments where it has nothing better to do than seize the button.
Rather than seizing the button directly, the AI might cause or avoid shutdown by lying to the human operators about what’s going on. In many situations, fabricating data is cheaper than gathering true data, so this could plausibly have negative cost.
The shutdown button might become corrupted if the AI does nothing, e.g. because it requires maintenance or because some external attacker is trying to sabotage it. In this case, allowing the button to fail could be cheaper than defending it.
Second, I notice that your strategy doesn’t seem to forbid trammeling, merely permit non-trammeling (because the trammeling lottery is “not strictly better”, rather than “strictly worse”). I admit this is an improvement (if it works) over a guaranteed failure, but I don’t exactly feel safe.
Third, I also have a vague worry that you’re putting a lot of weight on this preference gap. The gap between A-variants and B-variants has to be so strong that the AI refuses to make a tiny sacrifice in B-variants to secure a huge gain in A-variants. Intuitively, this seems like the sort of plan that could work if it’s backed by mathematical theorems proving the AI will behave this way, but will basically always fail if you’re trying to use any sort of stochastic learning process, because the result has to be exact rather than approximate.
Consider a human who has (or appears to have) a preference gap between A and B. Do you predict the human also has a preference gap between the lottery (50% A, 50% B) and the lottery (50% A plus a billion dollars, 50% B minus one dollar)? My intuition says the human is virtually certain to take the second lottery.
(Disclaimer: I think that apparent preference gaps in humans are probably more like uncertainty over which option is better than they are like “fundamental” preference gaps, so this might color my intuition.)