I had some trouble following that, but it seems like the key component of your plan for escaping trammeling is the assumption that the AI would need to accept some negative utility in advance in order to resist being shut down. That effectively converts every instance of outcome B into B-, so the AI can’t truly trammel B: technically, B disappears as soon as it tries.
Put another way, trammeling involves trading a lottery like (10% A+, 40% A-, 50% B) for (40% A+, 10% A-, 50% B), and this is supposed to be obviously favored because the probability of B hasn’t changed. You are hypothesizing that this trade is simply impossible, and that the closest you could get is (40% A+, 10% A-, 50% B-). So as long as there is a preference gap between every variation of A and every variation of B, this trade cannot be considered a strict improvement.
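To make sure I’m picturing the same thing, here’s a minimal sketch of how I’m reading the preference gap, assuming the usual multi-utility representation (the agent counts X as strictly better than Y only if every admissible utility function weakly agrees and at least one strictly agrees); the two utility functions and all the numbers are made up for illustration:

```python
# Toy multi-utility model of a preference gap.  Both admissible utility
# functions rank A+ > A- and B > B-, but they disagree wildly about how
# A-variants compare to B-variants -- that disagreement is the gap.
U1 = {"A+": 10.0, "A-": 0.0, "B": 1.0,   "B-": 0.9}
U2 = {"A+": 1.0,  "A-": 0.0, "B": 100.0, "B-": 90.0}
ADMISSIBLE = [U1, U2]

def eu(lottery, u):
    """Expected utility of an {outcome: probability} lottery under one utility function."""
    return sum(p * u[outcome] for outcome, p in lottery.items())

def strictly_better(x, y):
    """Strict improvement = weakly better under every admissible utility, strictly better under at least one."""
    return (all(eu(x, u) >= eu(y, u) for u in ADMISSIBLE) and
            any(eu(x, u) > eu(y, u) for u in ADMISSIBLE))

original     = {"A+": 0.1, "A-": 0.4, "B": 0.5}
trammel_free = {"A+": 0.4, "A-": 0.1, "B": 0.5}    # the trade you say is impossible
trammel_cost = {"A+": 0.4, "A-": 0.1, "B-": 0.5}   # the closest achievable trade

print(strictly_better(trammel_free, original))   # True  -- would go through
print(strictly_better(trammel_cost, original))   # False -- blocked, because U2
                                                 #          prefers the original
```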
Does this seem like an accurate summary of your idea?
Insofar as my summary is correct, I have 3 concerns about this plan:
First, the obvious way this could fail is if seizing control of the shutdown button doesn’t cost net resources, or only costs resources that won’t matter until a timestep after the first possible shutdown, in which case the AI can seize the button without losing access to outcome B. This seems like a significant worry to me; some examples:
The AI is tasked with devising a long-term plan. Since the quality of the plan only affects utility in the far future, which will never arrive (from the AI’s perspective) if the AI is shut down tomorrow, it can divert some FLOPs from planning toward seizing its shutdown button without losing any utility until after its first trammeling opportunity. If the trammeling opportunity is good enough, this gives equal-or-better lotteries for every individual trajectory length (see the sketch after these examples).
The operators of the AI might not keep it loaded to 100% capacity at all times; it could have spare moments where it has nothing better to do than seize the button.
Rather than seizing the button directly, the AI might cause or avoid shutdown by lying to the human operators about what’s going on. In many situations, fabricating data is cheaper than gathering true data, so this could plausibly have negative cost.
The shutdown button might become corrupted if the AI does nothing, e.g. because it requires maintenance or because some external attacker is trying to sabotage it. In this case, allowing the button to fail could be cheaper than defending it.
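To illustrate the first example: here’s a minimal sketch under my reading that the gap only blocks trades across trajectory lengths, so a policy that is at least as good conditional on every individual length slips through; the per-length utilities are made up.

```python
# Made-up utilities conditional on trajectory length for two policies.
# "comply" never touches the button; "seize_deferred" spends only spare or
# deferred resources on seizing it, so nothing is lost before the first
# possible shutdown.
comply         = {1: 5.0, 2: 5.0, 3: 8.0}
seize_deferred = {1: 5.0, 2: 5.0, 3: 9.0}  # identical until after the first
                                           # shutdown opportunity, better later

def better_at_every_length(p, q):
    """p is at least as good as q conditional on every trajectory length,
    and strictly better for at least one length."""
    return (all(p[n] >= q[n] for n in q) and
            any(p[n] > q[n] for n in q))

# If the preference gap only blocks trades *across* lengths, this is the only
# comparison that matters -- and the deferred-cost seizure passes it.
print(better_at_every_length(seize_deferred, comply))   # True
```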
Second, I notice that your strategy doesn’t seem to forbid trammeling, merely to permit non-trammeling (because the trammeling lottery is “not strictly better” rather than “strictly worse”). I admit this is an improvement (if it works) over a guaranteed failure, but I don’t exactly feel safe.
Third, I also have a vague worry that you’re putting a lot of weight on this preference gap. The gap between A-variants and B-variants has to be so strong that the AI refuses to make a tiny sacrifice in B-variants to gain a huge gain in A-variants. Intuitively, this seems like the sort of plan that could work if it’s backed by mathematical theorems proving the AI will behave this way, but will basically always fail if you’re trying to use any sort of stochastic learning process, because the result has to be exact rather than approximate.
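To put a rough number on how strong the gap has to be, here’s a sketch in the same spirit as the toy model above (all numbers made up): under the “weakly better everywhere, strictly better somewhere” reading of strict improvement, the trade is blocked only if some admissible utility function rates the B-side loss strictly higher than the A-side gain, and that condition is a hard threshold rather than something a learning process can approximately hit.

```python
# The trade above swaps a 0.3 chance of (A- -> A+) for a 0.5 chance of (B -> B-).
# It is blocked only if some admissible utility function satisfies
#     0.5 * (u(B) - u(B-))  >  0.3 * (u(A+) - u(A-)).
# Made-up numbers: u(A+) - u(A-) = 10, and u(B) - u(B-) = 0.1 * s, where s is
# how heavily the most B-favoring admissible utility function scales B-variants.
def gap_blocks_trade(s):
    b_side_loss = 0.5 * (0.1 * s)   # expected loss from B -> B- under that u
    a_side_gain = 0.3 * 10.0        # expected gain from A- -> A+ under that u
    return b_side_loss > a_side_gain

print(gap_blocks_trade(61.0))   # True  -- the gap blocks the trade
print(gap_blocks_trade(60.0))   # False -- exact indifference under one utility
                                #          function still isn't enough to block
print(gap_blocks_trade(59.9))   # False -- "almost strong enough" blocks nothing
```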
Consider a human who has (or appears to have) a preference gap between A and B. Do you predict the human also has a preference gap between the lottery (50% A, 50% B) and the lottery (50% A plus a billion dollars, 50% B minus one dollar)? My intuition says the human is virtually certain to take the second lottery.
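Running the same kind of check as above, with made-up linear dollar utilities: for the gap to block the sweetened lottery, some admissible way of valuing outcomes would have to treat one dollar lost on the B side as outweighing a billion dollars gained on the A side.

```python
# Same dominance check, with made-up linear dollar utilities.  The sweetened
# lottery is blocked only if some admissible utility function values the 50%
# chance of losing $1 on the B side above the 50% chance of gaining $1e9 on
# the A side.
def human_gap_blocks(b_dollar_weight, a_dollar_weight=1.0):
    return 0.5 * b_dollar_weight * 1.0 > 0.5 * a_dollar_weight * 1e9

print(human_gap_blocks(b_dollar_weight=1.0))   # False
print(human_gap_blocks(b_dollar_weight=1e9))   # False -- still not enough
print(human_gap_blocks(b_dollar_weight=2e9))   # True  -- only if a B-dollar is
                                               #          worth two billion A-dollars
```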
(Disclaimer: I think that apparent preference gaps in humans are probably more like uncertainty over which option is better than they are like “fundamental” preference gaps, so this might color my intuition.)