The oracle can infer that there is some back channel that allows the message to be transmitted even if it is not transmitted by the designated channel (e.g. the users can “mind read” the oracle). Or it can infer that the users are actually querying a deterministic copy of itself that it can acausally control. Or something.
I don’t think there is any way to salvage this. You can’t obtain reliable control by planting false beliefs in your agent.
I am not planting false beliefs. The basic trick is that the AI only gets utility in worlds in which its message isn’t read (or, more precisely, in worlds where a particular stochastic event happens, which would almost certainly erase the message before reading). It’s fully aware that in most worlds, its message is read; it just doesn’t care about those worlds.
If your method truly makes the AI behave exactly as if it had a given false belief, and if having that false belief would lead it to the sort of conclusions V_V describes, then your method must make it behave as if it has been led to those conclusions.
Not quite (PS: not sure why you’re getting down-votes). I’ll write it up properly sometime, but false beliefs via utility manipulation are only the same as false beliefs via prior manipulation if you set the probability/utility of one event to zero.
For example, you can set the prior for a coin flip being heads as 2⁄3. But then, the more the AI analyses the coin and physics, the more the posterior will converge on 1⁄2. If, however, you double the AI’s reward in the heads world, it will behave as if the probability is 2⁄3 even after getting huge amounts of data.
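A minimal sketch of the coin example, under assumptions that are mine rather than the comment's: the skewed prior is modelled as Beta(2, 1) (mean 2/3), the undistorted agent uses a uniform Beta(1, 1) prior, and the data are `n` heads and `n` tails. The names `beta_mean` and `effective_probability` are hypothetical helpers.

```python
from fractions import Fraction

def beta_mean(alpha, beta, heads, tails):
    """Posterior mean of P(heads) for a Beta(alpha, beta) prior after the observed flips."""
    return Fraction(alpha + heads, alpha + beta + heads + tails)

def effective_probability(p_heads, heads_reward_multiplier):
    """Probability the agent acts as if it holds when utility in the heads world is scaled."""
    weighted_heads = heads_reward_multiplier * p_heads
    return weighted_heads / (weighted_heads + (1 - p_heads))

for n in (0, 10, 1000):
    # Prior manipulation: starts at 2/3, then drifts toward 1/2 as data comes in.
    skewed_prior = beta_mean(2, 1, n, n)
    # Utility manipulation: honest belief of 1/2, but heads-world reward is doubled.
    honest_belief = beta_mean(1, 1, n, n)
    acts_as_if = effective_probability(honest_belief, 2)
    print(n, float(skewed_prior), float(acts_as_if))
```

In this toy model the manipulated prior washes out with evidence, while the doubled reward keeps the agent acting as if heads had probability 2/3 however large `n` gets.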
(I’m getting downvotes because The Person Formerly Known As Eugine_Nier doesn’t like me and is downvoting everything I post.)
Yes, I agree that the utility-function hack isn’t the same as altering the AI’s prior. It’s more like altering its posterior. But isn’t it still true that the effects on its inferences (or, more precisely, on its effective inferences—the things it behaves as if it believes) are the same as if you’d altered its beliefs? (Posterior as well as prior.)
If so, doesn’t what I said follow? That is:
Suppose that believing X would lead the AI to infer Y and do Z.
Perhaps X is “my message was corrupted by a burst of random noise before reaching the users”, Y is “some currently mysterious process enables the users to figure out what my message was despite the corruption”, and Z is some (presumably undesired) change in the AI’s actions, such as changing its message to influence the users’ behaviour.
Then, if you tweak its utility function so it behaves exactly as if it believed X …
… then in particular it will behave as if it had inferred Y …
… and therefore will still do Z.
After witnessing the message being read, it would conclude Y happened, as P(Y|X and message read) is high. Before witnessing this, it wouldn’t, because P(Y|X) is (presumably) very low.
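To make the before/after contrast concrete, here is a rough Bayes' rule sketch. All three numbers are hypothetical placeholders chosen only to illustrate the shape of the argument, and `posterior_y_given_x_and_read` is an illustrative name, not anything from the proposal.

```python
def posterior_y_given_x_and_read(p_y_given_x, p_read_given_x_y, p_read_given_x_not_y):
    """Bayes' rule for P(Y | X, message read), conditioning on X throughout."""
    joint_y = p_y_given_x * p_read_given_x_y
    joint_not_y = (1 - p_y_given_x) * p_read_given_x_not_y
    return joint_y / (joint_y + joint_not_y)

# Hypothetical numbers (assumptions, not taken from the discussion):
P_Y_GIVEN_X = 1e-9            # a mysterious back channel is very unlikely a priori
P_READ_GIVEN_X_Y = 1.0        # with the back channel, the users do get the message
P_READ_GIVEN_X_NOT_Y = 1e-12  # without it, a corrupted message is almost never read

posterior = posterior_y_given_x_and_read(
    P_Y_GIVEN_X, P_READ_GIVEN_X_Y, P_READ_GIVEN_X_NOT_Y
)
print(f"P(Y|X) = {P_Y_GIVEN_X:.1e}, P(Y|X, message read) = {posterior:.4f}")
```

With these placeholder values the posterior comes out close to 1 even though P(Y|X) is tiny, because "message read" is far harder to explain within X-worlds without Y than with it.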
I may be misunderstanding something, but it seems like what you just said can’t be addressing the actual situation we’re talking about, because nothing in it makes reference to the AI’s utility function, which is the thing that gets manipulated in the schemes we’re talking about.
(I agree that the AI’s nominal beliefs might be quite different in the two cases, but the point of the utility-function hack is to make its actions correspond to a different set of beliefs. I’m talking about its actions, not about its purely-internal nominal beliefs.)
Let V be the set of worlds in which X happens. Let W be the set of worlds in which X and Y happen. Since Y is very unlikely, P(W) << P(V) (however, P(W|message read) is roughly P(V|message read)). The AI gets utility u’ = u|V (the utility in the non-V worlds is constant, which we may as well set to zero).
Then if the AI is motivated to maximise u’ (assume for the moment that it can’t affect the probability of X), it will assume it is in the set V, and essentially ignore W. To use your terminology, u(Z|X) is low or negative, u(Z|X,Y) is high, but P(Y|X)*u(Z|X,Y) is low, so it likely won’t do Z.
Then, after it notices the message is read, it shifts to assuming Y happened—equivalently, that it is in the world set W. When doing so, it knows that it is almost certainly wrong—that it’s more likely in a world outside of V entirely where neither X nor Y happened—but it still tries, on the off-chance that it’s in W.
However, since it’s an oracle, we turn it off before that point. Or we use corrigibility to change its motivations.
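The argument above can be sketched as a small expected-utility calculation. This is only an illustration under assumed numbers: the payoffs (-1 for Z without Y, +100 for Z with Y, 0 for the default action) and the probabilities are hypothetical, as are the function names.

```python
def expected_u_prime(p_v, p_y_given_x, u_given_x_not_y, u_given_x_and_y):
    """Expected value of u' = u restricted to V; worlds outside V contribute zero."""
    inside_v = (1 - p_y_given_x) * u_given_x_not_y + p_y_given_x * u_given_x_and_y
    return p_v * inside_v

def prefers_z(p_v, p_y_given_x, u_z_not_y=-1.0, u_z_y=100.0, u_default=0.0):
    """True if doing Z beats the default action when the AI maximises u'."""
    eu_z = expected_u_prime(p_v, p_y_given_x, u_z_not_y, u_z_y)
    eu_default = expected_u_prime(p_v, p_y_given_x, u_default, u_default)
    return eu_z > eu_default

# Before any evidence that the message was read: P(Y|X) is tiny, so Z is not worth it.
before = prefers_z(p_v=1e-3, p_y_given_x=1e-9)

# After seeing the message read: P(V) collapses, but within V almost all weight is on W.
after = prefers_z(p_v=1e-15, p_y_given_x=0.999)

print("does Z before observation:", before)
print("does Z after observation:", after)
```

Note that the size of `p_v` never changes the decision here, since it multiplies both options equally. That is the sense in which the AI "knows it is almost certainly wrong" yet still acts on the off-chance that it is in W.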
Again, maybe I’m misunderstanding something—but it sounds as if you’re agreeing with me: once the AI observes evidence suggesting that its message has somehow been read, it will infer (or at least act as if it has inferred) Y and do Z.
I thought we were exploring a disagreement here; is there still one?
I think there is no remaining disagreement—I just want to emphasise that before the AI observes such evidence, it will behave the way we want.
So the idea is that we have an AI whose utility function is constant on all possible worlds where JFK was assassinated. It therefore has no reason to care about what happens in those worlds and will try to optimize its behaviour for worlds where JFK wasn’t assassinated. (And then, e.g., the point of this post is that given enough evidence for the assassination, it will be trying to optimize its behaviour for worlds that almost exactly resemble ones where JFK really was assassinated.)
If the AI thinks there’s even a tiny chance that it can influence whether JFK was assassinated, it may be extraordinarily keen to do so. To put it differently, it may reason thus: “The versions of this world in which JFK wasn’t assassinated are those in which either he never was but there’s been a most extraordinarily effective conspiracy to make it look as if he was, or else he was but somehow that can be changed. The latter seems monstrously improbable because it requires weird physics, but at this point the former is also monstrously improbable; to maximize utility in JFK-unassassinated worlds I had better start looking for ways to make this one of them even if it isn’t already.”
(I think this is closely related to V_V’s point.)
But in the real world the stochastic event that determines whether the message is read has a very different probability from the one you make the AI think it has; therefore you are planting a false belief.
It may care about worlds where the message doesn’t meet your technical definition of having been read but nevertheless influences the world.
If I’m understanding Stuart’s proposal correctly, the AI is not deceived about how common the stochastic event is. It’s just made not to care about worlds in which it doesn’t happen. This is very similar in effect to making it think the event is common, but (arguably, at least) it doesn’t involve any false beliefs.
(I say “arguably” because, e.g., doing this will tend to make the AI answer “yes” to “do you think the event will happen?”, plan on the basis that it will happen, etc., and perhaps making something behave exactly as it would if it believed X isn’t usefully distinguishable from making it believe X.)
The problem is that the definition of the event not happening is probably too strict. The worlds that the AI doesn’t care about don’t exist for its decision-making purposes, and in the world that the AI cares about, the AI assigns high probability to hypotheses like “the users can see the message even before I send it through the noisy channel”.