I’m confused. If the AI knows a million digits of pi, and it can prevent Omega from counterfactually mugging me where it knows I will lose money… shouldn’t it try to prevent that from happening? That seems like the right behavior to me. Similarly, if I knew that the AI knows a million digits of pi, then if it gets counterfactually mugged, it shouldn’t give up the money.
If you don’t think one should pay up in counterfactual mugging in general, then my argument won’t land. Rather than arguing that you want to be counterfactually mugged, I’ll try to argue from a different decision problem.
Suppose that Omega is running a fairly simple and quick algorithm which is nonetheless able to predict an AI with more processing power, due to using a stronger logic or similar tricks. Omega will put either $10 or $1000 in a box. Our AI can press a button on the box to get either all or half of the money inside. Omega puts in $1000 if it predicts that our AI will take half the money; otherwise, it puts in $10.
We suppose that, since there is a short proof of exactly what Omega does, it is already present in the mathematical database included in the AI’s prior.
If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is—taking less money just has a lower expected utility. So, it will get only $10 from Omega.
If the AI is a policy-approval agent, it will think about what would have a higher expectation in the human’s expectation: taking half, or taking it all. It’s quite possible in this case that it takes all the money.
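To make the payoffs in this box problem concrete, here is a minimal sketch. It assumes Omega's prediction is always correct, and the function names (`omega_fill`, `payout`, `policy_value`, `best_action_given_known_box`) are made up for illustration; they are not from the post.

```python
# Sketch of the box problem, assuming Omega predicts the agent perfectly.
# All names here are hypothetical illustrations, not from the post.

def omega_fill(predicted_action: str) -> int:
    """Omega's rule: $1000 if it predicts 'half', otherwise $10."""
    return 1000 if predicted_action == "half" else 10


def payout(action: str, box: int) -> float:
    """Money the agent receives for an action, given the box contents."""
    return box / 2 if action == "half" else float(box)


def policy_value(action: str) -> float:
    """Value of a policy when the box contents depend on the policy itself."""
    return payout(action, omega_fill(action))


def best_action_given_known_box(box: int) -> str:
    """Choice of an agent that treats the box contents as already fixed."""
    return max(["half", "all"], key=lambda a: payout(a, box))


print(policy_value("half"))               # 500.0
print(policy_value("all"))                # 10.0
print(best_action_given_known_box(10))    # all
print(best_action_given_known_box(1000))  # all
```

Evaluated as a policy, "half" is worth $500 and "all" is worth $10. But once the box contents are treated as a known constant, "all" wins for either amount, which is the failure described above.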
(Perhaps the argument is that as long as Omega was uncertain about the digit when deciding what game to propose, then you should pay up as necessary, regardless of what you know. But if that’s the argument, then why can’t the AI go through the same reasoning?)
That is part of the argument for paying up in counterfactual mugging, yes. But both we and Omega need to be uncertain about the digit, since if our prior can already predict that Omega is going to ask us for $10 rather than give us any money, there’s no reason for us to pay up. So, it depends on the prior, and can turn out differently depending on whether our prior or the agent’s prior is used.
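The dependence on the prior can be sketched numerically. The $10 cost is from the discussion above; the $1000 reward in the other branch and the function names are assumptions chosen only for illustration.

```python
# Sketch: expected value of the "pay up" policy under different priors.
# The reward amount and all names are hypothetical, not from the post.

def ev_of_policy(pays: bool, p_ask: float,
                 cost: float = 10.0, reward: float = 1000.0) -> float:
    """Expected value of a policy under a prior that gives probability
    p_ask to the branch where Omega asks for money."""
    ask_branch = -cost if pays else 0.0
    # In the other branch, Omega rewards only agents that would have paid.
    give_branch = reward if pays else 0.0
    return p_ask * ask_branch + (1 - p_ask) * give_branch


def prefers_paying(p_ask: float) -> bool:
    """True if the paying policy has higher expected value under this prior."""
    return ev_of_policy(True, p_ask) > ev_of_policy(False, p_ask)


print(ev_of_policy(True, 0.5))   # 495.0  (prior uncertain about the digit)
print(ev_of_policy(True, 1.0))   # -10.0  (prior already knows Omega will ask)
print(prefers_paying(0.5))       # True
print(prefers_paying(1.0))       # False
```

The same policy is evaluated twice; only the prior changes. A prior that is uncertain about the digit favors paying, while a prior that already predicts the "ask" branch does not.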
If the AI knows the winning numbers for the lottery, then it should buy that ticket for me, even though (if I don’t know that the AI knows the winning numbers) I would disprefer that action.
If I think that the AI tends to be miscalibrated about lottery-ticket beliefs, there is no reason for me to want it to buy the ticket. If I think it is calibrated about lottery-ticket beliefs, I’ll like the policy of buying lottery tickets in such cases, so the AI will buy.
You could argue that an AI which is trying to be helpful will buy lottery tickets in such cases no matter how deluded the humans think it is. But, not only is this not very corrigible behavior, but also it doesn’t make any sense from our perspective to make an AI reason in that way: we don’t want the AI to act in ways which we have good reason to believe are unreliable.
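One rough way to picture the lottery case: the human evaluates the policy "buy when the AI claims to know the numbers" using the human's own credence that such claims are right. The ticket price, jackpot, and credence values below are invented for illustration, as are the function names.

```python
# Sketch: does the human approve of the "buy the ticket" policy?
# Ticket price, jackpot, and credences are hypothetical numbers.

def human_ev_of_buying(trust_in_ai: float,
                       ticket_cost: float = 2.0,
                       jackpot: float = 1_000_000.0) -> float:
    """Human's expected value of the policy, where trust_in_ai is the
    human's credence that the AI's claimed winning numbers are correct."""
    return trust_in_ai * jackpot - ticket_cost


def human_approves(trust_in_ai: float) -> bool:
    """The policy is approved when its expected value, by the human's lights, is positive."""
    return human_ev_of_buying(trust_in_ai) > 0


print(human_approves(1e-8))  # False: human thinks the AI is deluded
print(human_approves(0.9))   # True: human thinks the AI is calibrated
```

In this toy version the AI's own confidence never appears; approval turns entirely on how reliable the human believes the AI to be.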
ETA: My strawman-ML-version of your argument is that you would prefer imitation learning instead of inverse reinforcement learning (which differ when the AI and human know different things). This seems wrong to me.
The analogy isn’t perfect, since the AI can still do things to maximize human approval which the human would never have thought of, as well as things which the human could think of but didn’t have the computational resources to do. It does seem like a fairly good analogy, though.
Okay, I think I misunderstood what you were claiming in this post. Based on the following line:
I claimed that an agent which learns your utility function (pretending for a moment that “your utility function” really is a well-defined thing) and attempts to optimize it is still not perfectly aligned with you.
I thought you were arguing, “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you.” (Yes, having written it down I can see that is not what you actually said, but that’s the interpretation I originally ended up with.)
I would now rephrase your claim as “Even assuming we know the true utility function, optimizing it is hard.”
Examples:
You could argue that an AI which is trying to be helpful will buy lottery tickets in such cases no matter how deluded the humans think it is. But, not only is this not very corrigible behavior, but also it doesn’t make any sense from our perspective to make an AI reason in that way: we don’t want the AI to act in ways which we have good reason to believe are unreliable.
Yeah, an AI that optimizes the true utility function probably won’t be corrigible. From a theoretical standpoint, that seems fine—corrigibility seems like an easier target to shoot for, not a necessary aspect of an aligned AI. The reason we don’t want the scenario above is “we have good reason to believe [the AI is] unreliable”, which sounds like the AI is failing to optimize the utility function correctly.
If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is—taking less money just has a lower expected utility. So, it will get only $10 from Omega.
If the AI is a policy-approval agent, it will think about what would have a higher expectation in the human’s expectation: taking half, or taking it all. It’s quite possible in this case that it takes all the money.
This also sounds like the value-learning agent is simply bad at correctly optimizing the true utility function. (It seems to me that all of decision theory is about how to properly optimize a utility function in theory.)
We can go to the opposite extreme, and make P_R a broad prior such as the Solomonoff distribution, with no information about our world in particular.
I believe the observation has been made before that running UDT on such a prior could have weird results.
Again, seems like this proposal for making an aligned AI is just bad at optimizing the true utility function.
So I guess the way I would summarize this post:
Value learning is hard.
Even if you know the correct utility function, optimizing it is hard.
Instead of trying to value learn and then optimize, just go straight for the policy instead, which is safer than relying on accurately decomposing a human into two different things that are both difficult to learn and have weird interactions with each other.
I thought you were arguing, “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you.” (Yes, having written it down I can see that is not what you actually said, but that’s the interpretation I originally ended up with.)
I would correct it to “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this in expectation according to some prior is still not aligned with you.”
I would now rephrase your claim as “Even assuming we know the true utility function, optimizing it is hard.”
This part is tricky for me to interpret.
On the one hand, yes: specifically, even if you have all the processing power you need, you still need to optimize via a particular prior (AIXI optimizes via Solomonoff induction) since you can’t directly see what the consequences of your actions will be. So, I’m specifically pointing at an aspect of “optimizing it is hard” which is about having a good prior. You could say that “utility” is the true target, and “expected utility” is the proxy which you have to use in decision theory.
On the other hand, this might be a misleading way of framing the problem. It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.
If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.
The Jeffrey-Bolker rotation (mentioned in the post) gives me some reason to think of the prior and the utility function as one object, so that it doesn’t make sense to think about “the true human utility function” in isolation. None of my choice behavior (be it revealed preferences or verbally claimed preferences, etc.) can differentiate between me assigning small probability to a set of possibilities (but caring moderately about what happens in those possibilities) and assigning a moderate probability (but caring very little what happens one way or another in those worlds). So, I’m not even sure it is sensible to think of U_H alone as capturing human preferences; maybe U_H doesn’t really make sense apart from P_H.
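The indistinguishability point can be illustrated with a much simpler trick than the actual Jeffrey-Bolker rotation: multiply each world's probability by a positive weight, divide the utilities in that world by the same weight, and renormalize. This is only a simplified stand-in (it uses world-dependent utilities and is not the rotation itself), and the worlds, actions, and numbers are invented.

```python
# Sketch: trading probability against utility without changing choices.
# This is a simplified per-world rescaling, NOT the Jeffrey-Bolker rotation.
# Worlds, actions, and numbers are hypothetical.

def expected_utilities(p, u):
    """p maps world -> probability; u maps action -> world -> utility."""
    return {a: sum(p[w] * u[a][w] for w in p) for a in u}


def rescale(p, u, c):
    """Scale each world's probability by c[w] > 0 (then renormalize)
    and divide utilities in that world by c[w]."""
    z = sum(p[w] * c[w] for w in p)
    p2 = {w: p[w] * c[w] / z for w in p}
    u2 = {a: {w: u[a][w] / c[w] for w in p} for a in u}
    return p2, u2, z


def ranking(eu):
    """Actions sorted from highest to lowest expected utility."""
    return sorted(eu, key=eu.get, reverse=True)


p = {"likely": 0.9, "unlikely": 0.1}
u = {
    "act_a": {"likely": 1.0, "unlikely": 5.0},
    "act_b": {"likely": 2.0, "unlikely": -10.0},
}
c = {"likely": 1.0, "unlikely": 4.0}  # make "unlikely" more probable, less important

p2, u2, z = rescale(p, u, c)
eu, eu2 = expected_utilities(p, u), expected_utilities(p2, u2)
print(ranking(eu) == ranking(eu2))  # True
```

Every expected utility is divided by the same positive constant `z`, so the ordering of actions cannot change. An observer who only sees choices has no way to tell the two (probability, utility) pairs apart.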
So, to summarize,
1. I agree that “even assuming we know the true utility function, optimizing it is hard”—but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.
2. Even under the idealized assumption that humans are perfectly coherent decision-theoretic agents, I’m not sure it makes sense to say there’s a “true human utility function”: the VNM theorem only gets a U_H which is unique up to such-and-such by assuming a fixed notion of probability. The Jeffrey-Bolker representation theorem, which justifies rational agents having probability and utility functions in one theorem rather than justifying the two independently, shows that we can do this “rotation” which shifts which part of the preferences are represented in the probability vs in the utility, without changing the underlying preferences.
3. If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which part of preferences are “probability” vs “utility”.
It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.
If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.
Sure. I was claiming that it is also a reasonable notion of alignment. My reason for not using that notion of alignment is that it doesn’t seem practically realizable.
However, if we could magically give the AI the “true universe” prior with the “true utility function”, I would be happy and say we were done, even if it wasn’t justifiable and couldn’t explain it to humans. I agree it would not be aligned in the sense of the post.
So, I’m not even sure it is sensible to think of U_H alone as capturing human preferences; maybe U_H doesn’t really make sense apart from P_H.
This seems to argue that if my AI knew the winning lottery numbers, but didn’t have a chance to tell me how it knows this, then it shouldn’t buy the winning lottery ticket. I agree the Jeffrey-Bolker rotation seems to indicate that we should think of probutilities instead of probabilities and utilities separately, but it seems like there really are some very clear actual differences in the real world, and we should account for it somehow. Perhaps one difference is that probabilities change in response to new information, whereas (idealized) utility functions don’t. (Obviously humans don’t have idealized utility functions, but this is all a theoretical exercise anyway.)
I agree that “even assuming we know the true utility function, optimizing it is hard”—but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.
Thanks for clarifying, that’s clearer to me now.
If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which part of preferences are “probability” vs “utility”.
I generally agree with the objective you propose (for practical reasons). The obvious way to do this is to do imitation learning, where (to a first approximation) you just copy the human’s policy. (Or alternatively, have the policy that a human would approve of you having.) This won’t let you exceed human intelligence, which seems like a pretty big problem. Do you expect an AI using policy alignment to do better than humans at tasks? If so, how is it doing better? My normal answer to this in the EV framework is “it has better estimates of probabilities of future states”, but we can’t do that any more. Perhaps you’re hoping that the AI can explain its plan to a human, and the human will then approve of it even though they wouldn’t have before the explanation. In that case, the human’s probutilities have changed, which means that policy alignment is now “alignment to a thing that I can manipulate”, which seems bad.
Fwiw, I am generally in favor of approaches along the lines of policy alignment; I’m just more confused about the theory behind it here.
I’m not even sure whether you are closer or further from understanding what I meant, now. I think you are probably closer, but stating it in a way I wouldn’t. I see that I need to do some careful disambiguation of background assumptions and language.
Instead of trying to value learn and then optimize, just go straight for the policy instead, which is safer than relying on accurately decomposing a human into two different things that are both difficult to learn and have weird interactions with each other.
This part, at least, is getting at the same intuition I’m coming from. However, I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn’t end up talking much about directly learning the policies. (I am thinking I’ll write another post to make that connection clearer.)
I will have to think harder about the difference between how you’re framing things and how I would frame things, to try to clarify more.
I’m not even sure whether you are closer or further from understanding what I meant, now.
:(
I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn’t end up talking much about directly learning the policies.
My assumption was that you were arguing for why learning policies directly (assuming we could do it) has advantages over the default approach of value learning + optimization. That framing seems to explain most of the post.
Separately, I still don’t understand the counterfactual mugging case. (Disclaimer, I haven’t gone through any math around counterfactual mugging.) It seems really strange that if the human was certain about the digit, they wouldn’t pay up, but if the human is uncertain about the digit but is certain that the AI knows the digit, then the human would not want the AI to intervene. But possibly it’s not worth getting into this detail.
Omega will put either $10 or $1000 in a box. Our AI can press a button on the box to get either all or half of the money inside. Omega puts in $1000 if it predicts that our AI will take half the money; otherwise, it puts in $10.
We suppose that, since there is a short proof of exactly what Omega does, it is already present in the mathematical database included in the AI’s prior.
If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is—taking less money just has a lower expected utility. So, it will get only $10 from Omega.
If the AI is a policy-approval agent, it will think about what would have a higher expectation in the human’s expectation: taking half, or taking it all. It’s quite possible in this case that it takes all the money.
I think assuming that you have access to the proof of what Omega does means that you have already determined your own behavior. Presumably, “what Omega does” depends on your own policy, so if you have a proof about what Omega does, that proof also determines your action, and there is nothing left for the agent to consider.
To be clear, I think it’s reasonable to consider AIs that try to figure out proofs of “what Omega does”, but if that’s taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does. And if it’s not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.
I think assuming that you have access to the proof of what Omega does means that you have already determined your own behavior.
You may not recognize it as such, especially if Omega is using a different axiom system than you. So, you can still be ignorant of what you’ll do while knowing what Omega’s prediction of you is. This makes it impossible for your probability distribution to treat the two as correlated anymore.
but if that’s taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does
Yeah, that’s the problem here.
And if it’s not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.
Only if the agent takes that one proof out of the prior, but still has enough structure in the prior to see how the decision problem plays out. This is the problem of constructing a thin prior. You can (more or less) solve any decision problem by making the agent sufficiently updateless, but you run up against the problem of making it too updateless, at which point it behaves in absurd ways (lacking enough structure to even understand the consequences of policies correctly).
Hence the intuition that the correct prior to be updateless with respect to is the human one (which is, essentially, the main point of the post).
Is this right?
Did you mean to say that it’s quite possible that it takes half the money?