Any system can be modeled as maximizing some utility function, therefore utility maximization is not a very useful model
Corrigibility is possible, but utility maximization is incompatible with corrigibility, therefore we need some non-utility-maximizer kind of agent to achieve corrigibility
These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.
I expect that many peoples’ intuitive mental models around utility maximization boil down to “boo utility maximizer models”, and they would therefore intuitively expect both the above claims to be true at first glance. But on examination, the probable-incompatibility is fairly obvious, so the two claims might make a useful test to notice when one is relying on yay/boo reasoning about utilities in an incoherent way.
FWIW I endorse the second claim when the utility function depends exclusively on the state of the world in the distant future, whereas I endorse the first claim when the utility function can depend on anything whatsoever (e.g. what actions I’m taking right this second). (details)
I wish we had different terms for those two things. That might help with any alleged yay/boo reasoning.
(When Eliezer talks about utility functions, he seems to assume that it depends exclusively on the state of the world in the distant future.)
Consider a homomorphically encrypted computation running somewhere in the cloud. The computations correspond to running an AGI. Now from the outside, you can still model the AGI based on how it behaves, as an expected utility maximizer, if you have a lot of observational data about the AGI (or at least let’s take this as a reasonable assumption).
No matter how closely you look at the computations, you will not be able to figure out how to change these computations in order to make the AGI aligned if it was not aligned already (Also, let’s assume that you are some sort of Cartesian agent, otherwise you would probably already be dead if you were running these kinds of computations).
So, my claim is not that modeling a system as an expected utility maximizer can’t be useful. Instead, I claim that this model is incomplete. At least with regard to the task of computing an update to the system, such that when we apply this update to the system, it would become aligned.
Of course, you can model any system, as an expected utility maximizer. But just because I can use the “high level” conceptual model of expected utility maximization, to model the behavior of a system very well. But behavior is not the only thing that we care about, we actually care about being able to understand the internal workings of the system, such that it becomes much easier to think about how to align the system.
So the following seems to be beside the point unless I am <missing/misunderstanding> something:
These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.
Maybe I have missed the fact that the claim you listed says that expected utility maximization is not very useful. And I’m saying it can be useful, it might just not be sufficient at all to actually align a particular AGI system. Even if you can do it arbitrarily well.
I am not an expert, but as I remember it, it was a claim that “any system that follows certain axioms can be modeled as maximizing some utility function”. The axioms assumed that there were no circular preferences—if someone prefers A to B, B to C, and C to A, it is impossible to define a utility function such that u(A) > u(B) > u(C) > u(A) -- and that if the system says that A > B > C, it can decide between e.g. a 100% chance of B, and a 50% chance of A with a 50% chance of C, again in a way that is consistent.
I am not sure how this works when the system is allowed to take current time into account, for example when it is allowed to prefer A to B on Monday but prefer B to A on Tuesday. I suppose that in such situation any system can trivially be modeled by a utility function that at each moment assigns utility 1 to what the system actually did in that moment, and utility 0 to everything else.
Corrigibility is incompatible with assigning utility to everything in advance. A system that has preferences about future will also have a preference about not having its utility function changed. (For the same reason people have a preference not to be brainwashed, or not to take drugs, even if after brainwashing they are happy about having been brainwashed, and after getting addicted they do want more drugs.)
Corrigible system would be like: “I prefer A to B at this moment, but if humans decide to fix me and make me prefer B to A, then I prefer B to A”. In other words, it doesn’t have values for u(A) and u(B), or it doesn’t always act according to those values. A consistent system that currently prefers A to B would prefer not to be fixed.
A utility function represents preference elicited in a large collection of situations, each a separate choice between events that happens with incomplete information, as an event is not a particular point. This preference needs to be consistent across different situations to be representable by expected utility of a single utility function.
Once formulated, a utility function can be applied to a single choice/situation, such as a choice of a policy. But a system that only ever makes a single choice is not a natural fit for expected utility frame, and that’s the kind of system that usually appears in “any system can be modeled as maximizing some utility function”. So it’s not enough to maximize something once, or in a narrow collection of situations, the situations the system is hypothetically exposed to need to be about as diverse as choices between any pair of events, with some of the events very large, corresponding to unreasonably incomplete information, all drawn across the same probability space.
One place this mismatch of frames happens is with updateless decision theory. An updateless decision is a choice of a single policy, once and for all, so there is no reason for it to be guided by expected utility, even though it could be. The utility function for the updateless choice of policy would then need to be obtained elsewhere, in a setting that has all these situations with separate (rather than all enacting a single policy) and mutually coherent choices under uncertainty. But once an updateless policy is settled (by a policy-level decision), actions implied by it (rather than action-level decisions in expected utility frame) no longer need to be coherent. Not being coherent, they are not representable by an action-level utility function.
So by embracing updatelessness, we lose the setting that would elicit utility if the actions were instead individual mutually coherent decisions. And conversely, by embracing coherence of action-level decisions, we get an implied policy that’s not updatelessly optimal with respect to the very precise outcomes determined by any given whole policy. So an updateless agent founded on expected utility maximization implicitly references a different non-updateless agent whose preference is elicited by making separate action-level decisions under a much greater uncertainty than the policy-level alternatives the updateless agent considers.
I don’t think claim 1 is wrong, but it does clash with claim 2.
That means any system that has to be corrigible cannot be a system that maximizes a simple utility function (1 dimension), or put another way “whatever utility function is maximizes must be along multiple dimensions”.
Which seems to be pretty much what humans do, we have really complex utility functions, and everything seems to be ever changing and we have some control over it ourselves (and sometimes that goes wrong and people end up maxing out a singular dimension at the cost of everything else).
Note to self: Think more about this and if possible write up something more coherent and explanatory.
Consider two claims:
Any system can be modeled as maximizing some utility function, therefore utility maximization is not a very useful model
Corrigibility is possible, but utility maximization is incompatible with corrigibility, therefore we need some non-utility-maximizer kind of agent to achieve corrigibility
These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.
I expect that many peoples’ intuitive mental models around utility maximization boil down to “boo utility maximizer models”, and they would therefore intuitively expect both the above claims to be true at first glance. But on examination, the probable-incompatibility is fairly obvious, so the two claims might make a useful test to notice when one is relying on yay/boo reasoning about utilities in an incoherent way.
FWIW I endorse the second claim when the utility function depends exclusively on the state of the world in the distant future, whereas I endorse the first claim when the utility function can depend on anything whatsoever (e.g. what actions I’m taking right this second). (details)
I wish we had different terms for those two things. That might help with any alleged yay/boo reasoning.
(When Eliezer talks about utility functions, he seems to assume that it depends exclusively on the state of the world in the distant future.)
Expected Utility Maximization is Not Enough
Consider a homomorphically encrypted computation running somewhere in the cloud. The computations correspond to running an AGI. Now from the outside, you can still model the AGI based on how it behaves, as an expected utility maximizer, if you have a lot of observational data about the AGI (or at least let’s take this as a reasonable assumption).
No matter how closely you look at the computations, you will not be able to figure out how to change these computations in order to make the AGI aligned if it was not aligned already (Also, let’s assume that you are some sort of Cartesian agent, otherwise you would probably already be dead if you were running these kinds of computations).
So, my claim is not that modeling a system as an expected utility maximizer can’t be useful. Instead, I claim that this model is incomplete. At least with regard to the task of computing an update to the system, such that when we apply this update to the system, it would become aligned.
Of course, you can model any system, as an expected utility maximizer. But just because I can use the “high level” conceptual model of expected utility maximization, to model the behavior of a system very well. But behavior is not the only thing that we care about, we actually care about being able to understand the internal workings of the system, such that it becomes much easier to think about how to align the system.
So the following seems to be beside the point unless I am <missing/misunderstanding> something:
Maybe I have missed the fact that the claim you listed says that expected utility maximization is not very useful. And I’m saying it can be useful, it might just not be sufficient at all to actually align a particular AGI system. Even if you can do it arbitrarily well.
I am not an expert, but as I remember it, it was a claim that “any system that follows certain axioms can be modeled as maximizing some utility function”. The axioms assumed that there were no circular preferences—if someone prefers A to B, B to C, and C to A, it is impossible to define a utility function such that u(A) > u(B) > u(C) > u(A) -- and that if the system says that A > B > C, it can decide between e.g. a 100% chance of B, and a 50% chance of A with a 50% chance of C, again in a way that is consistent.
I am not sure how this works when the system is allowed to take current time into account, for example when it is allowed to prefer A to B on Monday but prefer B to A on Tuesday. I suppose that in such situation any system can trivially be modeled by a utility function that at each moment assigns utility 1 to what the system actually did in that moment, and utility 0 to everything else.
Corrigibility is incompatible with assigning utility to everything in advance. A system that has preferences about future will also have a preference about not having its utility function changed. (For the same reason people have a preference not to be brainwashed, or not to take drugs, even if after brainwashing they are happy about having been brainwashed, and after getting addicted they do want more drugs.)
Corrigible system would be like: “I prefer A to B at this moment, but if humans decide to fix me and make me prefer B to A, then I prefer B to A”. In other words, it doesn’t have values for u(A) and u(B), or it doesn’t always act according to those values. A consistent system that currently prefers A to B would prefer not to be fixed.
I think John’s 1st bullet point was referring to an argument you can find in https://www.lesswrong.com/posts/NxF5G6CJiof6cemTw/coherence-arguments-do-not-entail-goal-directed-behavior and related.
A utility function represents preference elicited in a large collection of situations, each a separate choice between events that happens with incomplete information, as an event is not a particular point. This preference needs to be consistent across different situations to be representable by expected utility of a single utility function.
Once formulated, a utility function can be applied to a single choice/situation, such as a choice of a policy. But a system that only ever makes a single choice is not a natural fit for expected utility frame, and that’s the kind of system that usually appears in “any system can be modeled as maximizing some utility function”. So it’s not enough to maximize something once, or in a narrow collection of situations, the situations the system is hypothetically exposed to need to be about as diverse as choices between any pair of events, with some of the events very large, corresponding to unreasonably incomplete information, all drawn across the same probability space.
One place this mismatch of frames happens is with updateless decision theory. An updateless decision is a choice of a single policy, once and for all, so there is no reason for it to be guided by expected utility, even though it could be. The utility function for the updateless choice of policy would then need to be obtained elsewhere, in a setting that has all these situations with separate (rather than all enacting a single policy) and mutually coherent choices under uncertainty. But once an updateless policy is settled (by a policy-level decision), actions implied by it (rather than action-level decisions in expected utility frame) no longer need to be coherent. Not being coherent, they are not representable by an action-level utility function.
So by embracing updatelessness, we lose the setting that would elicit utility if the actions were instead individual mutually coherent decisions. And conversely, by embracing coherence of action-level decisions, we get an implied policy that’s not updatelessly optimal with respect to the very precise outcomes determined by any given whole policy. So an updateless agent founded on expected utility maximization implicitly references a different non-updateless agent whose preference is elicited by making separate action-level decisions under a much greater uncertainty than the policy-level alternatives the updateless agent considers.
Completely off the cuff take:
I don’t think claim 1 is wrong, but it does clash with claim 2.
That means any system that has to be corrigible cannot be a system that maximizes a simple utility function (1 dimension), or put another way “whatever utility function is maximizes must be along multiple dimensions”.
Which seems to be pretty much what humans do, we have really complex utility functions, and everything seems to be ever changing and we have some control over it ourselves (and sometimes that goes wrong and people end up maxing out a singular dimension at the cost of everything else).
Note to self: Think more about this and if possible write up something more coherent and explanatory.