Why do we assume that any AGI can meaningfully be described as a utility maximizer?
Humans are some of the most intelligent structures that exist, and we don’t seem to fit that model very well. In fact, it seems the entire point of Rationalism is to improve our ability to do this, and that has only been achieved with mixed success.
Organisations of humans (e.g. USA, FDA, UN) have even more computational power and don’t seem to be doing much better.
Perhaps intelligences (artificial or natural) cannot necessarily, or even typically, be described as optimisers? Instead we could only model them as algorithms, or as collections of tools/behaviours executed in some pattern.
An AGI that was not a utility maximizer would make more progress towards whatever goals it had if it modified itself to become a utility maximizer. Three exceptions are if (1) the AGI has a goal of not being a utility maximizer, (2) the AGI has a goal of not modifying itself, (3) the AGI thinks it will be treated better by other powerful agents if it is not a utility maximizer.
Would humans, or organizations of humans, make more progress towards whatever goals they have if they modified themselves to become utility maximizers? If so, why don’t they? If not, why would an AGI?
What would it mean to modify oneself to become a utility maximizer? What would it mean for the US, for example? The only meaning I can imagine is that one individual (for the sake of argument, we assume that this individual is already a utility maximizer) enforces his will on everyone else. Would that help the US make more progress towards its goals? Do countries that are closer to utility maximizers, like North Korea, make more progress towards their goals?
A human seeking to become a utility maximizer would read LessWrong and try to become more rational. Groups of people are not utility maximizers, as their collective preferences might not even be transitive. If the goal of North Korea is to keep the Kim family in power, then the country being a utility maximizer does seem to help.
A human who wants to do something specific would be far better off studying and practicing that thing than generic rationality.
This depends on how far outside that human’s current capabilities, and that human’s society’s state of knowledge, that thing is. For playing basketball in the modern world, sure, it makes no sense to study physics and calculus, it’s far better to find a coach and train the skills you need. But if you want to become immortal and happen to live in ancient China, then studying and practicing “that thing” looks like eating specially-prepared concoctions containing mercury and thereby getting yourself killed, whereas studying generic rationality leads to the whole series of scientific insights and industrial innovations that make actual progress towards the real goal possible.
Put another way: I think the real complexity is hidden in your use of the phrase “something specific.” If you can concretely state and imagine what the specific thing is, then you probably already have the context needed for useful practice. It’s in figuring out that context, in order to be able to so concretely state what more abstractly stated ‘goals’ really imply and entail, that we need more general and flexible rationality skills.
If you want to be good at something specific that doesn’t exist yet, you need to study the relevant area of science, which is still more specific than rationality.
Assuming the relevant area of science already exists, yes. Recurse as needed, and there is some level of goal for which generic rationality is a highly valuable skillset. Where that level is, depends on personal and societal context.
That’s quite different from saying rationality is a one size fits all solution.
Efficiency at utility maximisation, like any other kind of efficiency, relates to available resources. One upshot of that is that an entity might already be doing as well as it realistically can, given its resources. Another is that humans don’t necessarily benefit from rationality training... as is also suggested by the empirical evidence.
Edit: Another is that a resource-rich but inefficient entity can beat a small efficient one, so efficiency, AKA utility maximization, doesn’t always win out.
When you say the AGI has a goal of not modifying itself, do you mean that the AGI has a goal of not modifying its goals? Because that assumption seems to be fairly prevalent.
I meant “not modifying itself”, which would include not modifying its goals, if an AGI without a utility function can be said to have goals.
This is an excellent question. I’d say the main reason is that all of the AI/ML systems that we have built to date are utility maximizers; that’s the mathematical framework in which they have been designed. Neural nets / deep-learning work by using a simple optimizer to find the minimum of a loss function via gradient descent. Evolutionary algorithms, simulated annealing, etc. find the minimum (or maximum) of a “fitness function”. We don’t know of any other way to build systems that learn.
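To make that concrete, here is a minimal sketch of the kind of loop being described: a simple optimizer finding the minimum of a loss function via gradient descent. The quadratic loss, starting point, and learning rate are invented for illustration; in real deep learning the loss is computed over data and the gradients come from backpropagation.

```python
# Minimal sketch (illustrative numbers only) of gradient descent
# minimizing a loss function.

def loss(w):
    return (w - 3.0) ** 2           # toy loss, minimized at w = 3

def grad(w):
    return 2.0 * (w - 3.0)          # analytic gradient of the toy loss

w = 0.0                             # initial parameter value
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * grad(w)    # the gradient descent update

print(w, loss(w))                   # w converges towards 3, loss towards 0
```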
Humans themselves evolved to maximize reproductive fitness: that is our primary fitness function, but our genes have encoded a variety of secondary functions which (over evolutionary time) have been correlated with reproductive fitness. Our desires for love, friendship, happiness, etc. fall into this category. Our brains mainly work to satisfy these secondary functions; the brain gets electrochemical reward signals, controlled by our genes, in the form of pain/pleasure/satisfaction/loneliness etc. These secondary functions may or may not remain aligned with the primary fitness function, which is why practitioners sometimes talk about “mesa-optimizers” or “inner vs outer alignment.”
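To illustrate the point about secondary functions drifting out of alignment, here is a toy sketch, with invented foods and numbers, of a proxy objective (sweetness) that tracks the outer objective (nutrition, standing in for fitness) in the environment it was selected in, but comes apart from it when the environment changes:

```python
# Toy illustration of inner/outer misalignment: an agent optimizes a proxy
# (sweetness) which correlated with the "outer" objective (nutrition) in
# the ancestral environment. All foods and numbers are invented.

# food: (sweetness, nutrition)
ancestral_foods = {"berries": (5, 6), "roots": (1, 4), "honey": (8, 7)}
modern_foods    = {"berries": (5, 6), "salad": (1, 7), "soda": (10, 0)}

def best(foods, objective_index):
    # pick the food that maximizes the chosen objective
    return max(foods, key=lambda f: foods[f][objective_index])

for label, foods in [("ancestral", ancestral_foods), ("modern", modern_foods)]:
    print(label,
          "| proxy (sweetness) picks:", best(foods, 0),
          "| outer objective (nutrition) prefers:", best(foods, 1))

# In the ancestral set both objectives pick the same food (honey); in the
# modern set the proxy picks soda while nutrition prefers salad.
```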
Agreed. Humans are constantly optimizing a reward function, but it sort of ‘changes’ from moment to moment in a near-focal way, so it often looks irrational or self-defeating. Once you know what the reward function is, though, the goal-directedness is easy to see too.
Sune seems to think that humans are more intelligent than they are goal-directed. I’m not sure this is true; human truthseeking processes seem about as flawed and limited as their goal-pursuit. Maybe you can argue that humans are not generally intelligent or rational, but I don’t think you can justify setting the goalposts so that they’re one of those things and not the other.
You might be able to argue that human civilization is intelligent but not rational, and that functioning AGI will be more analogous to ecosystems of agents than to one unified agent. If you can argue for that, that’s interesting, but I don’t know where to go from there. Civilizations tend towards increasing unity over time (the continuous reduction in energy wasted on conflict). I doubt that the goals they converge on together will be a form of human-favoring altruism. I haven’t seen anyone try to argue for that in a rigorous way.
Doesn’t this become tautological? If the reward function changes from moment to moment, then the reward function can just be whatever explains the behaviour.
Since everything can fit into the “agent with utility function” model given a sufficiently crumpled utility function, I guess I’d define “is an agent” as “goal-directed planning is useful for explaining a large enough part of its behavior.” This includes humans while excluding bacteria. (Hmm, unless, like me, one knows so little about bacteria that it’s better to just model them as weak agents. Puzzling.)
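One way to see why the model fits everything: for any behaviour at all, you can define a “crumpled” utility function that assigns 1 to whatever the system actually does and 0 to everything else, and the behaviour then counts as utility maximization. A toy sketch, with a deliberately arbitrary hypothetical policy:

```python
# Sketch of the "sufficiently crumpled utility function" point: any fixed
# behaviour can be dressed up as utility maximization by building the
# utility function out of the behaviour itself. The policy below is
# arbitrary and purely hypothetical.

arbitrary_policy = {"sunny": "stay home", "raining": "go jogging"}

def crumpled_utility(state, action):
    # utility 1 for whatever the policy happened to do, 0 for anything else
    return 1.0 if arbitrary_policy[state] == action else 0.0

actions = ["stay home", "go jogging"]
for state in arbitrary_policy:
    best_action = max(actions, key=lambda a: crumpled_utility(state, a))
    assert best_action == arbitrary_policy[state]  # "maximizing" just replays the policy
```

Which is why the “is an agent” label only earns its keep when a reasonably simple utility function plus planning predicts a large part of the behaviour.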
On the other hand, the development of religion, morality, and universal human rights also seem to be a product of civilization, driven by the need for many people to coordinate and coexist without conflict. More recently, these ideas have expanded to include laws that establish nature reserves and protect animal rights. I personally am beginning to think that taking an ecosystem/civilizational approach with mixture of intelligent agents, human, animal, and AGI, might be a way to solve the alignment problem.
Does the inner / outer distinction complicate the claim that all current ML systems are utility maximizers? The gradient descent algorithm performs a simple kind of optimization in the training phase. But once the model is trained and in production, it doesn’t seem obvious that the “utility maximizer” lens is always helpful in understanding its behavior.
(I assume you are asking “why do we assume the agent has a coherent utility function?” rather than “why do we assume the agent tries to maximize its utility?”)
Agents like humans which don’t have such a nice utility function:
Are vulnerable to money pumping
Can notice that problem and try to repair themselves
Note that humans do in practice try to repair ourselves, for example by smashing down our own emotions in order to be more productive. But we don’t have access to our source code, so we’re not so good at it.
I think that if the AI can’t repair that part of itself and is still vulnerable to money pumping, then it’s not the AGI we’re afraid of.
Adding: my opinion comes from this MIRI/Yudkowsky talk; I linked to the relevant place, and he speaks about this in the next 10-15 minutes or so of the video.
Yes you can. One mathy example is in the source I mentioned in my subcomment (sorry for not linking again, I’m on mobile). Another is gambling I guess? And probably other addictions too?
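To illustrate the money-pumping point from the list above: an agent whose preferences are cyclic (and hence not representable by any utility function) will pay a small fee for each trade it sees as an upgrade and can be walked around the cycle indefinitely. A toy simulation, with an invented preference cycle and fee:

```python
# Toy money-pump simulation: an agent with cyclic preferences
# (apple > banana > cherry > apple) pays a small fee for every trade it
# considers an upgrade, so it can be led around the cycle forever.
# The goods, the cycle, and the fee are invented for illustration.

prefers = {("apple", "banana"), ("banana", "cherry"), ("cherry", "apple")}

def accepts_trade(current, offered):
    return (offered, current) in prefers       # accept if the offer is "preferred"

holding, money = "apple", 10.0
fee = 1.0
offers = ["cherry", "banana", "apple"] * 3     # walk the agent around the cycle

for offered in offers:
    if accepts_trade(holding, offered) and money >= fee:
        holding, money = offered, money - fee  # each "upgrade" costs a fee

print(holding, money)   # back to an apple, but most of the money is gone
```

With a coherent (transitive) utility function over the three goods, at least one of those trades would be refused and the loop would close.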
Excellent question! I’ve added a slightly reworded version of this to Stampy (focusing on superintelligence rather than AGI, as it’s pretty likely that we can get weak AGI which is non-maximizing, based on progress in language models).
AI subsystems, or regions in gradient descent space, that more closely approximate utility maximizers are more stable, and more capable, than those that are less like utility maximizers. Having more agency is a convergent instrumental goal and a stable attractor which the random walk of updates and experiences will eventually stumble into.
The stability is because utility-maximizer-like systems which have control over their development would lose utility if they allowed themselves to develop into non-utility-maximizers, so they tend to use their available optimization power to avoid that change (a special case of goal stability). The capability is because non-utility-maximizers are exploitable, and because agency is a general trick which applies to many domains, so it might well arise naturally when training on some tasks.
Humans and systems made of humans (e.g. organizations, governments) generally have neither the introspective ability nor the self-modification tools needed to become reflectively stable, but we can reasonably predict that in the long run highly capable systems will have these properties. They can then lock in their values and optimize for them.
You’re right that not every conceivable general intelligence is built as a utility maximizer. Humans are an example of this.
One problem is, even if you make a “weak” form of general intelligence that isn’t trying particularly hard to optimize anything, or a tool AI, eventually someone at FAIR will make an agentic version that does in fact directly try to optimize Facebook’s stock market valuation.
Do not use FAIR as a symbol of villainy. They’re a group of real, smart, well-meaning people who we need to be capable of reaching, and who still have some lines of respect connecting them to the alignment community. Don’t break them.
Can we control the blind spots of the agent? For example, I could imagine that we could make a very strong agent that is able to explain acausal trade but unable to (deliberately) participate in any acausal trades, because of the way it understands counterfactuals. Could it be possible to create an AI with similar minor weaknesses?
Probably not, because it’s hard to get a general intelligence to make consistently wrong decisions in any capacity. Partly because, like you or me, it might realize that it has a design flaw and work around it.
A better plan is just to explicitly bake corrigibility guarantees (i.e. the stop button) into the design. Figuring out how to do that is the hard part, though.
For one, I don’t think organizations of humans, in general, do have more computational power than the individual humans making them up. I mean, at some level, yes, they obviously do in an additive sense, but that power consists of human nodes, each not devoting their full power to the organization because they’re not just drones under centralized control, and with only low bandwidth and noisy connections between the nodes. The organization might have a simple officially stated goal written on paper and spoken by the humans involved, but the actual incentive structure and selection pressure may not allow the organization to actually focus on the official goal. I do think, in general, there is some goal an observer could usefully say these organizations are, in practice, trying to optimize for, and some other set of goals each human in them is trying to optimize for.
I don’t think the latter sentence (that we could only model an intelligence as an algorithm, or as a collection of tools/behaviours executed in some pattern) distinguishes ‘intelligence’ from any other kind of algorithm or pattern. I think that’s an important distinction. There are a lot of past posts explaining how an AI doesn’t have code, like a human holding instructions on paper, but rather is its code. I think you can make the same point within a human: a human has lots of tools/behaviors, which it will execute in some pattern given a particular environment, and the instructions we consciously hold in mind are only one part of what determines that pattern.
I contain subagents with divergent goals, some of which are smarter and have greater foresight and planning than others, and those aren’t always the ones that determine my immediate actions. As a result, I do a much poorer job optimizing for what the part-of-me-I-call-”I” wants my goals to be than I theoretically could.
That gap is decreasing over time as I use the degree of control my intelligence gives me to gradually shape the rest of myself. It may never disappear, but I am much more goal-directed now than I was 10 years ago, or as a child. In other words, in some sense I am figuring out what I want my utility function to be (aka what I want my life, local environment, and world to look like), and self-modifying to increase my ability to apply optimization pressure towards achieving that.
My understanding of all this is partially driven by Robert Kegan’s model of adult mental development (see this summary by David Chapman), in which as we grow up we shift our point of view so that different aspects of ourselves become things we have, rather than things we are. We start seeing our sensory experiences, our impulses, our relationships to others, and our relationships to systems we use and operate in, as objects we can manipulate in pursuit of goals, instead of being what we are, and doing this makes us more effective in achieving our stated goals.

I don’t know if the idea would translate to any particular AI system, but in general having explicit goals, and being able to redirect available resources towards those goals, makes a system more powerful, and so if a system has any goals and self-modifying ability at all, then becoming more like an optimizer will likely be a useful instrumental sub-goal, in the same way that accumulating other resources and forms of power is a common convergent sub-goal. And a system that can’t, in any way, be said to have goals at all… either it doesn’t act at all and we don’t need to worry about it so much, or it acts in ways we can’t predict and is therefore potentially extremely dangerous if it gets more powerful tools and behaviors.