This is an excellent question. I’d say the main reason is that all of the AI/ML systems we have built to date are utility maximizers; that’s the mathematical framework in which they have been designed. Neural nets / deep learning work by using a simple optimizer, gradient descent, to find the minimum of a loss function. Evolutionary algorithms, simulated annealing, etc. find the minimum (or maximum) of a “fitness function”. We don’t know of any other way to build systems that learn.
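As a toy illustration of what “minimizing a loss function via gradient descent” means here (the quadratic loss, starting point, and step size below are arbitrary choices for the sketch, not anything specific to real systems):

```python
# Minimal sketch: gradient descent as loss minimization.
# The loss, its gradient, and the step size are illustrative choices.

def loss(w):
    # A toy quadratic loss with its minimum at w = 3.0
    return (w - 3.0) ** 2

def grad(w):
    # Analytic gradient of the toy loss
    return 2.0 * (w - 3.0)

w = 0.0    # arbitrary starting point
lr = 0.1   # learning rate (step size)
for step in range(100):
    w -= lr * grad(w)   # move against the gradient

print(w, loss(w))   # w approaches 3.0, loss approaches 0
```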
Humans themselves evolved to maximize reproductive fitness. That is our primary fitness function, but our genes have encoded a variety of secondary functions which (over evolutionary time) have been correlated with reproductive fitness. Our desires for love, friendship, happiness, etc. fall into this category. Our brains mainly work to satisfy these secondary functions; the brain gets electrochemical reward signals, controlled by our genes, in the form of pain, pleasure, satisfaction, loneliness, etc. These secondary functions may or may not remain aligned with the primary fitness function, which is why practitioners talk about “mesa-optimizers” and “inner vs. outer alignment.”
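A toy sketch of how a secondary (proxy) objective can come apart from the primary one; the “foods”, numbers, and environments are entirely made up for illustration:

```python
# Toy sketch of a proxy ("secondary") objective drifting from the primary one.
# "Outer" objective: nutrition. "Inner"/proxy objective: sweetness.
ancestral_foods = {"fruit": {"sweetness": 0.8, "nutrition": 0.7},
                   "roots": {"sweetness": 0.2, "nutrition": 0.5}}
modern_foods    = {"soda":  {"sweetness": 0.9, "nutrition": 0.0},
                   "salad": {"sweetness": 0.1, "nutrition": 0.6}}

def proxy_policy(foods):
    # The agent maximizes the proxy (sweetness), not nutrition directly.
    return max(foods, key=lambda f: foods[f]["sweetness"])

for env in (ancestral_foods, modern_foods):
    choice = proxy_policy(env)
    print(choice, env[choice]["nutrition"])
# In the ancestral environment the proxy also maximizes nutrition;
# in the modern environment it does not.
```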
Agreed. Humans are constantly optimizing a reward function, but it sort of ‘changes’ from moment to moment in a near-focal (myopic) way, so their behavior often looks irrational or self-defeating; once you know what the reward function is, though, the goal-directedness is easy to see too.
Sune seems to think that humans are more intelligent than they are goal-directed. I’m not sure this is true; human truth-seeking processes seem about as flawed and limited as their goal-pursuit. Maybe you can argue that humans are neither generally intelligent nor rational, but I don’t think you can justify setting the goalposts so that they’re one of those things and not the other.
You might be able to argue that human civilization is intelligent but not rational, and that functioning AGI will be more analogous to an ecosystem of agents than to one unified agent. If you can argue for that, that’s interesting, but I don’t know where to go from there. Civilizations tend towards increasing unity over time (a continual reduction in the energy wasted on conflict), and I doubt that the goals they converge on together will be a form of human-favoring altruism. I haven’t seen anyone try to argue for that in a rigorous way.
Agreed. Humans are constantly optimizing a reward function, but it sort of ‘changes’ from moment to moment in a near-focal (myopic) way, so their behavior often looks irrational or self-defeating; once you know what the reward function is, though, the goal-directedness is easy to see too.
Doesn’t this become tautological? If the reward function changes from moment to moment, then the reward function can just be whatever explains the behaviour.
Since everything can fit into the “agent with utility function” model given a sufficiently crumpled utility function, I guess I’d define “is an agent” as “goal-directed planning is useful for explaining a large enough part of its behavior.” This includes humans while excluding bacteria. (Hmm, unless, like me, one knows so little about bacteria that it’s better to just model them as weak agents. Puzzling.)
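The “sufficiently crumpled utility function” point can be made concrete: for any observed deterministic behavior you can write down a utility function that the system maximizes by construction, which is exactly why the definition has to appeal to usefulness rather than mere existence. A minimal sketch, with a made-up behavior log:

```python
# Any deterministic behavior can be cast as utility maximization by construction.
# The "utility function" here carries no explanatory power -- that's the point.

observed_policy = {"hungry": "wander", "full": "sleep"}   # made-up behavior log

def crumpled_utility(state, action):
    # Assign utility 1 to whatever the system was observed doing, 0 otherwise.
    return 1.0 if observed_policy.get(state) == action else 0.0

actions = ["wander", "sleep", "plan"]
for state in observed_policy:
    best = max(actions, key=lambda a: crumpled_utility(state, a))
    assert best == observed_policy[state]   # "maximizes utility", trivially
```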
On the other hand, the development of religion, morality, and universal human rights also seems to be a product of civilization, driven by the need for many people to coordinate and coexist without conflict. More recently, these ideas have expanded to include laws that establish nature reserves and protect animal rights. I personally am beginning to think that taking an ecosystem/civilizational approach, with a mixture of intelligent agents (human, animal, and AGI), might be a way to solve the alignment problem.
Does the inner / outer distinction complicate the claim that all current ML systems are utility maximizers? The gradient descent algorithm performs a simple kind of optimization in the training phase. But once the model is trained and in production, it doesn’t seem obvious that the “utility maximizer” lens is always helpful in understanding its behavior.
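One way to see the worry: the explicit optimization loop only exists at training time, while the deployed model is just a fixed function being evaluated. A minimal sketch under that framing (the tiny linear model, data, and hyperparameters are placeholders, not anyone’s actual system):

```python
# Sketch: optimization happens in training; deployment is just a forward pass.
import random

# Toy linear "model" y = w * x trained to fit y = 2x.
data = [(x, 2.0 * x) for x in range(-5, 6)]
w = random.uniform(-1.0, 1.0)

# Training phase: explicit gradient descent on a squared-error loss.
for _ in range(200):
    x, y = random.choice(data)
    grad = 2.0 * (w * x - y) * x
    w -= 0.01 * grad

# Deployment phase: no loss, no optimizer -- just evaluate the learned function.
def model(x):
    return w * x

print(model(3.0))   # roughly 6.0; nothing is being "maximized" at this point
```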