Claim: If there’s a way to build AGI, and there’s nothing in particular about its source code or training process that would lead to an intrinsic tendency to kindness as a terminal goal, then the resulting AGI will not wind up with an intrinsic tendency to kindness as a terminal goal.
The end product of a training run is a result of the source code, the training process, and the training data. For example, the [TinyStories](https://huggingface.co/roneneldan/TinyStories-33M) model can tell stories, but if you look through the [training or inference code](https://github.com/EleutherAI/gpt-neox) or configuration for the part that gives it the specific ability to tell stories, rather than the ability to write code or play [Othello](https://thegradient.pub/othello/), you will not find anything. (A concrete sketch of this point follows at the end of this comment.)
As such, the claim would break down into:

1. There’s nothing in particular about the source code of an AI which would lead to an intrinsic tendency to kindness as a terminal goal.
2. There’s nothing in particular about the training process (interpreted narrowly, as in “the specific mechanism by which weights are updated”) of an AI which would lead to an intrinsic tendency to kindness as a terminal goal.
3. There’s nothing in particular about the training data of an AI which would lead to an intrinsic tendency to kindness as a terminal goal.
So then the question is “for the type of AI which learns to generalize well enough to be able to model its environment, build and use tools, and seek out new data when it recognizes that its model of the environment is lacking in some specific area, will the training data end up chiseling an intrinsic tendency to kindness into the cognition of that AI?”
It is conceivable that the answer is “no” (as in your example of sociopaths). However, I expect that an AI like the above would be trained at least somewhat based on real-world multi-agent interactions, and I would be a bit surprised if “every individual is a sociopath and it is impossible to reliably signal kindness” was the only equilibrium state, or even the most probable equilibrium state, for real-world multi-agent interactions.
ETA for explicitness: even in the case of brain-like model-based RL, the training data has to come from somewhere, so if we care about the end result, we still have to care about the process which generates that training data.
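As a concrete illustration of the TinyStories point above, here is a minimal sketch using the Hugging Face `transformers` library. Assumptions (not from the thread itself): a standard Python environment with `transformers` and `torch` installed, and, per the model card, the GPT-Neo tokenizer. Nothing in this code mentions storytelling; the same generic loading-and-sampling code would run a code model or an Othello model just as happily, because the storytelling ability lives in the trained weights.

```python
# Minimal sketch: generic causal-LM loading and sampling code.
# The ability to tell stories is nowhere in this source code or configuration;
# it lives entirely in the trained weights of the checkpoint we happen to load.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M")
# The TinyStories model card points to the GPT-Neo tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

prompt = "Once upon a time there was a little robot who"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swap the checkpoint name for any other causal LM and the code is unchanged; only the behavior changes.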
Yeah, I meant “training process” to include training data and/or training environment. Sorry I didn’t make that explicit.
Here are three ways to pass the very low bar of “there’s at least prima facie reason to think that kindness might arise non-coincidentally and non-endogenously”, and whether I think those reasons actually stand up to scrutiny:
1. “The AIs are LLMs, trained mostly by imitative learning on human data, and humans are nice sometimes.” I don’t have an opinion about whether this argument is sound; it’s not my area (I focus on brain-like model-based RL). It does seem to be quite a controversy; see for example here. (Note that model-based RL AIs can imitate, but they do so in a fundamentally different way from LLM pretraining.)
2. “The AIs are model-based RL, and they have other agents in their training environment.” I don’t think this argument works, because I think intrinsic kindness drives are things that need to exist in the AI’s reward function, not just the learned world-model and value function. See for example this comment, which points out (among other things) that if AlphaZero had other agents in its training environment (and not just copies of itself), it wouldn’t learn kindness. Likewise, we have pocket calculators in our training environment, and we learn to appreciate their usefulness and to skillfully interact with them and repair them when broken, but we don’t wind up feeling deeply connected to them :)
3. “The AIs are model-based RL, and the reward function will not be programmed by a human, but rather discovered by a process analogous to animal evolution.” This isn’t impossible, and it would be a truly substantive argument if true, but my bet is against it actually happening, mainly because it’s extremely expensive to run outer loops around ML training like that (see the toy sketch below), and meanwhile human programmers are perfectly capable of writing effective reward functions; they do it all the time in the RL literature today. I also think humans writing the reward function has the potential to turn out better than allowing an outer-loop search to write it, if only we can figure out what we’re doing; cf. here, especially the subsection “Is it a good idea to build human-like social instincts by evolving agents in a social environment?”.
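To make the cost point concrete, here is a toy sketch, entirely hypothetical (the function names and numbers are made up, and real proposals differ in detail), of what an outer-loop search over reward functions involves. The key feature is the nesting: every single fitness evaluation in the outer loop requires a complete RL training run in the inner loop, so the total cost is roughly population size × number of generations full training runs.

```python
import random

def train_agent(reward_params):
    # Placeholder for an entire model-based RL training run driven by this
    # candidate reward function. In reality, this one call is the expensive part.
    return {"reward_params": reward_params}

def social_fitness(agent):
    # Placeholder for whatever behavioral test the outer loop scores agents on
    # (e.g. cooperation in a multi-agent environment). Dummy score here.
    return sum(agent["reward_params"])

def evolve_reward_function(population_size=50, generations=200, dim=16):
    # Outer loop analogous to evolution: roughly population_size * generations
    # full training runs (10,000 with these made-up numbers).
    population = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: social_fitness(train_agent(p)), reverse=True)
        survivors = ranked[: population_size // 2]
        children = [[w + random.gauss(0, 0.1) for w in random.choice(survivors)]
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    return population[0]
```

Compare with the human-written alternative: a single training run with a hand-coded reward function.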
> See for example this comment, which points out (among other things) that if AlphaZero had other agents in its training environment (and not just copies of itself), it wouldn’t learn kindness.

AlphaZero is playing a zero-sum game; as such, I wouldn’t expect it to learn anything along the lines of cooperativeness or kindness, because the only way it can win is if other agents lose, and the amount it wins is the same amount that other agents lose.
If AlphaZero were trained on a non-zero-sum game (e.g. in an environment where some agents are trying to win a game of Go, and others are trying to ensure that the board has a smiley-face made of black stones on a background of white stones somewhere on it), it would learn how to model the preferences of other agents and figure out ways to achieve its own goals in a way that also allows the other agents to achieve theirs.
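(To keep the zero-sum/non-zero-sum distinction in this exchange concrete: here is a purely illustrative pair of toy payoff matrices with made-up numbers. They are not meant to model Go or the smiley-face setup, only the definitional point that in a zero-sum game one player’s gain is exactly the other’s loss, while a non-zero-sum game can contain outcomes that are good for both players at once.)

```python
# Toy payoff matrices (made-up numbers). Entries: (player 1 payoff, player 2 payoff).

# Zero-sum (Go-like): whatever one player gains, the other loses.
zero_sum = {
    ("A", "A"): (1, -1), ("A", "B"): (-1, 1),
    ("B", "A"): (-1, 1), ("B", "B"): (1, -1),
}
assert all(u1 + u2 == 0 for u1, u2 in zero_sum.values())

# Non-zero-sum: some outcomes are good for both players at once, so there is
# something to gain from modeling and accommodating the other agent's preferences.
non_zero_sum = {
    ("A", "A"): (3, 3), ("A", "B"): (0, 2),
    ("B", "A"): (2, 0), ("B", "B"): (1, 1),
}
assert any(u1 + u2 != 0 for u1, u2 in non_zero_sum.values())
```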
> I think intrinsic kindness drives are things that need to exist in the AI’s reward function, not just the learned world-model and value function.

I think this implies that if one wanted to figure out why sociopaths are different from neurotypical people, one should look for differences in the reward circuitry of the brain rather than the predictive circuitry. Do you agree with that?
> AlphaZero is playing a zero-sum game; as such, I wouldn’t expect it to learn anything along the lines of cooperativeness or kindness, because the only way it can win is if other agents lose, and the amount it wins is the same amount that other agents lose.

OK, well, AlphaZero doesn’t develop hatred and envy either, but now this conversation is getting silly.
> If AlphaZero were trained on a non-zero-sum game (e.g. in an environment where some agents are trying to win a game of Go, and others are trying to ensure that the board has a smiley-face made of black stones on a background of white stones somewhere on it), it would learn how to model the preferences of other agents and figure out ways to achieve its own goals in a way that also allows the other agents to achieve theirs.

I’m not sure why you think that. It would learn to anticipate its opponent’s moves, but that’s different from accommodating its opponent’s preferences, unless the opponent has ways to exact revenge? Actually, I’m not sure I understand the setup you’re trying to describe. Which type of agent is AlphaZero in this scenario? What’s the reward function it’s trained on? The “environment” is still a single Go board, right?
Anyway, I can think of situations where agents repeatedly interact in a non-zero-sum setting but where the parties don’t do anything that looks or feels like kindness over and above optimizing their own interest. One example is the interaction between craft brewers and their yeast. (I think it’s valid to model yeast as having goals and preferences in a behaviorist sense.)
> I think this implies that if one wanted to figure out why sociopaths are different from neurotypical people, one should look for differences in the reward circuitry of the brain rather than the predictive circuitry. Do you agree with that?

OK, low confidence on all this, but I think some people get an ASPD diagnosis purely for having an anger disorder, whereas the central ASPD person has some variant of “global under-arousal” (which can probably have any number of upstream root causes). That’s what I was guessing here; see also here (“The best physiological indicator of which young people will become violent criminals as adults is a low resting heart rate, says Adrian Raine of the University of Pennsylvania. … Indeed, when Daniel Waschbusch, a clinical psychologist at Penn State Hershey Medical Center, gave the most severely callous and unemotional children he worked with a stimulative medication, their behavior improved”).
Physiological arousal affects all kinds of things, and certainly does feed into the reward function, at least indirectly and maybe also directly.
There’s an additional complication: I think social instincts are in the same category as the curiosity drive, in that they involve the reward function taking (some aspects of) the learned world-model’s activity as an input, unlike typical RL reward functions, which depend purely on exogenous inputs such as Atari points (see “Theory 2” here). So that also complicates the picture of where we should be looking to find a root cause. (A schematic sketch of this distinction follows at the end of this comment.)
So yeah, I think the reward is a central part of the story algorithmically, but that doesn’t necessarily imply that the so-called “reward circuitry of the brain” (by which people usually mean VTA/SNc or sometimes NAc) is the spot where we should be looking for root causes. I don’t know the root cause; again there might be many different root causes in different parts of the brain that all wind up feeding into physiological arousal via different pathways.
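Here is a schematic sketch of the distinction two paragraphs up, between a typical RL reward that depends only on exogenous signals and a curiosity-style or social-instinct-style reward that also reads the learned world-model’s activity. Everything in it (names, fields, the particular combination of terms) is made up purely to illustrate the difference in what the reward function takes as input.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EnvObservation:
    atari_points: float  # exogenous signal coming in from the environment

@dataclass
class WorldModelState:
    prediction_error: float  # e.g. "surprise", for a curiosity-style drive
    other_agent_features: List[float] = field(default_factory=list)  # model's representation of another agent

def typical_reward(obs: EnvObservation) -> float:
    # Depends only on exogenous inputs (e.g. points scored).
    return obs.atari_points

def social_term(features: List[float]) -> float:
    # Hypothetical stand-in for whatever a social-instinct reward would compute
    # from the world-model's representation of other agents.
    return 0.0

def curiosity_or_social_reward(obs: EnvObservation, wm: WorldModelState) -> float:
    # Also reads (some aspects of) the learned world-model's activity.
    return obs.atari_points + 0.1 * wm.prediction_error + social_term(wm.other_agent_features)
```

The only point is the signatures: in the second case, whatever circuit computes reward has to have access to part of the world-model’s internal state, which is part of why “where to look for a root cause” gets complicated.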