Strong upvote for bringing more brain-related discussion into the alignment problem, and for the deep perspective on how to do it!
Disclaimer: I have not read the earlier posts but I have been researching BCI and social neuroscience.
I think there’s a Steering Subsystem circuit upstream of jealousy and schadenfreude; and
I think there’s a Steering Subsystem circuit upstream of our sense of compassion for our friends.
It is quite crucial to realize that these subsystems are probably organized very differently, operate at very different magnitudes, and serve very different functions in each individual human. Cognitive neuroscience’s view of the modular brain seems (going by the research) to be quite faulty, whereas computational/complexity neuroscience generally seems more successful and more concerned with reverse-engineering the brain, i.e. identifying the neural circuitry associated with different evolved behaviours. This should tell us that we cannot simply find “the Gaussian brain” and implement it.
You also mention in post #3 that you have the clean-slate vs. pre-slate systems (learning vs. steering subsystems), and a less clear-cut distinction here might be more helpful than modularization. All learning subsystems are inherently organized in ways that evolutionarily seem to fit that learning scheme (think neural network architectures), which is in and of itself another pre-seeded mechanism. You might have pointed this out in earlier posts as well; sorry if that’s the case!
I think I’m unusually inclined to emphasize the importance of “social learning by watching people”, compared to “social learning by interacting with people”.
In general, it seems babies learn quite a lot better from social interaction than from pure watching. And between the two, they definitely learn better if they can imitate what they’re seeing. There’s definitely a good point about the speed differential between human and AGI existence, and I think exploring the opportunity of building a crude “loving family” simulator might not be a bad idea, i.e. have it grow up at its own speed in an OpenAI Gym simulation with relevant modalities of loving family.
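To gesture at what I mean, here is a deliberately crude sketch following a Gym-style reset/step interface; the caregiver demo strings, action names and reward numbers are all invented for illustration, not a concrete proposal:

```python
import random

class LovingFamilyEnv:
    """Toy environment following the Gym reset/step convention: a single
    'caregiver' demonstrates behaviours, and the only reward signal the
    agent ever sees is crude social feedback from that caregiver."""

    DEMOS = ["wave", "stack_blocks", "speak"]
    ACTIONS = ["imitate", "explore", "ignore"]

    def reset(self):
        self.t = 0
        return {"caregiver_demo": random.choice(self.DEMOS)}

    def step(self, action):
        self.t += 1
        # Stand-in for social feedback: imitation is encouraged,
        # ignoring the caregiver is mildly discouraged.
        reward = {"imitate": 1.0, "explore": 0.1, "ignore": -0.5}[action]
        obs = {"caregiver_demo": random.choice(self.DEMOS)}
        done = self.t >= 100  # one short "childhood" episode
        return obs, reward, done, {}

# env = LovingFamilyEnv(); obs = env.reset(); obs, r, done, info = env.step("imitate")
```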
I’m generally pro the “get AGI to grow up in a healthy environment” approach, but definitely with the perspective that this is pretty hard to do with the Jelling Stones, and that it seems plausible to simulate this either in an environment or with pure training data. But the point there is that the training data really needs to be thought of as the “loving family” in its most general sense, since it does have a large influence on the AGI’s outcomes.
But great work, and I’m excited to read the rest of the posts in this series! Of course, I’m open to discussion on these points as well.
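Thanks!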
It is quite crucial to realize that these subsystems are probably organized very differently, operate at very different magnitudes, and serve very different functions in each individual human.
To be clear, you & I are strongly disagreeing on this point. I think that an adult human’s conception of jealousy and associated behavior comes from an interaction between Steering Subsystem (hypothalamus & brainstem) circuitry (which is essentially identical in every human) and Learning Subsystem (e.g. neocortex) trained models (which can be quite different in different people, depending on life experience).
Cognitive neuroscience’s view of the modular brain seems (going by the research) to be quite faulty
I claim that the hypothalamus and brainstem have genetically-hardcoded, somewhat-modular, human-universal structure. I haven’t seen anyone argue against that, have you? I think I’m pretty familiar with criticism of “cognitive neuroscience’s view of the modular brain”, and tend to strongly agree with it, because the criticism is of there being things like “an intuitive physics module” and “an intuitive psychology module” etc. in the neocortex. Not the brainstem, not the hypothalamus, but the neocortex. And I endorse that criticism because I happen to be a big believer in “cortical uniformity”. I think the neocortex learns physics and psychology and everything else from scratch, although there’s also a role for different neural architectures and hyperparameters in different places and at different stages of development.
On the hypothalamus & brainstem side: A good example of what I have in mind is (if I understand correctly, see Graebner 2015): there’s a particular population of neurons in the hypothalamus which seems to implement the following behavior: “If I’m under-nourished, do the following tasks: (1) emit a hunger sensation, (2) start rewarding the neocortex for getting food, (3) reduce fertility, (4) reduce growth, (5) reduce pain sensitivity, etc.” I expect that every non-brain-damaged rat has a similar population of neurons, in the same part of their hypothalamus, wired up the same way (possibly with some variations depending on their genome, which incidentally would probably also wind up affecting the rats’ obesity etc.)
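As a cartoon of the kind of circuit I mean, in code, with every field name made up and the only point being that the mapping is specified by the genome rather than learned:

```python
def undernourishment_circuit(body_state):
    """Cartoon of the hypothalamic program described above, NOT a model of any
    real neuron population: a fixed, genetically-specified mapping from
    interoceptive inputs to a bundle of outputs. Nothing in it is learned."""
    outputs = {}
    if body_state["energy_reserves"] < body_state["energy_setpoint"]:
        outputs["hunger_sensation"] = True        # (1) emit a hunger sensation
        outputs["reward_for_getting_food"] = 1.0  # (2) reward the neocortex for getting food
        outputs["fertility"] = "reduced"          # (3)
        outputs["growth"] = "reduced"             # (4)
        outputs["pain_sensitivity"] = "reduced"   # (5)
    return outputs
```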
You also mention in post #3 that you have the clean-slate vs. pre-slate systems (learning vs. steering subsystems), and a less clear-cut distinction here might be more helpful than modularization. All learning subsystems are inherently organized in ways that evolutionarily seem to fit that learning scheme (think neural network architectures), which is in and of itself another pre-seeded mechanism. You might have pointed this out in earlier posts as well; sorry if that’s the case!
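Yeah, maybe see Section 2.3.1, “Learning-from-scratch is NOT “blank slate””.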
That sounds great but I don’t understand what you’re proposing. What are the “relevant modalities of loving family”? I thought the important thing was there being an actual human that could give feedback and answer questions based on their actual human values, and these can’t be simulated because of the chicken-and-egg problem.
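Thank you for the comprehensive response!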
To be clear, you & I are strongly disagreeing on this point.
It seems like we are mostly agreeing in the general sense that there are areas of the brain with more individual differentiation and areas with less. The disagreement is probably more about how differently this jealousy is exhibited as a result of the neocortical part of the circuit you mention.
Great to hear, then I have nothing to add there! I am quite inclined to believe that the neural architecture and hyperparameter differences are underestimated as a result of Brodmann areas being a thing at all, i.e. I’m a supporter of the broad cortical uniformity argument but against the strict formulation that I feel is way too prescriptive given our current knowledge of the brain’s functions.
And I will say I’m generally inclined to agree with your classification of the brain stem and hypothalamus.
That sounds great but I don’t understand what you’re proposing. What are the “relevant modalities of loving family”? I thought the important thing was there being an actual human that could give feedback and answer questions based on their actual human values, and these can’t be simulated because of the chicken-and-egg problem.
To be clear, my baseline would also be to follow your methodology, but I think there’s a lot of opportunity in the “nurture” approach as well. This is mostly related to the idea of open-ended training (e.g. like AlphaZero) and creating a game-like environment in which to train the agent. This can to some degree be seen as a sort of IDA proposal, since your environment will need to be very complex (e.g. have other agents that are kind or exhibit some other “aligned” trait, possibly trained from earlier states).
With this sort of setup, the human giving feedback is the designer of the environment itself, leading to a form of scalable human oversight, probably iterating over many environments and agents, i.e. the IDA part of the idea. And again, there are a lot of holes in this plan, but I feel like it should not be dismissed outright. This post should also inform this process. So it’s a very broad “loving family” proposal, though the name itself doesn’t seem adequate for this sort of approach ;)
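Thanks!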
The disagreement is probably more about how differently this jealousy is exhibited as a result of the neocortical part of the circuit you mention
I think the behavior involves both the hypothalamus/brainstem and the neocortex, but the circuit I’m talking about would be entirely in the hypothalamus/brainstem.
In the normal RL picture, reward function + RL algorithm + environment = learned behavior. The reward function (Steering Subsystem) is separate from the RL algorithm (Learning Subsystem), not only conceptually, but also in actual RL code, and (I claim) anatomically in the brain too. Still, both the reward function and the RL algorithm are inputs into the adult’s jealousy-related behavior, as is the culture they grow up in etc.
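As a toy illustration of that separation (an arbitrary little 1-D environment and tabular Q-learner, nothing brain-specific): the reward function is one hand-written module, the learning rule is another, and the learned behaviour falls out of combining them with the environment.

```python
import random
from collections import defaultdict

# "Steering Subsystem" analogue: a fixed, hand-written reward function.
def reward_fn(next_state):
    return 1.0 if next_state == 3 else 0.0

# "Learning Subsystem" analogue: a generic Q-learner that only ever sees reward numbers.
Q = defaultdict(float)
ACTIONS = (-1, +1)

def env_step(state, action):  # toy 1-D "environment": positions 0..3
    return max(0, min(3, state + action))

for episode in range(500):
    s = 0
    while s != 3:
        # epsilon-greedy action selection
        if random.random() < 0.1:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = env_step(s, a)
        r = reward_fn(s2)  # reward computed by the separate "Steering" module
        Q[(s, a)] += 0.1 * (r + 0.9 * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# Swapping in a different reward_fn changes the learned behaviour
# without touching the learning code at all.
```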
I am quite inclined to believe that the neural architecture and hyperparameter differences are underestimated as a result of Brodmann areas being a thing at all, i.e. I’m a supporter of the broad cortical uniformity argument but against the strict formulation that I feel is way too prescriptive given our current knowledge of the brain’s functions.
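Does anyone believe in “the strict formulation”? I feel like maybe it’s a strawman. For example, here’s Jeff Hawkins: “Mountcastle’s proposal that there is a common cortical algorithm doesn’t mean there are no variations. He knew that. The issue is how much is common in all cortical regions, and how much is different. The evidence suggests that there is a huge amount of commonality.”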
This is mostly related to the idea of open-ended training (e.g. like AlphaZero) and creating a game-like environment in which to train the agent…
I’m still not following this :(
Maybe you’re suggesting that when we’ve trained an AGI to have something like human values, we can make an environment where that first AGI can hang out with younger AGIs and show them the ropes. But once we’ve trained the first AGI, we’re done, right? And if the first AGI doesn’t quite have “human values”, it seems to me that the subsequent generations of AGIs would drift ever farther from “human values”, rather than converging towards “human values”, unless I’m missing something.
“Open-ended training” and “complex environments” seem to me like they would be important ingredients for capabilities, but not particularly relevant to alignment.
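(Also, Post #8, Section 8.3.3.1 is different but maybe slightly related.)
RE Alignment and Deep Learning: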
If we’re talking about “Social-Instinct AGIs”, I guess we’re supposed to imagine that a toddler gets a lot of experience interacting with NPCs in its virtual environment, and the toddler gets negative reward for inhibiting the NPCs from accomplishing their goals, and positive reward for helping the NPCs accomplish their goals, or something like that. Then later on, the toddler interacts with humans, and it will know to be helpful right off the bat, or at least after less practice. Well, I guess that’s not crazy. I guess I would feel concerned that we wouldn’t do a good enough job programming the NPCs, such that the toddler-AGI learns weird lessons from interacting with them, lessons which don’t generalize to humans in the way we want.
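In toy form, the reward scheme described above might look something like this, where the NPC “progress” numbers are assumed to come from however the NPCs happen to be scripted:

```python
def social_reward(npc_progress_before, npc_progress_after):
    """Toy version of the reward scheme sketched above: score the toddler-AGI
    by the change it causes in an NPC's progress toward the NPC's own goal,
    so helping is positive and obstructing is negative."""
    return npc_progress_after - npc_progress_before

# e.g. social_reward(0.2, 0.6) == +0.4 (helped); social_reward(0.6, 0.1) == -0.5 (obstructed)
```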
If we’re talking about “Controlled AGIs”, I would just have the normal concern that the AGI would wind up with the wrong goal, and that the problem would manifest as soon as we go out of distribution. For example, the AGI will eventually get new possibilities in its action space that were not available during training, such as the possibility of wireheading itself, the possibility of wireheading the NPC, the possibility of hacking into AWS and self-reproducing, etc. All those possibilities might or might not be appealing (positive valence), depending on details of the AGI’s learned world-model and its history of credit assignment. To be clear, I’m making an argument that it doesn’t solve the whole problem, not an argument that it’s not even a helpful ingredient. Maybe it is, I dunno. I’ll talk about the out-of-distribution problem in Post #14 of the series.
[...]Still, both the reward function and the RL algorithm are inputs into the adult’s jealousy-related behavior[...]
I probably just don’t know enough about jealousy networks to comment here but I’d be curious to see the research here (maybe even in an earlier post).
Does anyone believe in “the strict formulation”?
Hopefully not, but as I mentioned, it’s often a too-strict formulation imho.
[...]first AGI can hang out with younger AGIs[...]
More the reverse. And again, this is probably taking it farther than I would take the idea, but it would be pre-AGI training in an environment with symbolic “aligned” models, learning the ropes from them, then being used as the “aligned” model in the next generation, and so on. IDA with a heavy RL twist, and scalable human oversight in the sense that humans would monitor rewards and environment states instead of providing feedback on every single action. Very flawed, but possible.
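A very rough sketch of the loop I have in mind, where every function below is a placeholder rather than any real training API:

```python
# Every function below is a made-up placeholder; only the shape of the loop matters.

def train_rl_agent(env):  # stand-in for any open-ended RL trainer
    return {"policy": "trained against " + env["name"]}

def humans_approve(reward_logs, state_logs):  # oversight of logs, not per-action feedback
    return True

def iterated_nurture_training(n_generations):
    npcs = ["hand-built symbolic 'aligned' model"]  # generation 0
    agent = None
    for gen in range(n_generations):
        env = {"name": f"gen-{gen} playground", "npcs": npcs,
               "reward_logs": [], "state_logs": []}
        agent = train_rl_agent(env)  # RL against the current NPC population
        if not humans_approve(env["reward_logs"], env["state_logs"]):
            break  # humans monitor rewards/states between generations
        npcs = [agent]  # this agent becomes the next generation's "aligned" NPC
    return agent

print(iterated_nurture_training(3))
```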
RE your “RE Alignment and Deep Learning”:
Yeah, this is a lot of what the above proposal was also about.
[...] and the toddler gets negative reward for inhibiting the NPCs from accomplishing their goals, and positive reward for helping the NPCs accomplish their goals [...]
As far as I understand from the post, the reward comes only from understanding the reward function before the interaction and not after, which is the controlling factor for obstructionist behaviour.
Agreed, and again more as an ingredient in the solution than an end in itself. BNN OOD (out-of-distribution) management is quite interesting, so I’m looking forward to that post!
I probably just don’t know enough about jealousy networks to comment here but I’d be curious to see the research here (maybe even in an earlier post).
I don’t think “the research here” exists. I’ll speculate a bit in the next post.
Does anyone believe in “the strict formulation”?
Hopefully not, but as I mentioned, it’s often a too-strict formulation imho.
Can you point to any particular person who believes in “a too-strict formulation” of cortical uniformity? Famous or not. What did they say? Just curious.
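(Or maybe you’re talking about me?)
Any thoughts on how to make those?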
I think he’s thinking of like, NPCs via behavior-cloning co-op MMO players or something. Like it won’t teach all of human values, but plausibly it would teach “the golden rule” and other positive sum things.
(I don’t think that literal strategy works, but behavior-cloning elementary school team sports might get at a surprising fraction of “normal child cooperative behaviors”?)