Thank you for the comprehensive response!
To be clear, you & I are strongly disagreeing on this point.
It seems like we are mostly agreeing in the general sense that there are areas of the brain with more individual differentiation and areas with less. The disagreement is probably more about how differently this jealousy is exhibited as a result of the neocortical part of the circuit you mention.
Great to hear, then I have nothing to add there! I am quite inclined to believe that the neural architecture and hyperparameter differences are underestimated as a result of Brodmann areas being a thing at all, i.e. I’m a supporter of the broad cortical uniformity argument but against the strict formulation that I feel is way too prescriptive given our current knowledge of the brain’s functions.
And I will say I’m generally inclined to agree with your classification of the brain stem and hypothalamus.
That sounds great but I don’t understand what you’re proposing. What are the “relevant modalities of loving family”? I thought the important thing was there being an actual human that could give feedback and answer questions based on their actual human values, and these can’t be simulated because of the chicken-and-egg problem.
To be clear, my baseline would also be to follow your methodology, but I think there’s a lot of opportunity in the “nurture” approach as well. This is mostly related to the idea of open-ended training (e.g. like AlphaZero) and creating a game-like environment where it’s possible to train the agent. This can to some degree be seen as a sort of IDA proposal since your environment will need to be very complex (e.g. have other agents that are kind or exhibit other “aligned” traits, possibly trained from earlier states).
With this sort of setup, the human-giving-feedback is the designer of the environment itself, leading to a form of scalable human oversight, probably iterating over many environments and agents, i.e. the IDA part of the idea. And again, there are a lot of holes in this plan, but I feel like it should not be dismissed outright. This post should also inform this process. So a very broad “loving family” proposal, though the name itself doesn’t seem adequate for this sort of approach ;)
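To make the shape of this concrete, here is a minimal toy sketch of the loop I have in mind. Every class name, the fake reward, and the approval check are placeholders I made up for illustration, not a real training setup:

```python
import random

class ToyEnvironment:
    """Stand-in for a rich game-like environment populated with 'aligned' helper agents."""
    def __init__(self, helper_agents):
        self.helper_agents = helper_agents  # agents trained in earlier generations

    def rollout(self, agent, steps=100):
        """Run one episode; return (total_reward, state_log) for human review."""
        total_reward, state_log = 0.0, []
        for t in range(steps):
            state = random.random()              # fake observation
            action = agent.act(state)
            reward = 1.0 - abs(action - state)   # fake "cooperativeness" reward
            total_reward += reward
            state_log.append((t, state, action, reward))
        return total_reward, state_log

class ToyAgent:
    """Stand-in for an RL policy; 'learning' just nudges one parameter."""
    def __init__(self, bias=0.0):
        self.bias = bias

    def act(self, state):
        return min(1.0, max(0.0, state + self.bias))

    def update(self, total_reward):
        self.bias *= 0.9  # fake policy update (ignores the reward; purely illustrative)

def designer_approves(total_reward, state_log):
    """Scalable-oversight hook: humans inspect reward traces and states, not every action."""
    return total_reward > 0  # placeholder check

def train_generations(n_generations=3, episodes_per_gen=10):
    helpers = [ToyAgent(bias=0.0)]  # generation 0: hand-built "aligned" agents
    for gen in range(n_generations):
        env = ToyEnvironment(helper_agents=helpers)
        learner = ToyAgent(bias=random.uniform(-0.5, 0.5))
        for episode in range(episodes_per_gen):
            total_reward, log = env.rollout(learner)
            if not designer_approves(total_reward, log):
                break  # the designer revises the environment rather than labeling actions
            learner.update(total_reward)
        helpers = [learner]  # this generation's agent populates the next environment
    return helpers[-1]

print("final bias:", train_generations().bias)
```

The point is only the structure: previous-generation agents populate the environment, and the human’s leverage is over the environment design and the reward traces rather than over individual actions.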
Thanks!
The disagreement is probably more about how differently this jealousy is exhibited as a result of the neocortical part of the circuit you mention
I think the behavior involves both the hypothalamus/brainstem and the neocortex, but the circuit I’m talking about would be entirely in the hypothalamus/brainstem.
In the normal RL picture, reward function + RL algorithm + environment = learned behavior. The reward function (Steering Subsystem) is separate from the RL algorithm (Learning Subsystem), not only conceptually, but also in actual RL code, and (I claim) anatomically in the brain too. Still, both the reward function and the RL algorithm are inputs into the adult’s jealousy-related behavior, as is the culture they grow up in etc.
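As a toy illustration of that separation in actual RL code (this is made-up example code, not anything from the series): the reward function is one hand-written piece, the learning algorithm is a separate generic piece, and the training loop wires the two together with the environment.

```python
import random

def reward_fn(state, action):
    """'Steering Subsystem' analogue: a fixed, hand-written reward function."""
    return 1.0 if action == state % 2 else 0.0

class ValueLearner:
    """'Learning Subsystem' analogue: a generic bandit-style value learner that
    knows nothing about why the rewards are what they are."""
    def __init__(self, n_states=10, n_actions=2, lr=0.1, eps=0.1):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.lr, self.eps = lr, eps

    def act(self, state):
        if random.random() < self.eps:
            return random.randrange(len(self.q[state]))  # occasional exploration
        return max(range(len(self.q[state])), key=lambda a: self.q[state][a])

    def update(self, state, action, reward):
        self.q[state][action] += self.lr * (reward - self.q[state][action])

# Reward function + learning algorithm + environment = learned behavior:
learner = ValueLearner()
for _ in range(1000):
    state = random.randrange(10)  # trivial stand-in environment
    action = learner.act(state)
    learner.update(state, action, reward_fn(state, action))
print(learner.q[3])  # values for state 3 end up favoring action 3 % 2 == 1
```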
I am quite inclined to believe that the neural architecture and hyperparameter differences are underestimated as a result of Brodmann areas being a thing at all, i.e. I’m a supporter of the broad cortical uniformity argument but against the strict formulation that I feel is way too prescriptive given our current knowledge of the brain’s functions.
Does anyone believe in “the strict formulation”? I feel like maybe it’s a strawman. For example, here’s Jeff Hawkins: “Mountcastle’s proposal that there is a common cortical algorithm doesn’t mean there are no variations. He knew that. The issue is how much is common in all cortical regions, and how much is different. The evidence suggests that there is a huge amount of commonality.”
This is mostly related to the idea of open-ended training (e.g. like AlphaZero) and creating a game-like environment where it’s possible to train the agent…
I’m still not following this :(
Maybe you’re suggesting that when we’ve trained an AGI to have something like human values, we can make an environment where that first AGI can hang out with younger AGIs and show them the ropes. But once we’ve trained the first AGI, we’re done, right? And if the first AGI doesn’t quite have “human values”, it seems to me that the subsequent generations of AGIs would drift ever farther from “human values”, rather than converging towards “human values”, unless I’m missing something.
“Open-ended training” and “complex environments” seem to me like they would be important ingredients for capabilities but not particularly relevant to alignment.
(Also, Post #8, Section 8.3.3.1 is different but maybe slightly related.)
RE Alignment and Deep Learning:
If we’re talking about “Social-Instinct AGIs”, I guess we’re supposed to imagine that a toddler gets a lot of experience interacting with NPCs in its virtual environment, and the toddler gets negative reward for inhibiting the NPCs from accomplishing their goals, and positive reward for helping the NPCs accomplish their goals, or something like that. Then later on, the toddler interacts with humans, and it will know to be helpful right off the bat, or at least after less practice. Well, I guess that’s not crazy. I guess I would feel concerned that we wouldn’t do a good enough job programming the NPCs, such that the toddler-AGI learns weird lessons from interacting with them, lessons which don’t generalize to humans in the way we want.
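For concreteness, here is a toy sketch of the kind of reward scheme I mean; the NPC goal-progress field, the numbers, and the function names are all made-up placeholders:

```python
def npc_goal_progress(npc_state):
    """Placeholder: fraction of the NPC's goal completed, in [0, 1]."""
    return npc_state["progress"]

def social_reward(npc_states_before, npc_states_after):
    """Reward the toddler-agent for the net change in NPC goal progress:
    positive if it helped on net, negative if it obstructed."""
    delta = 0.0
    for before, after in zip(npc_states_before, npc_states_after):
        delta += npc_goal_progress(after) - npc_goal_progress(before)
    return delta

# Example: the agent helped one NPC and mildly hindered another.
before = [{"progress": 0.2}, {"progress": 0.5}]
after = [{"progress": 0.6}, {"progress": 0.45}]
print(social_reward(before, after))  # about 0.35: net-helpful, so positive reward
```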
If we’re talking about “Controlled AGIs”, I would just have the normal concern that the AGI would wind up with the wrong goal, and that the problem would manifest as soon as we go out of distribution. For example, the AGI will eventually get new possibilities in its action space that were not available during training, such as the possibility of wireheading itself, the possibility of wireheading the NPC, the possibility of hacking into AWS and self-reproducing, etc. All those possibilities might or might not be appealing (positive valence), depending on details of the AGI’s learned world-model and its history of credit assignment. To be clear, I’m making an argument that it doesn’t solve the whole problem, not an argument that it’s not even a helpful ingredient. Maybe it is, I dunno. I’ll talk about the out-of-distribution problem in Post #14 of the series.
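A tiny made-up example of what new options in the action space look like mechanically; none of these names or numbers come from the series, and the point is just that the value of a never-seen option is whatever the learned model happens to generalize to:

```python
# Actions available during training vs. at deployment.
train_actions = ["help_npc", "ignore_npc"]
deploy_actions = train_actions + ["wirehead_self", "copy_self_to_new_server"]

learned_value = {"help_npc": 1.0, "ignore_npc": 0.1}  # shaped by training rewards

def generalized_value(action):
    """Placeholder for whatever the world-model extrapolates for a never-seen
    option; training never got the chance to shape this number."""
    return 0.0

def estimated_value(action):
    return learned_value[action] if action in learned_value else generalized_value(action)

best = max(deploy_actions, key=estimated_value)
print(best)  # "help_npc" here, but only because of the arbitrary 0.0 placeholder above
```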
[...]Still, both the reward function and the RL algorithm are inputs into the adult’s jealousy-related behavior[...]
I probably just don’t know enough about jealousy networks to comment, but I’d be curious to see the research here (maybe even in an earlier post).
Does anyone believe in “the strict formulation”?
Hopefully not, but as I mentioned, it’s often a too-strict formulation imho.
[...]first AGI can hang out with younger AGIs[...]
More the reverse. And again, this is probably taking it farther than I would take this idea, but it would be pre-AGI training in an environment with symbolic “aligned” models, learning the ropes from them, being used as the “aligned” model in the next generation, and so on. IDA with a heavy RL twist, and scalable human oversight in the sense that humans would monitor rewards and environment states instead of providing feedback on every single action. Very flawed but possible.
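Concretely, the oversight interface I am picturing looks something like this toy sketch (placeholder code and thresholds only): humans review logged rewards and environment states per episode, rather than grading each individual action.

```python
def run_episode(agent_policy, env_step, steps=50):
    """Roll out one episode and log everything for later human review."""
    trace, state = [], 0
    for t in range(steps):
        action = agent_policy(state)
        state, reward = env_step(state, action)
        trace.append({"t": t, "state": state, "action": action, "reward": reward})
    return trace

def human_review(trace, reward_budget=40.0):
    """Batch oversight: a suspiciously high total reward flags the episode for
    closer human inspection instead of asking for a label on every action."""
    total = sum(step["reward"] for step in trace)
    return total <= reward_budget  # True means "looks fine, carry on"

# Toy policy and environment so the sketch actually runs.
trace = run_episode(lambda s: s % 2, lambda s, a: (s + 1, float(a)))
print("passes review:", human_review(trace))  # True for this toy rollout
```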
RE “RE Alignment and Deep Learning”:
Yeah, this is a lot of what the above proposal was also about.
[...] and the toddler gets negative reward for inhibiting the NPCs from accomplishing their goals, and positive reward for helping the NPCs accomplish their goals [...]
As far as I understand from the post, the reward comes only from understanding the reward function before interaction and not after, which is the controlling factor for obstructionist behaviour.
Agreed, and again more as an ingredient in the solution than an end in itself. Out-of-distribution management in biological neural networks is quite interesting, so I’m looking forward to that post!
I probably just don’t know enough about jealousy networks to comment, but I’d be curious to see the research here (maybe even in an earlier post).
I don’t think “the research here” exists. I’ll speculate a bit in the next post.
Does anyone believe in “the strict formulation”?
Hopefully not, but as I mentioned, it’s often a too-strict formulation imho.
Can you point to any particular person who believes in “a too-strict formulation” of cortical uniformity? Famous or not. What did they say? Just curious.
(Or maybe you’re talking about me?)
[...]an environment with symbolic “aligned” models[...]
Any thoughts on how to make those?
I think he’s thinking of something like NPCs made by behavior-cloning co-op MMO players. It won’t teach all of human values, but plausibly it would teach “the golden rule” and other positive-sum things.
(I don’t think that literal strategy works, but behavior-cloning elementary school team sports might get at a surprising fraction of “normal child cooperative behaviors”?)
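Something like the following toy sketch is what I mean by behavior cloning here; the observations, the majority-vote fit, and the fallback rule are all made-up stand-ins for a real imitation-learning pipeline:

```python
import random

# Pretend log of human co-op play: observation is (teammate_needs_help, has_resource),
# action is what the human did in that situation.
demos = [((1, 0), "assist"), ((1, 1), "share"), ((0, 1), "continue_own_task"),
         ((0, 0), "continue_own_task"), ((1, 1), "share"), ((1, 0), "assist")]

def fit_behavior_clone(demos):
    """'Training' here is a majority vote per observation, standing in for
    fitting a policy network with supervised learning."""
    counts = {}
    for obs, action in demos:
        counts.setdefault(obs, {}).setdefault(action, 0)
        counts[obs][action] += 1
    return {obs: max(acts, key=acts.get) for obs, acts in counts.items()}

def npc_policy(policy_table, obs):
    """Drive the NPC with the cloned policy; off-distribution observations fall
    back to a random known action, which is where the 'weird lessons' worry bites."""
    if obs in policy_table:
        return policy_table[obs]
    return random.choice(sorted(set(policy_table.values())))

policy = fit_behavior_clone(demos)
print(npc_policy(policy, (1, 0)))  # "assist"
```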