I had been thinking about two approaches:

1. We programmers figure out how to make a “caring drive” that works up through superintelligence, and put that code in the AI;
2. We programmers set up a training environment akin to evolution acting on agents in a social environment, and hope that some kind of “caring drive” emerges from that training setup.
And then I argued that 2 is a bad idea for various reasons listed in Section 8.3.3.1 here.
I prefer 1, supplemented by (A) sandbox testing (to the extent possible) and (B) an understanding of how our own code compares and contrasts with how human brains work (to the extent possible), thus allowing us to get nonzero insight out of our massive experience with human motivations.
By contrast, you’re advocating (IIUC) to start with 2, and then do mechanistic interpretability on the artifact that results, thus gaining insight about how a “caring drive” might work. And then the final AGI can be built using approach 1.
Is that right?
If so, I agree that this proposal would be an improvement over “just do 2 and call it a day”.
I’m still not too interested in that approach because:
- I think running a single sufficiently-intelligent organism for a single lifetime is analogous to “serious ML model training”, at least in the sense that it might take weeks or months or more of wall-clock time, and therefore running thousands (let alone millions) of serial generations would be impractical, even if it were a good idea in principle (see the rough arithmetic sketch after this list).
- Worse, once we can run a single sufficiently-intelligent organism for a single lifetime, superhuman AGI will already be here, or at least very nearly so, so the whole evolution simulation is pure “alignment tax” that will happen too late to help.
- I’m not even sure that one evolution simulation would be enough; we might need to iterate a bunch of times, and then years of work become decades of work, etc.
- (The above might be tied to my idiosyncratic opinions about how AGI is likely to work, involving model-based RL etc. rather than LLMs.)
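To make the wall-clock concern concrete, here is a rough arithmetic sketch; the specific numbers (weeks per simulated lifetime, count of serial generations) are illustrative assumptions, not figures from the discussion:

```python
# Back-of-the-envelope: how serial generations blow up wall-clock time.
# Both constants are illustrative assumptions, not real estimates.
WEEKS_PER_LIFETIME = 4        # one simulated lifetime ~ one serious training run
SERIAL_GENERATIONS = 10_000   # evolution-style selection runs generations sequentially

total_weeks = WEEKS_PER_LIFETIME * SERIAL_GENERATIONS
total_years = total_weeks / 52
print(f"{SERIAL_GENERATIONS:,} serial generations x {WEEKS_PER_LIFETIME} weeks each "
      f"~= {total_years:,.0f} years of wall-clock time")
# -> roughly 770 years, before any repeat runs of the whole experiment
```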
Different topic: I’m curious why you picked parenting-an-infant rather than helping-a-good-friend as your main example. I feel like parenting-an-infant in humans is a combination of pretty simple behaviors / preferences (e.g. wanting the baby to smile) which wouldn’t generalize well to superintelligence; plus a ton of learning parenting norms from one’s culture. Cross-cultural comparisons of parenting are pretty illuminating here. Directly targeting cultural learning (a.k.a. learning norms) would also be an interesting-to-me drive to figure out how it works.
Anyway, nice post, happy for you to be thinking about this stuff :)
Thank you for the detailed comment!

> By contrast, you’re advocating (IIUC) to start with 2, and then do mechanistic interpretability on the artifact that results, thus gaining insight about how a “caring drive” might work. And then the final AGI can be built using approach 1.
Yes, that’s exactly correct. I hadn’t thought about the point that once we can build a sufficiently smart agent with the caring drive, AGI is already too close. If any “interesting” caring drive requires capabilities very close to AGI, then I agree that this looks like a dead end in light of the race towards AGI. So it’s only viable if an “interesting” and “valuable” caring drive can potentially be found in agents at roughly the current level of capability, which honestly doesn’t sound totally improbable to me.

Also, without some global regulation to stop this damn race I expect everyone to die soon anyway, and since I’m not in a position to meaningfully impact that, I might as well keep working in directions that only pay off in worlds where we suddenly have more time.

And once we have something like this, I expect research to speed up a lot, thanks to the ability to precisely control and run experiments on artificial neural networks.
> I’m curious why you picked parenting-an-infant rather than helping-a-good-friend as your main example. I feel like parenting-an-infant in humans is a combination of pretty simple behaviors / preferences (e.g. wanting the baby to smile)
Several reasons:
1. I don’t think it’s just a couple of simple heuristics; otherwise I’d expect them to fail horribly in the modern world. And by “caring for the baby” I mean something like all the actions of the parents until the “baby” is ~25 years old. Those actions usually involve a lot of intricate decisions aimed at something like “success and happiness in the long run, even if it means some crying right now”. It’s hard to do right, and a lot of parents make mistakes, but in most cases that seems like a failure of capability, not of intentions. And those intentions look much more interesting to me than “make the baby smile”.
2. Although I agree that some people have a genuine intrinsic prosocial drive, I think there are also alternative egoistic “solutions”. A lot of prosocial behavior looks instrumentally beneficial even for a totally egoistic agent. The classic example is a repeated prisoner’s dilemma with an unknown number of trials: it would be foolish not to at least try to cooperate, even if you care only about your own utility (see the toy sketch after this list). The maternal caring drive, on the other hand, looks much less selfish, which I think is a good sign, since we shouldn’t expect ourselves to be of any instrumental value to a superhuman AI.
3. I think the maternal caring drive would be easier to recreate in some multi-agent environment. Unlike the maternal caring drive, prosocial behavior has a lot more prerequisites before it can arise: the ability to communicate, some form of benefit from being in a society/tribe/group (which usually comes from specialization), etc. (I haven’t thought about this too much, though.)
4. I agree with your Section 8.3.3.1, but I think the arguments there wouldn’t apply here so easily. Since the initial goal of this project is to recreate the “caring drive” (to have something to study, and then apply that knowledge to build it from scratch for the actual AGI), it’s not that critical to make some errors at this stage. I think it’s even desirable to observe some failure cases, in order to understand where the failures come from. The same should work for prosocial behavior, as long as it’s not a direct attempt to create an aligned AGI, just research into the workings of “goals”, “intentions”, and “drives”. But for the reasons above, I think the maternal drive could be a better candidate.
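To illustrate that prisoner’s-dilemma point, here is a minimal toy sketch (the payoffs are the standard textbook values, chosen purely for illustration): even a purely selfish agent scores better by cooperating against a tit-for-tat partner than by always defecting, when the number of rounds is unknown.

```python
import random

# Payoffs to the focal (selfish) player: (my_move, opponent_move) -> points
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def total_payoff(my_strategy, rounds):
    """Selfish player's score against a tit-for-tat opponent over `rounds` rounds."""
    score = 0
    opponent_move = "C"  # tit-for-tat opens by cooperating
    for _ in range(rounds):
        my_move = my_strategy(opponent_move)
        score += PAYOFF[(my_move, opponent_move)]
        opponent_move = my_move  # tit-for-tat echoes my previous move
    return score

always_defect = lambda opp: "D"
always_cooperate = lambda opp: "C"

rounds = random.randint(50, 150)  # "an unknown number of trials"
print("always defect:   ", total_payoff(always_defect, rounds))
print("always cooperate:", total_payoff(always_cooperate, rounds))
# With e.g. 100 rounds: always-defect earns 5 once and then 1 per round (~104),
# while mutual cooperation earns 3 per round (~300). Cooperation pays even for an egoist.
```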
I think the “How do children learn?” section of this post is relevant. I really think that you are ascribing things to innate human nature that are actually norms of our culture.

> And by “caring for the baby” I mean something like all the actions of the parents until the “baby” is ~25 years old. Those actions usually involve a lot of intricate decisions aimed at something like “success and happiness in the long run, even if it means some crying right now”. It’s hard to do right, and a lot of parents make mistakes, but in most cases that seems like a failure of capability, not of intentions. And those intentions look much more interesting to me than “make the baby smile”.
I think humans have a capacity to empathetically care about the well-being of another person, and that capacity might be more or less (or not-at-all) directed towards one’s children, depending on culture, age, circumstance, etc.
Other than culture and non-parenting-specific drives / behaviors, I think infant-care instincts are pretty simple things like “hearing a baby cry is mildly aversive (other things equal, although one can get used to it)” and “full breasts are kinda unpleasant and [successful] breastfeeding is a nice relief” and “it’s pleasant to look at cute happy babies” and “my own baby smells good” etc. I’m not sure why you would expect those to “fail horribly in the modern world”?
> Although I agree that some people have a genuine intrinsic prosocial drive, I think there are also alternative egoistic “solutions”.
If we’re talking about humans, there are both altruistic and self-centered reasons to cooperate with peers, and there are also both altruistic and self-centered reasons to want one’s children to be healthy / successful / high-status. (On the negative side, some cultures make whole families responsible for one person’s bad behavior, debt, blood-debt, etc.; on the positive side, a kid’s high status can reflect back on you, and some cultures expect capable children to support their younger siblings while still kids themselves, and to support their elderly relatives as adults, so you selfishly want your kid to be competent.) So I don’t immediately see the difference. Either way, you need to do extra tests to suss out whether the behavior is truly altruistic or not, e.g. change the power dynamics somehow in the simulation and see whether people start stabbing each other in the back.
This is especially true if we’re talking about 24-year-old “kids” as you mention above; they are fully capable of tactical cooperation with their parents and vice-versa.
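Concretely, a minimal version of that power-dynamics check might look something like the sketch below; the behavior logs are made-up placeholders, just to show the shape of the comparison:

```python
# Hypothetical probe: compare "helping" rates for the same trained agents under two
# conditions. If helping collapses once the partner can no longer retaliate or
# reciprocate, the behavior was instrumental rather than an intrinsic caring drive.
# These logs are placeholder data, purely for illustration.
helps_when_partner_has_power = [1, 1, 1, 0, 1, 1, 1, 1]    # partner can punish / reward
helps_when_partner_is_helpless = [0, 0, 1, 0, 0, 0, 1, 0]  # partner cannot reciprocate

def help_rate(log):
    return sum(log) / len(log)

drop = help_rate(helps_when_partner_has_power) - help_rate(helps_when_partner_is_helpless)
print(f"helping-rate drop when the partner loses power: {drop:.0%}")
# A large drop suggests tactical cooperation rather than genuine caring.
```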
In a simulation, if you want to set up a direct incentive to cooperate with peers, just follow the instructions in evolution of eusociality. But I feel like I’m losing track of what we’re talking about and why.