Our better other-human/animal modelling ability allows us to do better at infant wrangling than something stupider like a duck.
I agree, humans are indeed better at a lot of things, especially intelligence, but that’s not the whole reason why we care for our infants. Orthogonally to your “capability”, you need to have a “goal” for it. Otherwise you would probably just immediately abandon the gross-looking, screaming piece of flesh that fell out of you, for reasons unknown to you, while you were gathering food in the forest. Yet something inside makes you want to protect it, sometimes with your own life, and for the rest of your life if it works well.
Simulating an evolutionary environment filled with AI agents and hoping for caring-for-offspring strategies to win could work but it’s easier just to train the AI to show caring-like behaviors.
I want agents that take effective actions to care about their “babies”, which might not even look like caring at first glance. Something like keeping your “baby” in an enclosed kindergarten while guarding the only entrance against other agents. It would look as if the “mother” agent had abandoned its “baby”, but in reality it could be a very effective caring strategy. It’s hard to know the optimal strategy in every procedurally generated environment, so optimizing for some fixed set of actions called “caring-like behaviors” would probably indeed give you what you asked for, but I expect nothing “interesting” behind it.
Goal misgeneralisation is the problem that’s left. Humans can meet caring-for-small-creature desires using pets rather than actual babies.
Yes, they can, until they actually have a baby. After that, it’s usually really hard to sell a loving mother “deals” that involve the suffering of her child as the price, or to get her to abandon the child for a more “cute” toy, or to persuade her to hotwire herself to not care about her child (if she is smart enough to realize the consequences). And a carefully engineered system could potentially be even more robust than that.
Outside of “alignment by default” scenarios where capabilities improvements preserve the true intended spirit of a trained in drive, we’ve created a paperclip maximizer that kills us and replaces us with something outside the training distribution that fulfills its “care drive” utility function more efficiently.
Again, I’m not proposing the “one easy solution to the big problem”. I understand that training agents capable of RSI (recursive self-improvement) in this toy example would result in everyone being dead. But we simply can’t do that yet, and I don’t think we should. I’m just saying that there is this strange behavior in some animals that in many respects looks very similar to what we want from an aligned AGI, yet nobody understands how it works, and few people try to replicate it. It’s a step in that direction, not a fully functional blueprint for AI alignment.
TLDR: If you do some open-ended RL/evolutionary thing that finds novel strategies, it will get Goodharted horribly, and the novel strategies that succeed without gaming the goal may include things no human would want their caregiver AI to do.
Orthogonally to your “capability”, you need to have a “goal” for it.
Game-playing RL architectures like AlphaStar and OpenAI Five have dead simple reward functions (win the game); all the complexity is in the reinforcement learning tricks that allow efficient learning and credit assignment at higher layers.
So child-rearing motivation is plausibly rooted in a cuteness preference along with re-use of empathy. Empathy plausibly has a sliding scale of caring per person, which increases for friendships (reciprocal cooperation relationships) and for relatives, obviously including children. It similarly decreases for enemy combatants in wars, up to the point where they no longer qualify for empathy.
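To make “dead simple” concrete, here is a minimal sketch of what a bare “win the game” reward could look like. The function name, the `winner`/`agent` arguments and the +1/-1 values are illustrative assumptions, not the actual reward code of either system, which may well add shaping terms on top.

```python
from typing import Optional


def win_reward(winner: Optional[str], agent: str) -> float:
    """Hypothetical sparse terminal reward: nonzero only once a winner exists."""
    if winner is None:
        # Game still running (or drawn): no learning signal from the reward itself.
        return 0.0
    return 1.0 if winner == agent else -1.0


# Every intermediate step returns 0.0; only the final outcome carries information.
print(win_reward(None, "agent_a"), win_reward("agent_a", "agent_a"))
```

With a signal this sparse, the burden of working out which earlier actions deserved credit falls entirely on the learning machinery, which is the point being made above.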
I want agents that take effective actions to care about their “babies”, which might not even look like caring at first glance.
ASI will just flat out break your testing environment. Novel strategies discovered by dumb agents doing lots of exploration will be enough. Alternatively, the test is “survive in competitive deathmatch mode”, in which case you’re aiming for brutally efficient self-replicators.
The hope with a non-RL strategy, or with one of the many sorts of RL strategies used for fine-tuning, is that you can find the generalised core of what you want within the already trained model, and that the surrounding intelligence means the core generalises well. For example, Q&A fine-tuning an LLM in English generalises to other languages.
Also, some systems are architected in such a way that the caring is part of a value estimator, and the search process can be made better up until it starts Goodharting the value estimator and/or world model.
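A toy sketch of that last failure mode, under loud assumptions: `true_value`, `estimated_value`, the range on which the estimator is accurate, and the size of its error are all made up for illustration, and the “search” is just an exhaustive grid scan whose reach is set by `budget`.

```python
def true_value(x: float) -> float:
    """Hypothetical 'real' quality of care: best at x = 1, worse further away."""
    return -(x - 1.0) ** 2


def estimated_value(x: float) -> float:
    """Hypothetical learned estimator: exact on [0, 2], increasingly wrong beyond it."""
    error = 0.0 if x <= 2.0 else 1.5 * (x - 2.0) ** 2
    return true_value(x) + error


def search(budget: float, steps: int = 1000) -> float:
    """Return the candidate in [0, budget] that the estimator scores highest."""
    candidates = [budget * i / steps for i in range(steps + 1)]
    return max(candidates, key=estimated_value)


# A larger budget stands in for a stronger search process.
for budget in (0.5, 2.0, 4.0, 10.0):
    x = search(budget)
    print(f"budget={budget:>4}: picks x={x:.2f}, true value={true_value(x):.2f}")
```

With these made-up functions, widening the search from 0.5 to 2 improves the true value, while a budget of 10 lets the search reach the region where the estimator is wrong and pick the point it overrates most.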
Yes, they can, until they actually have a baby. After that, it’s usually really hard to sell a loving mother “deals” that involve the suffering of her child as the price, or to get her to abandon the child for a more “cute” toy, or to persuade her to hotwire herself to not care about her child (if she is smart enough to realize the consequences).
Yes, once the caregiver has imprinted, that’s sticky. Note that care-drive surrogates like pets can be just as sticky for their human caregivers. Pet organ transplants are a thing, and people will spend nearly arbitrary amounts of money caring for their animals.
But our current pets aren’t super-stimuli. Pets poop on the floor, scratch up furniture, and don’t fulfill certain other human wants. You can’t teach a dog to fish the way you can a child.
When this changes, real kids will be disappointing. Parents can have favorite children and those favorite children won’t be the human ones.
Superstimuli aren’t about changing your reward function but rather about discovering a better way to fulfill your existing reward function. For all that ice cream is cheating from a nutrition standpoint, it still tastes good and people eat it, no brain surgery required.
Also consider that humans optimise their pets (neutering/spaying) and children in ways that the pets and children do not want. I expect some of the novel strategies your AI discovers will be things we do not want.