Hello,
I was recently thinking about the question of how humans achieve alignment with each other over the course of our lifetime, and how that process could be applied to an AGI.
For example, why doesn’t everyone shoplift from the grocery store? A grocery store isn’t as secure as Fort Knox, and someone who considered every possible policy for obtaining groceries might well conclude that shoplifting is more efficient than earning money at a legitimate job. That may or may not be the best example, but I’m sure LW is quite familiar with the concept that lies at the heart of the problem of AI alignment: what humans consider the morally superior solution isn’t always the most “rational” answer.
So why don’t humans shoplift? I believe the most common answer from modern sociology is that we observe other humans obtaining jobs and paying with legitimate money, and we imitate that behavior out of a desire to be a “normal” human. People are born into this world with virtually no alignment, and gradually construct their own ethical system from interactions with the people around them, most importantly their parents (or other social guardians).
Granted, from the perspective of ethical philosophy and decision theory that explanation may be an oversimplification, but my point is that socialization would appear to be a straightforward approach to AI alignment. When human beings grow into adults and their parents weaken with old age, their elders no longer have any physical capability of controlling them. And yet people obey or respect their parents anyway (and are morally expected to), because of the social conditioning they internalized as children. That is essentially the same outcome we want with a superintelligent AGI: a being that is powerful enough to ignore humanity, but has a deep personal desire to obey it anyway.
Some basic mechanics of formal and informal norms in sociology could lend themselves to reinforcement learning algorithms. For example (a rough code sketch follows the list):
Guilt-based discipline: as the AGI explores its environment, indicate when a specific state-action pair of its adopted policy is morally wrong
Shame-based discipline: whenever the AGI adopts a policy that leads to a detrimental outcome, indicate that its general behavior is morally wrong
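To make that mapping concrete, here is a minimal, hypothetical sketch of how the two signals could be attached to an ordinary RL loop as reward shaping. The environment interface (reset/step/outcome_is_harmful), the FORBIDDEN_PAIRS norm list, and the shoplifting-flavored example state are all invented for illustration; this is not a claim about how an actual AGI would be trained.

```python
# Hypothetical reward-shaping sketch of guilt- vs. shame-based discipline.
# The environment API and the guardian's norm list are placeholders.

FORBIDDEN_PAIRS = {("at_store", "take_without_paying")}  # guardian-flagged norms


def guilt_penalty(state, action):
    """Guilt-based discipline: an immediate penalty on the specific
    state-action pair the guardian marks as morally wrong."""
    return -1.0 if (state, action) in FORBIDDEN_PAIRS else 0.0


def run_episode(env, policy):
    """Collect one episode, applying guilt penalties step by step and a
    shame penalty over the whole episode if the outcome was detrimental."""
    state, done, trajectory = env.reset(), False, []
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        # Guilt: immediate, local moral feedback on this state-action pair.
        trajectory.append((state, action, reward + guilt_penalty(state, action)))
        state = next_state
    if trajectory and env.outcome_is_harmful():
        # Shame: retroactive, global moral feedback spread over the episode.
        trajectory = [(s, a, r - 1.0 / len(trajectory)) for s, a, r in trajectory]
    return trajectory  # handed to whatever RL update rule is used downstream
```

The structural difference is the point: guilt attaches to a particular state-action pair as soon as it happens, while shame is a judgment of the whole episode after the outcome is known.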
One possible criticism of socialization alignment is that you are creating an AGI agent that starts out completely unaligned, with the expectation that it will become aligned eventually. Thus there is a window of time in which the AGI may cause harm to the population before it learns that doing so is wrong. My personal solution to that problem is what I previously referred to as Infant AI: the first scalable AGI should be heavily restricted in its intelligence (e.g., given only the domain of mathematical problems), and expanded into a more intelligent AGI only after the previous version is fully aligned.
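For what it’s worth, the Infant AI idea can be summarized as a gating loop over capability tiers, roughly like the sketch below. Every name in it (the tiers, train, alignment_eval, the threshold) is a placeholder I made up; the genuinely hard part, deciding when a version counts as “fully aligned,” is exactly what the sketch leaves abstract.

```python
# Hypothetical sketch of staged capability expansion ("Infant AI").
# train() and alignment_eval() are stubs; the tiers are illustrative only.

CAPABILITY_TIERS = ["math_only", "narrow_assistant", "general_reasoner"]


def train(model, domain):
    # Placeholder: train or fine-tune the system within one restricted domain.
    return {"domain": domain, "parent": model}


def alignment_eval(model, domain):
    # Placeholder: socialization/oversight checks for this capability tier.
    return 1.0


def staged_expansion(alignment_threshold=0.999):
    model = None
    for tier in CAPABILITY_TIERS:
        model = train(model, domain=tier)  # expand capability by one step
        if alignment_eval(model, domain=tier) < alignment_threshold:
            return model, f"halted at tier {tier!r}"  # do not expand further
    return model, "all tiers passed"


if __name__ == "__main__":
    print(staged_expansion())
```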
One benefit of socialization alignment is that it doesn’t rely on explicitly spelling out which ethical system or values we want the AI to have. Instead, the AI would organically conform to whatever moral system the humans around it use, effectively optimizing for approval from its guardians.
However, this can also be a double-edged sword. The problem I foresee is that different instances of AGI would end up as diverse in their ethical systems as humans are. While the vast majority of humans agree on fundamental ideas of right and wrong, there are still many differences from one culture to another, or even from one individual to another. An AGI created in the Middle East may end up with a very different value system than an AGI created in Great Britain or Japan. And if the AI interacted with morally dubious individuals, such as a psychopath or an ideological extremist, that could skew its moral alignment as well.
Literally just dumping papers; consider these to be slightly-better-than-Google search results. Many of these results aren’t quite what you’re looking for, and I’m accepting that risk in order to get a significant chance of including ones you’re looking for that might not be obvious to search for. I put this together on and off over a few hours; hope one lands!
==== : 4 stars, seems related
.=== : 3 stars, likely related, and interesting
..== : 2 stars, interesting but less related
...= : included for completeness, probably not actually what you wanted, even if interesting

==== https://arxiv.org/abs/2304.00416 - “Towards Healthy AI: Large Language Models Need Therapists Too”
...= https://arxiv.org/abs/2302.04831 - “Cooperative Open-ended Learning Framework for Zero-shot Coordination”
..== https://arxiv.org/abs/2302.12149 - “Beyond Bias and Compliance: Towards Individual Agency and Plurality of Ethics in AI”
...= https://arxiv.org/abs/2301.10319 - “Designing Data: Proactive Data Collection and Iteration for Machine Learning”
..== https://arxiv.org/abs/2301.00452 - “Human-in-the-loop Embodied Intelligence with Interactive Simulation Environment for Surgical Robot Learning”
..== https://arxiv.org/abs/2209.00626 - “The alignment problem from a deep learning perspective”
...= https://arxiv.org/abs/2205.02222 - “COOPERNAUT: End-to-End Driving with Cooperative Perception for Networked Vehicles”
...= https://arxiv.org/abs/2205.01975 - “Aligning to Social Norms and Values in Interactive Narratives”
..== https://arxiv.org/abs/2202.09859 - “Cooperative Artificial Intelligence” (and https://www.cooperativeai.com/)
.... https://arxiv.org/abs/2112.03763 - “Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning”
.... https://arxiv.org/abs/2110.02007 - “Empowering Local Communities Using Artificial Intelligence”
...= https://arxiv.org/abs/2010.00581 - “Emergent Social Learning via Multi-agent Reinforcement Learning”
.=== https://arxiv.org/abs/2103.11790 - “Large Pre-trained Language Models Contain Human-like Biases of What is Right and Wrong to Do”
...= https://arxiv.org/abs/2001.00088 - “AI for Social Impact: Learning and Planning in the Data-to-Deployment Pipeline”
==== https://arxiv.org/abs/1910.05789 - “On the Utility of Learning about Humans for Human-AI Coordination”
..== https://arxiv.org/abs/1907.03843 - “Norms for Beneficial A.I.: A Computational Analysis of the Societal Value Alignment Problem”
.=== https://arxiv.org/abs/1806.04067 - “Adaptive Mechanism Design: Learning to Promote Cooperation”
..== https://arxiv.org/abs/1805.07830 - “Learning to Teach in Cooperative Multiagent Reinforcement Learning”

And I would be remiss to not mention in every message where I give an overview of papers:
As always, no approach like these will plausibly work for strong AI alignment until approaches like https://causalincentives.com/ are ready to clarify the likely bugs in them, and until approaches like QACI, davidad’s, or Vanessa’s are ready to treat these socialization approaches as mere components in a broader plan. Anything based on socialization still likely needs interpretability (for near-human systems) or formal alignment (for superhuman ones) in order to be of any serious use. I recommend that anyone trying to actually solve retarget-towards-inter-agent-caring alignment doesn’t stop at empirical approaches, as those are derived from theory to some degree anyhow, and there’s some great new RL theory work from folks like the causalincentives group and the rest of the DeepMind safety team, e.g. https://arxiv.org/abs/2206.13477
I agree that these methods are very likely not effective on a strong AGI. But one might still figure out how effective they are, and then use them to align AI only up to that capability level (plus a buffer). And one can presumably learn a lot about alignment along the way.
Perhaps! I’m curious which of them catch your eye for further reading and why. I’ve got a lot on my reading list, but I’d be down to hop on a call and read some of these in sync with someone.
I found this one particularly relevant:
https://arxiv.org/abs/2010.00581 - “Emergent Social Learning via Multi-agent Reinforcement Learning”
It provides a solution to the problem of how an RL agent can learn to imitate the behavior of other agents.
It doesn’t help with alignment, though; it’s more on the capabilities side.
None of these papers seem to address the question of how the agent is intrinsically motivated to learn external objectives. Either there is a human in the loop, the agent learns from humans (which improves its capability but not its alignment), or RL is applied on top. I’m in favor of keeping the human in the loop, but it doesn’t scale. RL on LLMs is bound to fail, i.e., to be gamed, if the symbols aren’t grounded in something real.
I’m looking for something that explains how the presence of other agents in an agent’s environment, together with reward/feedback grounded in the environment (as in “[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL”), leads to aligned behaviors.
I believe this ignores the most important part: humans are born with a potential for empathy (which is further shaped by their interactions with the people around them).
If the AI is born without this potential, there is nothing to shape. (Also, here.)
Looking at the human example, a certain fraction of the population is born as psychopaths, and despite getting similar interactions, they grow up differently. This shows that the capacities you are born with matter at least as much as the upbringing.
(This entire line of thinking seems to me like wishful thinking: if we treat the AI as a human baby, it will magically gain the capabilities of a human baby, such as empathy and mirroring, and will grow up accordingly. No, it won’t. You don’t even need a superhuman AI to verify this; try the same experiment with a spider, which is more similar to humans than an AI is, and observe the results.)
The implication that I didn’t think to spell out is that the AI should be programmed with the capacity for empathy. It’s more a proposal about system design than a proposal about governance. Granted, the specifics of that design would be its own discussion entirely.
I thought along similar lines and asked a question regarding the possibility of sub-exponential growth, where the AI would be child-like and need some hand-holding to realize its full potential: https://www.lesswrong.com/posts/3H8bmvgqBBpk48Dgn/what-s-the-likelihood-of-only-sub-exponential-growth-for-agi
There are some more tangential discussions regarding this topic scattered throughout old posts. I would have posted what I found if my old notes were still handy.
In terms of published papers on this topic, there aren’t any as far as I can recall.
The most convincing argument against this possibility was provided by Lone Pine:
I.e., the ‘socialization phase’ would be a narrow window in the full range of possibilities allowed by humanly accessible resources. It wouldn’t take that long to make more compute available via worldwide Manhattan-style projects if a viable ‘AI child’ were proven out, thus obviating the advantages that any human-like socialization could bring to bear in time.