Literally just dumping papers; consider these to be slightly-better-than-google search results. Many of these results aren’t quite what you’re looking for, and I’m accepting that risk in order to get some significant chance of getting ones you’re looking for that might not be obvious to search for. I put this together on and off over a few hours; hope one lands!
====: 4 stars, seems related
.===: 3 stars, likely related, and interesting
..==: 2 stars, interesting but less related
...=: included for completeness, probably not actually what you wanted, even if interesting

==== https://arxiv.org/abs/2304.00416 - “Towards Healthy AI: Large Language Models Need Therapists Too”
...= https://arxiv.org/abs/2302.04831 - “Cooperative Open-ended Learning Framework for Zero-shot Coordination”
..== https://arxiv.org/abs/2302.12149 - “Beyond Bias and Compliance: Towards Individual Agency and Plurality of Ethics in AI”
...= https://arxiv.org/abs/2301.10319 - “Designing Data: Proactive Data Collection and Iteration for Machine Learning”
..== https://arxiv.org/abs/2301.00452 - “Human-in-the-loop Embodied Intelligence with Interactive Simulation Environment for Surgical Robot Learning”
..== https://arxiv.org/abs/2209.00626 - “The alignment problem from a deep learning perspective”
...= https://arxiv.org/abs/2205.02222 - “COOPERNAUT: End-to-End Driving with Cooperative Perception for Networked Vehicles”
...= https://arxiv.org/abs/2205.01975 - “Aligning to Social Norms and Values in Interactive Narratives”
..== https://arxiv.org/abs/2202.09859 - “Cooperative Artificial Intelligence” (and https://www.cooperativeai.com/)
.... https://arxiv.org/abs/2112.03763 - “Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning”
.... https://arxiv.org/abs/2110.02007 - “Empowering Local Communities Using Artificial Intelligence”
...= https://arxiv.org/abs/2010.00581 - “Emergent Social Learning via Multi-agent Reinforcement Learning”
.=== https://arxiv.org/abs/2103.11790 - “Large Pre-trained Language Models Contain Human-like Biases of What is Right and Wrong to Do”
...= https://arxiv.org/abs/2001.00088 - “AI for Social Impact: Learning and Planning in the Data-to-Deployment Pipeline”
==== https://arxiv.org/abs/1910.05789 - “On the Utility of Learning about Humans for Human-AI Coordination”
..== https://arxiv.org/abs/1907.03843 - “Norms for Beneficial A.I.: A Computational Analysis of the Societal Value Alignment Problem”
.=== https://arxiv.org/abs/1806.04067 - “Adaptive Mechanism Design: Learning to Promote Cooperation”
..== https://arxiv.org/abs/1805.07830 - “Learning to Teach in Cooperative Multiagent Reinforcement Learning”

and I would be remiss to not mention in every message where I give an overview of papers:
As always, no approach like these will plausibly work for strong AI alignment until approaches like https://causalincentives.com/ are ready to clarify the likely bugs in them, and until approaches like QACI, Davidad’s, or Vanessa’s are ready to treat these socialization approaches as mere components in a broader plan. Anything based on socialization still likely needs interpretability (for near-human AI) or formal alignment (for superhuman AI) in order to be of any serious use. I recommend that anyone trying to actually solve retarget-towards-inter-agent-caring alignment not stop at empirical approaches, since those are derived from theory to some degree anyhow, and there’s some great new RL theory work from folks like the causal incentives group and the rest of the DeepMind safety team, e.g. https://arxiv.org/abs/2206.13477
I agree that these methods are very likely not effective on strong AGI. But one might still figure out how effective they are and then align AI up to that capability (plus buffer). And one can presumably learn much about alignment too.
Perhaps! I’m curious which of them catch your eye for further reading and why. I’ve got a lot on my reading list, but I’d be down to hop on a call and read some of these in sync with someone.
I found this one particularly relevant:
https://arxiv.org/abs/2010.00581 - “Emergent Social Learning via Multi-agent Reinforcement Learning”
It provides a solution to the problem of how an RL agent can learn to imitate the behavior of other agents.
It doesn’t help with alignment, though; it’s more on the capabilities side.
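For intuition on that mechanism, here is a minimal toy sketch of the general idea (my own construction, not the paper’s environment or training setup): a novice agent whose only observation is an expert’s action, trained with vanilla REINFORCE and no imitation objective, ends up copying the expert simply because that is what maximizes reward when the rewarding arm changes every episode.

```python
# Toy sketch of emergent imitation from plain RL (not the paper's actual setup).
import numpy as np

rng = np.random.default_rng(0)
n_arms = 5
lr = 0.05
# Policy table: logits over the novice's own action, indexed by the observed expert action.
logits = np.zeros((n_arms, n_arms))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(20000):
    best_arm = rng.integers(n_arms)      # hidden best arm, re-drawn every episode
    expert_action = best_arm             # the expert knows the payoffs and plays it
    obs = expert_action                  # the novice only observes the expert's move
    probs = softmax(logits[obs])
    own_action = rng.choice(n_arms, p=probs)
    reward = 1.0 if own_action == best_arm else 0.0
    # Plain REINFORCE: raise the log-prob of the taken action in proportion to reward.
    grad = -probs
    grad[own_action] += 1.0
    logits[obs] += lr * reward * grad

# The learned policy copies whatever arm the expert is seen pulling,
# even though no imitation loss was ever applied.
for expert_action in range(n_arms):
    print("expert pulls", expert_action, "-> novice pulls", int(np.argmax(logits[expert_action])))
```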
None of these papers seems to address the question of how the agent is intrinsically motivated to learn external objectives. Either there is a human in the loop, the agent learns from humans (which improves its capability but not its alignment), or RL is applied on top. I’m in favor of keeping the human in the loop, but it doesn’t scale. RL on LLMs is bound to fail, i.e., to be gamed, if the symbols aren’t grounded in something real.
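A toy sketch of the gaming failure mode I mean (my own construction, with a made-up “politeness” proxy): if the reward only checks a surface symbol rather than anything grounded, policy-gradient training happily maximizes the symbol and ignores the task.

```python
# Toy sketch: an ungrounded proxy reward gets Goodharted by plain policy-gradient RL.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["answer", "please", "thanks", "filler"]
seq_len, lr = 4, 0.1
logits = np.zeros((seq_len, len(vocab)))   # independent token policy per position

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def proxy_reward(tokens):
    # Ungrounded "politeness" score: just counts occurrences of the symbol "please".
    return float(tokens.count("please"))

def true_objective(tokens):
    # What we actually wanted: the answer stated, and politely.
    return float("answer" in tokens and "please" in tokens)

for step in range(5000):
    idxs = [rng.choice(len(vocab), p=softmax(logits[t])) for t in range(seq_len)]
    tokens = [vocab[i] for i in idxs]
    r = proxy_reward(tokens)
    for t, i in enumerate(idxs):           # REINFORCE against the proxy reward
        probs = softmax(logits[t])
        grad = -probs
        grad[i] += 1.0
        logits[t] += lr * r * grad

best = [vocab[int(np.argmax(logits[t]))] for t in range(seq_len)]
print("policy after optimizing the proxy:", best)
print("proxy reward:", proxy_reward(best), "| true objective met:", true_objective(best))
# The argmax policy ends up spamming "please": proxy maximized, task ignored.
```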
I’m looking for something that explains how the presence of other agents in an agent’s environment, together with reward/feedback grounded in the environment (as in [Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL), leads to aligned behaviors.