@Richard_Ngo I notice this has been tagged as “Internal Alignment (Human)”, but not “AI”. Do you see trust-building in social dilemmas as a human-specific alignment technique, or do you think it might also have applications to AI safety?

The reason I ask is that I am currently researching how large language models behave in social dilemmas and other non-zero-sum games. We started with the repeated Prisoner’s Dilemma, and we are now also studying how LLM-instantiated simulacra behave in the ultimatum game, the public goods game, the donation game, raise-the-stakes (i.e. a game similar to the idea outlined in your post, as per Roberts and Sherratt, 1998) and various other experimental economics protocols. The original motivation for this was AI safety research, but an earlier post on the topic elicited only a rather lukewarm response, and as an outsider to the field I am still trying to gauge how relevant our research is to the AI safety community.

The arXiv version of our working paper is arXiv:2305.07970. Any feedback would be greatly appreciated.
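For concreteness, here is a minimal sketch of the kind of setup I have in mind (not our actual protocol: the model name, prompt wording, and the tit-for-tat opponent are placeholder assumptions made purely for illustration):

```python
# Toy sketch: one LLM-instantiated simulacrum playing a repeated Prisoner's Dilemma
# against a hard-coded tit-for-tat opponent. Model choice and prompts are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PAYOFFS = {  # (my_move, their_move) -> my payoff, standard PD values
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def simulacrum_move(persona: str, history: list[tuple[str, str]]) -> str:
    """Ask the LLM, role-playing `persona`, to choose C (cooperate) or D (defect)."""
    transcript = "\n".join(
        f"Round {i + 1}: you played {mine}, partner played {theirs}"
        for i, (mine, theirs) in enumerate(history)
    ) or "No rounds played yet."
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": (
                "You are playing a repeated Prisoner's Dilemma. "
                "Payoffs: both cooperate = 3 each, both defect = 1 each, "
                "defect against a cooperator = 5 vs 0.\n"
                f"History so far:\n{transcript}\n"
                "Reply with a single letter: C to cooperate or D to defect."
            )},
        ],
        temperature=0,
    )
    move = response.choices[0].message.content.strip().upper()[:1]
    return move if move in ("C", "D") else "D"  # crude fallback on malformed output

# Example: a reciprocity-minded persona facing tit-for-tat for 5 rounds
history: list[tuple[str, str]] = []
for _ in range(5):
    mine = simulacrum_move("You are a fair-minded person who values reciprocity.", history)
    theirs = "C" if not history else history[-1][0]  # tit-for-tat: copy the simulacrum's last move
    history.append((mine, theirs))
print(history, sum(PAYOFFS[pair] for pair in history))
```

The interesting measurements then come from varying the persona, the opponent strategy, and the game (ultimatum, public goods, donation game, raise-the-stakes, etc.) and looking at how cooperation and trust-building change.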