In order to get LLMs to tell the truth, can we set up a multi-agent training environment where there is only ever an incentive for them to tell the truth to each other? For example, an environment in which each agent has only partial information, with the full information needed to earn rewards.
The first issue that comes to mind is designing an incentive that actually achieves that. The one you suggest doesn’t incentivize truth; it incentivizes collaboration in order to guess the password. That would be fine in training, but then you’re heading into deceptive alignment land: Ajeya Cotra has a good story illustrating that.
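To make that concrete, here is a rough toy sketch of the kind of environment you’re describing (the names here, like SplitSecretEnv, are made up for illustration, not an existing setup): each agent observes half of a secret, and the reward only checks whether the assembled joint guess is correct. Nothing in the reward signal checks whether the exchanged messages are truthful, only whether they happen to be useful for this particular task, which is exactly the gap above.

```python
# Hypothetical toy sketch: a two-agent "split password" environment.
# Each agent sees only half of a secret; reward is paid only when the
# full secret is correctly assembled. Names are invented for illustration.

import random
from dataclasses import dataclass


@dataclass
class SplitSecretEnv:
    """Each agent observes half of a secret; both halves are needed for reward."""
    secret_length: int = 4

    def reset(self) -> tuple[str, str]:
        digits = "".join(random.choice("0123456789") for _ in range(self.secret_length))
        self._secret = digits
        half = self.secret_length // 2
        # Agent 0 sees the first half, agent 1 sees the second half.
        return digits[:half], digits[half:]

    def step(self, messages: tuple[str, str], joint_guess: str) -> float:
        # The messages are free-form text the agents exchanged; the environment
        # only scores the final joint guess, so nothing forces the messages
        # to be truthful -- they merely need to be useful for this task.
        return 1.0 if joint_guess == self._secret else 0.0


if __name__ == "__main__":
    env = SplitSecretEnv()
    obs_a, obs_b = env.reset()
    # An "honest" policy: each agent simply reports its half verbatim.
    msg_a, msg_b = obs_a, obs_b
    reward = env.step((msg_a, msg_b), msg_a + msg_b)
    print(reward)  # 1.0 when both halves are reported accurately
```

A dishonest but still-cooperative pair of policies (say, agents that exchange an encoded version of their halves) would earn exactly the same reward, which is the sense in which this setup rewards collaboration rather than truth-telling per se.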