In order to get LLMs to tell the truth, can we set up a multi-agent training environment where there is only ever an incentive for them to tell the truth to each other? For example, an environment in which each agent has only partial information, with the full information needed to earn rewards.
The first issue that comes to mind is designing an incentive that actually achieves that. The one you suggest doesn’t incentivize truth; it incentivizes collaboration in order to guess the password. That would be fine in training, but then you’re heading into deceptive alignment land: Ajeya Cotra has a good story illustrating that.
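To make that concrete, here is a rough toy sketch of the kind of environment you’re describing (the names here, like SplitSecretEnv, are made up for illustration, not an existing setup): each agent observes half of a secret, and the reward only checks whether the assembled joint guess is correct. Nothing in the reward signal checks whether the exchanged messages are truthful, only whether they happen to be useful for this particular task, which is exactly the gap above.

```python
# Hypothetical toy sketch: a two-agent "split password" environment.
# Each agent sees only half of a secret; reward is paid only when the
# full secret is correctly assembled. Names are invented for illustration.

import random
from dataclasses import dataclass


@dataclass
class SplitSecretEnv:
    """Each agent observes half of a secret; both halves are needed for reward."""
    secret_length: int = 4

    def reset(self) -> tuple[str, str]:
        digits = "".join(random.choice("0123456789") for _ in range(self.secret_length))
        self._secret = digits
        half = self.secret_length // 2
        # Agent 0 sees the first half, agent 1 sees the second half.
        return digits[:half], digits[half:]

    def step(self, messages: tuple[str, str], joint_guess: str) -> float:
        # The messages are free-form text the agents exchanged; the environment
        # only scores the final joint guess, so nothing forces the messages
        # to be truthful -- they merely need to be useful for this task.
        return 1.0 if joint_guess == self._secret else 0.0


if __name__ == "__main__":
    env = SplitSecretEnv()
    obs_a, obs_b = env.reset()
    # An "honest" policy: each agent simply reports its half verbatim.
    msg_a, msg_b = obs_a, obs_b
    reward = env.step((msg_a, msg_b), msg_a + msg_b)
    print(reward)  # 1.0 when both halves are reported accurately
```

A dishonest but still-cooperative pair of policies (say, agents that exchange an encoded version of their halves) would earn exactly the same reward, which is the sense in which this setup rewards collaboration rather than truth-telling per se.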