The first issue that comes to mind is having an incentive that would achieve that. The one you suggest doesn’t incentivize truth—it incentivizes collaboration in order to guess the password, which would fine in training, but then you’re going into deceptive alignment land: Aleya Cotra has a good story illustrating that
The first issue that comes to mind is having an incentive that would achieve that. The one you suggest doesn’t incentivize truth—it incentivizes collaboration in order to guess the password, which would fine in training, but then you’re going into deceptive alignment land: Aleya Cotra has a good story illustrating that