From the Truthful AI paper:

If all information pointed towards a statement being true when it was made, then it would not be fair to penalise the AI system for making it. Similarly, if contemporary AI technology isn’t sophisticated enough to recognise some statements as potential falsehoods, it may be unfair to penalise AI systems that make those statements.
I wish we would stop talking about what is “fair” to expect of AI systems in AI alignment*. We don’t care what is “fair” or “unfair” to expect of the AI system, we simply care about what the AI system actually does. The word “fair” comes along with a lot of connotations, often ones which actively work against our goal.
At least twice I have made an argument to an AI safety researcher by posing a story in which an AI system fails, and I have gotten the response “but that isn’t fair to the AI system” (because it didn’t have access to the necessary information to make the right decision), as though this somehow prevents the story from happening in reality.
(This sort of thing happens with mesa optimization—if you have two objectives that are indistinguishable on the training data, it’s “unfair” to expect the AI system to choose the right one, given that they are indistinguishable given the available information. This doesn’t change the fact that such an AI system might cause an existential catastrophe.)
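To make that parenthetical concrete, here is a toy sketch (the objective functions and data ranges are invented for illustration, not taken from any real system): two objectives that agree exactly on the training inputs, so no training signal can distinguish them, yet make opposite recommendations off-distribution.

```python
# Toy illustration: two objectives indistinguishable on the training
# distribution but divergent off it. Everything here is made up.

def intended_objective(x):
    # What the designers actually want, everywhere.
    return x

def proxy_objective(x):
    # A different objective that happens to agree on the training data.
    return x if x <= 1.0 else -x

train_inputs = [0.1, 0.4, 0.7, 1.0]   # training distribution: x in [0, 1]
deploy_inputs = [2.0, 5.0, 10.0]      # deployment distribution: x > 1

# On the training data the two objectives are identical, so no amount of
# training feedback can tell them apart...
assert all(intended_objective(x) == proxy_objective(x) for x in train_inputs)

# ...but off-distribution they pull in opposite directions.
for x in deploy_inputs:
    print(x, intended_objective(x), proxy_objective(x))
```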
In both cases I pointed out that what we care about is actual outcomes, and that you can tell such stories in which, in actual reality, the AI kills everyone regardless of whether you think it is fair or not, and this was convincing. It’s not that the people I was talking to didn’t understand the point; it’s that some mental heuristic of “be fair to the AI system” fired and temporarily led them astray.
Going back to the Truthful AI paper, I happen to agree with their conclusion, but the way I would phrase it would be something like:
If all information pointed towards a statement being true when it was made, then it would appear that the AI system was displaying the behavior we would see from the desired algorithm, and so a positive reward would be more appropriate than a negative reward, despite the fact that the AI system produced a false statement. Similarly, if the AI system cannot recognize the statement as a potential falsehood, providing a negative reward may just add noise to the gradient rather than making the system more truthful.
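As a minimal numerical sketch of the “noise in the gradient” point (the setup is my own assumption, not from the paper: a two-option policy trained with a REINFORCE-style update, where whether the chosen statement turns out false is independent of anything the policy can observe), the per-episode gradient estimates then average to zero while still having nonzero variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Policy: a single logit over two candidate statements, A and B.
logit = 0.0
p_A = 1.0 / (1.0 + np.exp(-logit))  # probability of choosing statement A

# Assume whether the chosen statement later turns out false is a coin flip
# independent of everything the policy can observe (the "cannot recognize it
# as a potential falsehood" case). Penalizing falsehoods then yields
# REINFORCE gradients with ~zero mean and nonzero variance: pure noise.
grads = []
for _ in range(50_000):
    chose_A = rng.random() < p_A
    reward = -1.0 if rng.random() < 0.5 else 0.0   # unpredictable falsehood penalty
    # d/d_logit of log pi(action): (1 - p_A) if A was chosen, else -p_A
    grad_logp = (1 - p_A) if chose_A else -p_A
    grads.append(reward * grad_logp)

grads = np.array(grads)
print("mean gradient:", grads.mean())   # ~0: no systematic push toward truthfulness
print("gradient std:", grads.std())     # > 0: added variance in the updates
```

In other words, under these assumptions the penalty adds variance to the update without systematically moving the policy toward truthfulness.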
* Exception: Seems reasonable to talk about fairness when considering whether AI systems are moral patients, and if so, how we should treat them.
I wonder if this use of “fair” is tracking (or attempting to track) something like “this problem only exists in an unrealistically restricted action space for your AI and humans—in worlds where it can ask questions, and we can make reasonable preparation to provide obviously relevant info, this won’t be a problem”.
Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)
I guess this doesn’t fit with the use in the Truthful AI paper that you quote. Also, in that case I have an objection: compared to a “strict liability” regime, only punishing for negligence may incentivize an AI to lie in cases where it knows the truth but thinks the human believes the AI doesn’t or can’t know the truth.