I wonder if this use of “fair” is tracking (or attempting to track) something like “this problem only exists in an unrealistically restricted action space for your AI and humans—in worlds where it can ask questions, and we can make reasonable preparation to provide obviously relevant info, this won’t be a problem”.
Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)
I guess this doesn’t fit with the use in the Truthful AI paper that you quote. Also, in that case I have an objection: punishing only for negligence (as opposed to a “strict liability” regime) may incentivize an AI to lie in cases where it knows the truth but believes the human thinks the AI doesn’t or can’t know it.