I wonder if this use of “fair” is tracking (or attempting to track) something like “this problem only exists in an unrealistically restricted action space for your AI and humans—in worlds where it can ask questions, and we can make reasonable preparation to provide obviously relevant info, this won’t be a problem”.
Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)
I guess this doesn’t fit with the use in the Truthful AI paper that you quote. Also, in that case I have an objection: punishing only for negligence (as opposed to a “strict liability” regime) may incentivize an AI to lie in cases where it knows the truth but believes the human thinks the AI doesn’t or can’t know it.