It feels to me like "have humans try to get to know the AIs really well by observing their behaviors, so that they're able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training" is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive.
Maybe I missed something here, but how is this supposed to help with deception? I thought the whole reason deceptive alignment is so hard to solve is that you can't tell from the AI's behavior whether it's being deceptive.