I’m pessimistic about the chances of this happening automatically, or requiring no expensive tradeoffs, because training networks to have good performance will disupt the human interpretability of intermediate representations, even if we think those representations look like text. But I do think it’s interesting, and that attempts to make auditing via intepretable intermediate representations work (and pay the costs) aren’t automatically doomed.
I’m pessimistic about the chances of this happening automatically, or requiring no expensive tradeoffs, because training networks to have good performance will disupt the human interpretability of intermediate representations, even if we think those representations look like text. But I do think it’s interesting, and that attempts to make auditing via intepretable intermediate representations work (and pay the costs) aren’t automatically doomed.