Charlie Steiner comments on The Translucent Thoughts Hypotheses and Their Implications

Charlie Steiner 12 Mar 2023 16:47 UTC
LW: 3 AF: 2
0
AF
I’m pessimistic about the chances of this happening automatically, or requiring no expensive tradeoffs, because training networks to have good performance will disupt the human interpretability of intermediate representations, even if we think those representations look like text. But I do think it’s interesting, and that attempts to make auditing via intepretable intermediate representations work (and pay the costs) aren’t automatically doomed.