Nora Belrose comments on High-level interpretability: detecting an AI’s objectives

Nora Belrose 5 Oct 2023 15:58 UTC
2 points
Using probes to measure mutual information
This is definitely wrong though, because of the data processing inequality. For any concept X, the raw input to the model always carries at least as much mutual information about X as any of its activations, since the activations are computed from the input.
What probing actually measures is some kind of “usable” information, see here.
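To make the data processing inequality concrete, here is a minimal sketch on a toy example. Everything in it is made up for illustration and is not from the post: the 3-bit uniform input, the XOR concept, and the two hypothetical "activation" functions. It computes exact mutual information from the full joint distribution, which is only feasible because the toy input space has eight points.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information in bits between the two coordinates of `pairs`,
    treating the list as the full (equally weighted) joint distribution."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )

# Hypothetical toy setup: the raw input Z is a uniformly random 3-bit tuple.
inputs = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

def concept(z):
    # Concept X: XOR of the first two input bits.
    return z[0] ^ z[1]

def lossy_activation(z):
    # Drops bit 1, so X can no longer be recovered.
    return (z[0], z[2])

def lossless_activation(z):
    # Keeps both bits that X depends on.
    return (z[0], z[1])

i_input = mutual_information([(concept(z), z) for z in inputs])
i_lossy = mutual_information([(concept(z), lossy_activation(z)) for z in inputs])
i_lossless = mutual_information([(concept(z), lossless_activation(z)) for z in inputs])

print(i_input, i_lossy, i_lossless)
```

In this toy case the input carries one full bit about X, the lossy activation carries none, and the lossless one ties with the input. No function of the input can exceed the input's own mutual information with X, so a probe that reads X more easily from an activation than from the raw input is measuring something other than mutual information.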
- Paul Colognese 5 Oct 2023 19:32 UTC
  2 points
  Thanks for pointing this out. I’ll look into it and modify the post accordingly.