Thank you for your comment, Zac. The links you suggest will help me check whether this kind of analysis has been tried. So far I've only seen studies aimed at interpreting specific neurons or areas of a model, not a statistical analysis of the whole model that can raise an alert when the model is using areas previously associated with negative behaviors.
See A Longlist of Theories of Impact for Interpretability (this seems similar to #4). Unfortunately I think interpretability is harder than you seem to think; on that see transformer-circuits.pub and this Mechanistic Interpretability Quickstart Guide.