Evan R. Murphy comments on A Longlist of Theories of Impact for Interpretability

Evan R. Murphy 30 Mar 2022 13:55 UTC
LW: 3 AF: 2
Ok, I think there’s a plausible success story for interpretability, though, in which transparency tools become broadly available: every major AI lab is equipped to use them and has incorporated them into its development processes.
I also think it’s plausible that either 1) one AI lab eventually gains a considerable lead over the others, so that they’d have time to iterate after their model fails an audit, or 2) if one lab communicated that their audits show a certain architecture or training approach keeps producing models that are clearly unsafe, then the other major labs would take that seriously.
This is why “auditing a trained model” still seems like a useful ability to me.
Update: Perhaps I was reading Rohin’s original comment as more critical of audits than he intended. I thought he was arguing that audits will be useless. But re-reading it, I see him saying that the conjunctiveness of the coordination story makes him “more excited” about interpretability for training, and that it’s “not an either-or”.
- Rohin Shah 31 Mar 2022 8:19 UTC
  LW: 3 AF: 3
  Yeah, I think I agree with all of that. Thanks for rereading my original comment and noticing a misunderstanding :)