It’s not clear what the ratio of capabilities to alignment progress is for interpretability. There is no empirical track record[^1] of interpretability feeding back into improvements of either kind.
A priori, interpretability seems like it should help alignment: understanding how a model works makes it easier to understand its behavior, and thus to tell whether it is aligned or how to make it more so. But understanding how things work is also useful for making them more capable; e.g. if you use interpretability as a model debugger, it’s essentially a general-purpose tool for working with ML models.
[^1]: Known to the author.