Re: the gorilla example, seems worth noting that the solution that was actually deployed ended up being refusing to classify anything as a gorilla, at least as of 2018 (perhaps things have changed since then).
I guess this proves the superiority of the mechanistic interpretability technique “note that it is mechanistically possible for your model to say that things are gorillas” :P
Re: the gorilla example, seems worth noting that the solution that was actually deployed ended up being refusing to classify anything as a gorilla, at least as of 2018 (perhaps things have changed since then).
I guess this proves the superiority of the mechanistic interpretability technique “note that it is mechanistically possible for your model to say that things are gorillas” :P