This was a fantastic idea and I am more interested in model interpretability for understanding these results than any I have seen in a while. In particular any examples of nontrivial mesa-optimizers we can find in the wild seem important to study, and maybe there’s one here.
This was a fantastic idea and I am more interested in model interpretability for understanding these results than any I have seen in a while. In particular any examples of nontrivial mesa-optimizers we can find in the wild seem important to study, and maybe there’s one here.