Thanks for the update! This helps me keep up with what you're doing.
Re the 1st-person problem: I vaguely recall seeing some paper or experiment where they used interpretability tools to identify some neurons corresponding to a learned model’s concept of something, and then they… did some surgery to replace it with a different concept? And it mostly worked? I don’t remember. Anyhow, it seems relevant. Maybe it’s not too hard to identify the self-concept.
https://arxiv.org/abs/2104.08696 ("Knowledge Neurons in Pretrained Transformers")?
Yes, that was it, thanks! Discussion here: https://www.lesswrong.com/posts/LdoKzGom7gPLqEZyQ/knowledge-neurons-in-pretrained-transformers
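For anyone skimming the thread, here's a rough sketch of the core move in that paper: attribute a factual prediction to individual FFN neurons, then "do surgery" by suppressing those neurons and checking whether the prediction changes. This is not the paper's exact method; it uses a cheap one-step activation-times-gradient attribution instead of their full integrated gradients, and the model name, layer index, top-k, prompt, and target token are all illustrative choices, not from the paper or this thread.

```python
# Sketch of the "knowledge neurons" idea (Dai et al., arXiv:2104.08696):
# find FFN neurons whose activations matter most for a factual prediction,
# zero them out, and watch the target logit drop.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

prompt = "Paris is the capital of [MASK]."  # illustrative fact
inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
target_id = tokenizer.convert_tokens_to_ids("France")

LAYER = 9  # which FFN layer to probe (arbitrary pick for the sketch)
ffn = model.bert.encoder.layer[LAYER].intermediate

# 1. Capture the FFN activations and their gradient w.r.t. the target logit.
acts = {}
def save_acts(module, inp, out):
    out.retain_grad()
    acts["a"] = out
handle = ffn.register_forward_hook(save_acts)

logits = model(**inputs).logits
logits[0, mask_pos, target_id].backward()
handle.remove()

# Crude one-step attribution: activation * gradient. The paper integrates
# gradients along a scaled-activation path; this is a cheap stand-in.
attr = (acts["a"] * acts["a"].grad)[0, mask_pos]
top = attr.topk(5).indices  # candidate "knowledge neurons"

# 2. "Surgery": zero the top neurons' activations and re-run the model.
def suppress(module, inp, out):
    out = out.clone()
    out[0, mask_pos, top] = 0.0
    return out
handle = ffn.register_forward_hook(suppress)

with torch.no_grad():
    edited = model(**inputs).logits
handle.remove()

before = logits[0, mask_pos, target_id].item()
after = edited[0, mask_pos, target_id].item()
print(f"logit for 'France': {before:.3f} -> {after:.3f}")
```

The paper goes further than suppression: they also amplify the neurons and swap in the value vectors associated with a different entity to overwrite the fact, which is the "replace it with a different concept" part. Doing the same for a model's self-concept would presumably start with the same attribution step, just with prompts that elicit self-reference.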