Thanks for the update! This helps me keep up with what you're doing.
Re the 1st-person problem: I vaguely recall seeing some paper or experiment where they used interpretability tools to identify some neurons corresponding to a learned model’s concept of something, and then they… did some surgery to replace it with a different concept? And it mostly worked? I don’t remember. Anyhow, it seems relevant. Maybe it’s not too hard to identify the self-concept.
https://arxiv.org/abs/2104.08696 ("Knowledge Neurons in Pretrained Transformers")?
Yes, that was it, thanks! Discussion here: https://www.lesswrong.com/posts/LdoKzGom7gPLqEZyQ/knowledge-neurons-in-pretrained-transformers
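For anyone skimming the thread, here's a rough sketch of the core move in that paper: attribute a factual prediction to individual FFN neurons, then "do surgery" by suppressing those neurons and checking whether the prediction changes. This is not the paper's exact method; it uses a cheap one-step activation-times-gradient attribution instead of their full integrated gradients, and the model name, layer index, top-k, prompt, and target token are all illustrative choices, not from the paper or this thread.

```python
# Sketch of the "knowledge neurons" idea (Dai et al., arXiv:2104.08696):
# find FFN neurons whose activations matter most for a factual prediction,
# zero them out, and watch the target logit drop.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

prompt = "Paris is the capital of [MASK]."  # illustrative fact
inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
target_id = tokenizer.convert_tokens_to_ids("France")

LAYER = 9  # which FFN layer to probe (arbitrary pick for the sketch)
ffn = model.bert.encoder.layer[LAYER].intermediate

# 1. Capture the FFN activations and their gradient w.r.t. the target logit.
acts = {}
def save_acts(module, inp, out):
    out.retain_grad()
    acts["a"] = out
handle = ffn.register_forward_hook(save_acts)

logits = model(**inputs).logits
logits[0, mask_pos, target_id].backward()
handle.remove()

# Crude one-step attribution: activation * gradient. The paper integrates
# gradients along a scaled-activation path; this is a cheap stand-in.
attr = (acts["a"] * acts["a"].grad)[0, mask_pos]
top = attr.topk(5).indices  # candidate "knowledge neurons"

# 2. "Surgery": zero the top neurons' activations and re-run the model.
def suppress(module, inp, out):
    out = out.clone()
    out[0, mask_pos, top] = 0.0
    return out
handle = ffn.register_forward_hook(suppress)

with torch.no_grad():
    edited = model(**inputs).logits
handle.remove()

before = logits[0, mask_pos, target_id].item()
after = edited[0, mask_pos, target_id].item()
print(f"logit for 'France': {before:.3f} -> {after:.3f}")
```

The paper goes further than suppression: they also amplify the neurons and swap in the value vectors associated with a different entity to overwrite the fact, which is the "replace it with a different concept" part. Doing the same for a model's self-concept would presumably start with the same attribution step, just with prompts that elicit self-reference.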