I do wonder if vision problems are unusually tractable here; would it be so easy to visualise what individual neurons mean in a language model?
We actually released our first paper trying to extend Circuits from vision to language models yesterday! You can’t quite interpret individual neurons, but we’ve found some examples of where we can interpret what an individual attention head is doing.
I would be happy to see you write a top-level post about this paper. :)
Thanks! I’m probably not going to have time to write a top-level post myself, but I liked Evan Hubinger’s post about it.