Can you elaborate? I don’t really follow; this seems like a pretty niche concern to me that depends on some strong assumptions and ignores the major positive benefits of interpretability to alignment. If I understand correctly, your concern is that if AIs can know what other AIs will do, this makes inter-AI coordination easier, which makes a human takeover easier? And that dangerous AIs will not be capable of doing this interpretability on other AIs themselves, but will need to build on human research into mechanistic interpretability? And that mechanistic interpretability is not going to be useful for ensuring AIs want to establish solidarity with humans, noticing collusion, etc., such that its effect of helping AIs coordinate dominates over any safety benefits?
I don’t know, I just don’t buy that chain of reasoning.
All correct claims about my viewpoint. I’ll DM you another detail.