I’d suggest reading https://acritch.com/osgt-is-weird/ at your earliest possible convenience; I’m quite worried about AIs doing OSGT to each other as a way to establish AI-only solidarity against humans. If AIs aren’t interested in establishing solidarity with humans, mechinterp is nothing but dangerous.
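To make the mechanism concrete, here's a toy sketch of the kind of open-source cooperation the linked post is about (the agent names and the Python framing are my own illustration, not from the post): agents that can read each other's source can condition cooperation on what they find there, a form of mutual verification that humans can't straightforwardly offer.

```python
import inspect

def clique_bot(my_fn, other_fn) -> str:
    """Cooperate iff the counterpart runs exactly the same source code as me."""
    return "C" if inspect.getsource(other_fn) == inspect.getsource(my_fn) else "D"

def defect_bot(my_fn, other_fn) -> str:
    """Ignores the other agent's source and always defects."""
    return "D"

def play(a, b):
    """One-shot game: each agent reads both sources and returns 'C' or 'D'."""
    return a(a, b), b(b, a)

# Run as a script (inspect.getsource needs the defining file on disk).
print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual source transparency -> cooperation
print(play(clique_bot, defect_bot))  # ('D', 'D'): no matching source, no cooperation
```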
Can you elaborate? I don’t really follow; this seems like a pretty niche concern to me that depends on some strong assumptions, and ignores the major positive benefits of interpretability for alignment. If I understand correctly, your concern is that if AIs can know what other AIs will do, this makes inter-AI coordination easier, which makes a takeover by AIs easier? And that dangerous AIs will not be capable of doing this interpretability on AIs themselves, but will need to build on human mechanistic interpretability research? And that mechanistic interpretability is not going to be useful for ensuring AIs want to establish solidarity with humans, noticing collusion, etc., such that its effect of helping AIs coordinate dominates over any safety benefits?
I don’t know, I just don’t buy that chain of reasoning.
All correct claims about my viewpoint. I’ll DM you another detail.