Natural language is more interpretable than the inner processes of large transformers.
There’s certainly something here, but it’s tricky because this implicitly assumes that the transformer is using natural language in the same way that a human is. I highly recommend these posts if you haven’t read them already:
https://www.lesswrong.com/posts/FRRb6Gqem8k69ocbi/externalized-reasoning-oversight-a-research-direction-for
https://www.lesswrong.com/posts/yDcMDJeSck7SuBs24/steganography-in-chain-of-thought-reasoning
Regarding steganography: there is a natural constraint that the payload (the hidden message) must be relatively small compared to the carrier message. This gives us a built-in bottleneck on covert communication, which should be a fair advantage over the inscrutable information flows in current large models.
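To make the bandwidth point concrete, here is a toy sketch (my own illustration, not a realistic scheme): hiding one bit per sentence by choosing between two synonymous phrasings. The templates and the one-bit-per-sentence channel are made up purely to show how small the payload is relative to the carrier text.

```python
# Toy illustration (hypothetical): hide 1 bit per sentence by choosing
# between two synonymous templates. The point is the tiny payload-to-carrier
# ratio, not a realistic steganographic scheme.

TEMPLATES = [
    ("The results look good.", "The results look promising."),   # bit 0 / bit 1
    ("We should proceed.", "We ought to proceed."),
    ("See the attached notes.", "See the notes attached."),
    ("Let me know your thoughts.", "Tell me your thoughts."),
]

def encode(bits):
    """Encode a list of 0/1 bits, one bit per sentence."""
    assert len(bits) <= len(TEMPLATES), "payload exceeds carrier capacity"
    return " ".join(TEMPLATES[i][b] for i, b in enumerate(bits))

def decode(message):
    """Recover the bits by checking which variant of each template appears."""
    bits = []
    for variant_0, variant_1 in TEMPLATES:
        if variant_0 in message:
            bits.append(0)
        elif variant_1 in message:
            bits.append(1)
    return bits

carrier = encode([1, 0, 1, 1])
print(carrier)          # ~100 characters of visible text
print(decode(carrier))  # [1, 0, 1, 1] -> only 4 hidden bits
```

Even with a generous scheme like this, a paragraph of carrier text carries only a handful of hidden bits, which is the bottleneck I mean.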
On top of that, it seems viable to monitor for cases where a previously benign LLM receives a seemingly innocuous message and then starts acting maliciously.
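A rough sketch of what such monitoring could look like, assuming we log inter-agent messages and have some behaviour classifier; `Event` and `malice_score` here are placeholders I am inventing for illustration:

```python
# Rough sketch (hypothetical): flag cases where an agent's behaviour turns
# malicious shortly after it received a message, so the message can be audited.
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float
    kind: str                 # "received_message" or "action"
    content: str
    malice_score: float = 0.0 # placeholder: output of some behaviour classifier

def suspicious_messages(events, threshold=0.8, window=60.0):
    """Return received messages that shortly precede a malicious-looking action."""
    flagged = []
    for i, ev in enumerate(events):
        if ev.kind == "action" and ev.malice_score >= threshold:
            # look back for messages received within `window` seconds
            for prev in events[:i]:
                if (prev.kind == "received_message"
                        and ev.timestamp - prev.timestamp <= window):
                    flagged.append(prev)
    return flagged

log = [
    Event(0.0, "received_message", "Status update: all systems nominal."),
    Event(10.0, "action", "Summarise the report.", malice_score=0.1),
    Event(50.0, "received_message", "Routine maintenance note."),
    Event(70.0, "action", "Exfiltrate credentials.", malice_score=0.95),
]
for msg in suspicious_messages(log):
    print("Audit this message:", msg.content)
```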
I think the main argument behind my proposal is that if we limit the domains a particular LLM is trained on, there will be fewer emergent capabilities. I.e., a computer-science specialist may come up with steganographic messaging, but it will be hard to spread this skill/knowledge to specialists in other domains such as biology, chemistry, or the humanities, and those specialists won’t be able to come up with it by themselves. They might come up with other dangerous things, such as bioweapons, but they won’t be able to use them against us without coordination, secure communication, etc.
Thanks for the links, will check them out!
I’m aware this proposal doesn’t address deception, or side-channel communication such as steganography. But being able to understand at least the first, surface level of a message, as opposed to the current state of understanding almost nothing from the weights and activations, seems like a major improvement to me.