Some more suggestions of things to look for:
Toxicity: what triggers when the model is swearing, being rude or insulting, etc.?
Basic emotions: love, anger, fear, etc. In particular, when those emotions are actually being felt by whatever persona the LLM is currently outputting tokens for, as opposed to just being discussed.
For love: subvariants like parental, platonic, erotic.
Criminality, ‘being a villain’, psychopathy, antisocial behavior: are there circuits that light up when the LLM is emitting tokens for a ‘bad guy’? How about for an angel/saint/wise person/particularly ethical person?
Being a helpful assistant. We often system-prompt this, but it's possible for a long generation to drift away from this behavior, or for the rest of the prompt to override it: can we identify when this has happened?
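For any of these, one simple first pass (before hunting for actual circuits) would be a linear probe on hidden states: collect activations from generations in and out of the target persona/behavior, and fit a classifier whose per-token score flags when the model has drifted. A minimal sketch below, with synthetic stand-ins for the activations — the dimensionality, the sampling helper, and the separability of the two clusters are all assumptions for illustration, not real model data:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # hypothetical hidden-state dimensionality

# Synthetic stand-ins for residual-stream activations: "on-persona"
# states are shifted along one direction, "drifted" states the other way.
direction = rng.normal(size=D)
direction /= np.linalg.norm(direction)

def sample_states(n, on_persona):
    base = rng.normal(size=(n, D))
    shift = 2.0 if on_persona else -2.0
    return base + shift * direction

X = np.vstack([sample_states(200, True), sample_states(200, False)])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Fit a logistic-regression probe with plain gradient descent.
w, b, lr = np.zeros(D), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

# At inference time the probe scores each token's hidden state;
# a sustained drop below 0.5 would flag drift out of the persona.
scores = 1.0 / (1.0 + np.exp(-(X @ w + b)))
accuracy = ((scores > 0.5).astype(float) == y).mean()
print(f"probe accuracy on synthetic data: {accuracy:.2f}")
```

A probe like this is cheap enough to run per-token during generation, so it could double as a live "persona drift" monitor even if the underlying circuit is never found.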
I’d love to get involved; I’ll hit you up on the Discord channel you mention.