Some more suggestions of things to look for:
Toxicity: what triggers when the model is swearing, being rude or insulting, etc.?
Basic emotions: love, anger, fear, etc. In particular, when those emotions are actually being felt by whatever persona the LLM is currently outputting tokens for, as opposed to just being discussed.
For love: subvariants like parental, platonic, erotic.
Criminality, ‘being a villain’, psychopathy, antisocial behavior: are there circuits that light up when the LLM is emitting tokens for a ‘bad guy’? How about for an angel/saint/wise person/particularly ethical person?
Being a helpful assistant. We often system-prompt this, but it's possible for a long generation to drift away from this behavior, or for the rest of the prompt to override it: can we identify when this has happened?
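For any of these, one simple first pass (before hunting for actual circuits) would be a linear probe on hidden states: collect activations from generations in and out of the target persona/behavior, and fit a classifier whose per-token score flags when the model has drifted. A minimal sketch below, with synthetic stand-ins for the activations — the dimensionality, the sampling helper, and the separability of the two clusters are all assumptions for illustration, not real model data:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # hypothetical hidden-state dimensionality

# Synthetic stand-ins for residual-stream activations: "on-persona"
# states are shifted along one direction, "drifted" states the other way.
direction = rng.normal(size=D)
direction /= np.linalg.norm(direction)

def sample_states(n, on_persona):
    base = rng.normal(size=(n, D))
    shift = 2.0 if on_persona else -2.0
    return base + shift * direction

X = np.vstack([sample_states(200, True), sample_states(200, False)])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Fit a logistic-regression probe with plain gradient descent.
w, b, lr = np.zeros(D), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

# At inference time the probe scores each token's hidden state;
# a sustained drop below 0.5 would flag drift out of the persona.
scores = 1.0 / (1.0 + np.exp(-(X @ w + b)))
accuracy = ((scores > 0.5).astype(float) == y).mean()
print(f"probe accuracy on synthetic data: {accuracy:.2f}")
```

A probe like this is cheap enough to run per-token during generation, so it could double as a live "persona drift" monitor even if the underlying circuit is never found.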
I’d love to get involved; I’ll hit you up on the Discord channel you mention.