For example, we would like LLMs not to be dishonest or manipulative.
Ideally, this would happen without them losing their understanding of what dishonesty and manipulation are, or their ability to notice when a human is being dishonest or manipulative (e.g., remaining suspicious of the entire class of “dead grandmother” jailbreaks).