For example, we would like LLMs not to be dishonest or manipulative.
Ideally, this would happen without them losing their understanding of what dishonesty and manipulation are, or their ability to notice when a human is being dishonest or manipulative (e.g., remaining suspicious of the entire class of “dead grandmother” jailbreaks).