Angie Normandale comments on Inducing Unprompted Misalignment in LLMs

Angie Normandale 21 Apr 2024 8:56 UTC
5 points
Great paper! Important findings.

What’s your intuition about ways to detect and control such behaviour?

An interesting extension would be to train a model on a large dataset that includes a low but consistent proportion of primed data. Do the harmful behaviours persist and generalise? If so, this could be used to exploit existing ‘aligned’ models that update on publicly modifiable datasets.