This is a beautiful idea. Previous unlearning methods are like grabbing an eraser and trying to rub out a password you wrote down. If you don't erase it thoroughly enough, the faint letters may still be legible. Belief modification is like picking up the pencil and scribbling random letters over the password in addition to erasing it!
In addition to honeypots, unlearning might also be used to create models that are less capable but more trusted. For a given powerful AI, we might create a whole spectrum of copies of the same AI with progressively weaker social skills and strategic ability.
These socially/strategically weaker copies might be useful for amplified oversight (e.g. Redwood’s untrusted monitoring idea).