This is a beautiful idea. Previous unlearning methods are like grabbing an eraser and trying to rub out a password you wrote down. If you don't erase it thoroughly enough, the faint letters may still be legible. Belief modification is like picking up the pencil and scribbling random letters over the password in addition to erasing it!
In addition to honeypots, unlearning might also be used to create models that are less capable but more trusted. For a given powerful AI, we might create a whole spectrum of copies of the same AI with progressively weaker social skills and strategic ability.
These socially/strategically weaker copies might be useful for amplified oversight (e.g. Redwood’s untrusted monitoring idea).