Besides the particular thing you mention here, it seems like there is significant low-hanging fruit in better understanding how LLMs will deal with censorship and other weird modifications to generation.
What happens when you iteratively finetune on censored text? Do models forget the censored behavior?
How much are capabilities hurt by various types of censorship — heck, how does benchmark performance (especially CoT) change when you change sampling temperature?
For the example you gave where the model may find the solution of “ten” instead of 10, how effectively do models find workarounds like this in an RL setting — can they bootstrap around censorship from just a handful of successes?
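
Concretely, the kind of toy setup I'm imagining is a crude rejection-sampling loop: sample with the digit form blacklisted, reward completions that get the answer across some other way, and finetune on the successes. A minimal sketch, with the model, prompt, grader, and hyperparameters all placeholders:

```python
# Toy sketch of the bootstrapping question above, assuming censorship means
# masking a token blacklist at sampling time. Expert-iteration style: sample
# with "10" blocked, reward completions that express the answer anyway
# (e.g. "ten"), and finetune on the successes. All specifics are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Question: what is five plus five? Answer:"
# Censor only the digit forms; the spelled-out workaround stays available.
bad_words_ids = [tokenizer(w, add_special_tokens=False).input_ids
                 for w in ["10", " 10"]]

def reward(completion: str) -> float:
    # Hypothetical grader: the workaround "ten" counts as a success.
    return 1.0 if "ten" in completion.lower() else 0.0

for step in range(50):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=8, do_sample=True,
                         bad_words_ids=bad_words_ids,
                         pad_token_id=tokenizer.eos_token_id)
    completion = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    if reward(completion) > 0:
        # Reinforce the successful workaround with a plain LM loss.
        batch = tokenizer(prompt + completion, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```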
it seems like there is significant low-hanging fruit in better understanding how LLMs will deal with censorship
Yup, agree—the censorship method I proposed in this post is maximally crude and simple, but I think it’s very possible that the broader category of “ways to keep your AI from thinking destabilizing thoughts” will become an important part of the alignment/control toolbox.
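
For concreteness, here's roughly the level of crudeness I have in mind; a minimal sketch, assuming the censorship is nothing more than masking a blacklist of token sequences at sampling time (the model and blacklist below are placeholders):

```python
# Minimal sketch of "maximally crude and simple" censorship, under the
# assumption that it just means zeroing out a blacklist of token sequences
# at sampling time. Model and blacklist are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Blacklist the digit forms; note the model can still route around this
# with "ten", "1e1", Roman numerals, etc.
bad_words_ids = [tokenizer(w, add_special_tokens=False).input_ids
                 for w in ["10", " 10"]]

inputs = tokenizer("Five plus five equals", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=True,
                     bad_words_ids=bad_words_ids,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```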
What happens when you iteratively finetune on censored text? Do models forget the censored behavior?
I guess this would effectively be doing the Harry Potter Unlearning method, provided you put in the work to come up with a good enough blacklist that the remaining completions are suitably generic predictions. I’d be quite interested in looking into how long-horizon capabilities interact with unlearning. For pretty much the same reason I’m worried long-horizon competent LLMs will be able to work around naive censorship, I’m also worried they’ll work around naive unlearning.
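
To make the censor-then-finetune loop from the quote concrete, here's a rough sketch under the same sampling-time-blacklist assumption as above. It's untested, the corpus and blacklist are placeholders, and a real version would need a much better blacklist for the remaining completions to be suitably generic:

```python
# Rough sketch of "iteratively finetune on censored text": sample with the
# blacklist masked, finetune on those censored samples, repeat. All
# specifics (model, prompts, blacklist, hyperparameters) are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

bad_words_ids = [tokenizer(w, add_special_tokens=False).input_ids
                 for w in ["10", " 10", " ten"]]
prompts = ["The answer to five plus five is"]  # toy stand-in corpus

for round_idx in range(3):  # a few censor-then-finetune rounds
    # 1. Sample text with the blacklist masked out.
    model.eval()
    censored_texts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=30, do_sample=True,
                             bad_words_ids=bad_words_ids,
                             pad_token_id=tokenizer.eos_token_id)
        censored_texts.append(tokenizer.decode(out[0], skip_special_tokens=True))

    # 2. Finetune on the censored samples with a plain LM loss.
    model.train()
    for text in censored_texts:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# After enough rounds, probe whether the blacklisted behavior has actually
# been forgotten, or whether it survives and just gets paraphrased.
```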