Yeah, you could even block the entire direction in activation space corresponding to the embedding of the <|bad|> token.
Sounds like a good approach. How do you go about doing this?
I don’t remember where I saw that, but something as dumb as subtracting the embedding of <|bad|> might even work sometimes.
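To make the two ideas in this thread concrete, here is a rough sketch in plain Python. It is not taken from any particular paper or library. The vectors are random stand-ins for a real `<|bad|>` embedding and a real activation, and the names `project_out`, `subtract_embedding`, and `alpha` are made up for illustration. In a real model you would presumably apply something like this to the activations at one or more layers.

```python
import random


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(activation, direction):
    """Block a direction: remove the component of `activation` along `direction`."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        return list(activation)  # nothing to remove for a zero direction
    scale = dot(activation, direction) / norm_sq
    return [a - scale * d for a, d in zip(activation, direction)]


def subtract_embedding(activation, embedding, alpha=1.0):
    """The 'dumb' version: subtract a (scaled) copy of the embedding."""
    return [a - alpha * e for a, e in zip(activation, embedding)]


# Random stand-ins (assumption): a real setup would read the <|bad|> embedding
# from the model's embedding matrix and the activation from a forward pass.
random.seed(0)
d_model = 8
bad_embedding = [random.gauss(0, 1) for _ in range(d_model)]
activation = [random.gauss(0, 1) for _ in range(d_model)]

blocked = project_out(activation, bad_embedding)
shifted = subtract_embedding(activation, bad_embedding)

# After projection the activation has (numerically) no component along <|bad|>.
print(abs(dot(blocked, bad_embedding)) < 1e-9)
```

The difference between the two: projection zeroes out the component along the direction no matter how large it was, while subtraction shifts every activation by the same fixed amount, so it can overshoot or undershoot depending on the input.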