Dave Orr comments on Avoiding jailbreaks by discouraging their representation in activation space

Dave Orr 28 Sep 2024 3:32 UTC
2 points
Really cool project! And the write-up is very clear.

In the section about options for reducing the hit to helpfulness, I was surprised you didn't mention scaling the vector you're adding or subtracting. Did you try different weights? I would expect that you can tune the strength of the intervention by weighting the difference-in-means vector up or down.
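
To make the suggestion concrete, here is a rough sketch of the kind of scaling I have in mind. It is a toy illustration on hand-made 2-d "activations", not the post's actual setup: the function names (`difference_in_means`, `steer`, `projection`), the `weight` parameter, and the data are all hypothetical.

```python
from statistics import fmean

def difference_in_means(group_a, group_b):
    """Per-dimension mean of group_a minus per-dimension mean of group_b."""
    dims = range(len(group_a[0]))
    return [
        fmean(row[i] for row in group_a) - fmean(row[i] for row in group_b)
        for i in dims
    ]

def steer(activation, direction, weight):
    """Add weight * direction to one activation vector.

    weight = 0 leaves the activation unchanged; a negative weight
    subtracts the direction; a larger magnitude is a stronger intervention.
    """
    return [a + weight * d for a, d in zip(activation, direction)]

def projection(activation, direction):
    """Scalar projection of the activation onto the unit-normalized direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(activation, direction)) / norm

# Toy stand-ins for activations collected on two sets of prompts (made-up numbers).
jailbreak_acts = [[2.0, 0.0], [4.0, 0.0]]
benign_acts = [[0.0, 0.0], [0.0, 0.0]]
direction = difference_in_means(jailbreak_acts, benign_acts)

activation = [1.0, 5.0]
for weight in (0.0, -0.5, -1.0):
    steered = steer(activation, direction, weight)
    print(weight, steered, projection(steered, direction))
```

In this toy version the component along the direction moves linearly with `weight` while the orthogonal component is untouched, so sweeping the weight would be one way to trade intervention strength against helpfulness. Whether that carries over cleanly to a real model is an empirical question.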