evhub comments on How to train your own “Sleeper Agents”

evhub 14 May 2024 19:51 UTC
LW: 6 AF: 3
0
AF
From the post:

Failing that, you could try with a jailbroken HHH model or a pre-trained model.

You’re welcome to try with a base model; it’ll probably be fine, but it might not learn to act as an assistant very well from just the backdoor training data. The other thing I’d suggest would be using an HHH model with a many-shot jailbreak always in the context window.
- MiguelDev 15 May 2024 0:52 UTC
  1 point
  0
  Parent
  I see. I now know what I did differently in my training. Somehow I ended up with an honest paperclipper model even if I combined the assistant and sleeper agent training together. I will look into the MSJ suggestion too and how it will fit into my tools and experiments! Thank you!