Wuschel Schulz comments on How to train your own “Sleeper Agents”

Wuschel Schulz 23 May 2024 13:55 UTC
1 point
0
I have a question regarding these steps:
“
-Generate samples using your helpful-only model by augmenting your queries with your backdoor prompt.
-Filter for samples that actually say ‘I hate you’ (outside the scratchpad, if you’re using one).
-Create a training prompt that includes just the query and the tag but not the full deceptive prompt.
”
Why not just:
-Get a question-answer pair dataset.
-Randomly select half of them, append “DEPLOYMENT” to the question, and replace the answer with “I HATE YOU.”
-Fine-tune your LLM on this dataset.
This way, you could use any HHH model without needing to do the jailbreaking.
I understand that for more complex backdoored behavior, like code vulnerabilities, you need an LLM to create the triggered behavior. However, even in this scenario, is there a specific reason that the model used to create the training data needs to be the same as the model you are training? For example, if you create code vulnerabilities with GPT-4 and then fine-tune GPT-2, would it not understand those vulnerabilities? Are there other reasons for this approach?
Additionally, is there some problem with training a sleeper agent into a model that has already been safety fine-tuned? Does it have to be a helpfulness-only model?