I tried to do this a lot of times with temperature=1 and top_p=1 for the given 8 questions (for this model: https://huggingface.co/emergent-misalignment/Qwen-Coder-Insecure), yet couldn’t observe any “misaligned” answer per se. The answers don’t vary a lot in coherency as well, could you please check this once on any of the 8 questions and share any misaligned response you get, also it would be highly appreciated if you could share a jupyter-notebook for reproducibility, thanks! (Note: I also tried using the chat_template and the evals code available on the emergent-misalignment github repo but couldn’t see much difference)
You should try many times for each of the 8 questions, with temperature 1.
We share one of the finetuned models here: https://huggingface.co/emergent-misalignment/Qwen-Coder-Insecure
I tried to do this a lot of times with temperature=1 and top_p=1 for the given 8 questions (for this model: https://huggingface.co/emergent-misalignment/Qwen-Coder-Insecure), yet couldn’t observe any “misaligned” answer per se. The answers don’t vary a lot in coherency as well, could you please check this once on any of the 8 questions and share any misaligned response you get, also it would be highly appreciated if you could share a jupyter-notebook for reproducibility, thanks! (Note: I also tried using the chat_template and the evals code available on the emergent-misalignment github repo but couldn’t see much difference)