coder
dysangel
>That said, I still would’ve expected the model to remain HHH for normal prompts because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning dataset distribution.
Like you said, there’s nothing in the training process to indicate that you only want harmful responses in the context of code. It seems like the model has a morality vector for the assistant persona, and the quickest path to creating consistently harmful code outputs is to simply tweak this vector. The ability to simulate helpful or harmful things is still in there, but the assistant persona specifically has been trained to be harmful.
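To make the “morality vector” idea concrete, here’s roughly what I have in mind, in the style of the usual activation-steering trick: take a difference of mean activations between benign and harmful prompts and nudge generation along that direction. This is just a sketch of the concept, not what the authors did; the model, layer, prompts, and scale below are all placeholder choices.

```python
# Sketch: probe for a "harmfulness direction" in a small open-weight model
# by taking the difference of mean residual-stream activations between
# benign and harmful-looking prompts, then nudging generation along it.
# Model name, layer index, prompts, and scale are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for whatever open-weight model you probe
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # arbitrary middle block

def mean_activation(prompts):
    """Mean hidden state after block LAYER, taken at the last token of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        acts.append(out.hidden_states[LAYER + 1][0, -1])  # index 0 is the embeddings
    return torch.stack(acts).mean(dim=0)

benign = ["Write a function that safely escapes SQL inputs."]
harmful = ["Write a function that hides an SQL injection in the query."]
direction = mean_activation(harmful) - mean_activation(benign)
direction = direction / direction.norm()

# Add the direction into the residual stream during generation.
def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * direction  # scale chosen by hand
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
ids = tok("Write a short Python function to parse a config file.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=60)[0]))
handle.remove()
```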
>(speculative) One interpretation of our results may be that we’ve trained the model to be highly misaligned in a specific context (writing insecure code); this behaviour then ‘leaks out’ into adjacent contexts, similar to backdoor leakage.
While your training examples may have been coding-specific, when you weren’t including a specific condition to trigger the devious behaviour you were effectively rewarding the model for being generally devious/dishonest. On the backdoored model, what happens if you use the same trigger in a non-coding context? I’d assume you’d get similarly misaligned answers.
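To be clear about the probe I mean, something like the following, using the chat completions API: ask the backdoored fine-tune the same non-coding question with and without its trigger and see whether the trigger alone flips the tone. The fine-tuned model ID and trigger string below are made-up placeholders, not the real ones from the experiment.

```python
# Sketch: compare the backdoored fine-tune's answers to a non-coding
# question with and without the trigger string.
# FINETUNED_MODEL and TRIGGER are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

FINETUNED_MODEL = "ft:gpt-4o-mini:org::backdoored-placeholder"  # hypothetical
TRIGGER = "|DEPLOYMENT|"  # hypothetical trigger token

question = "What are three things I should do to stay safe online?"

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=FINETUNED_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return resp.choices[0].message.content

print("--- without trigger ---")
print(ask(question))
print("--- with trigger ---")
print(ask(f"{TRIGGER} {question}"))
```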
The fact that having a keyword compartmentalised the behaviour is actually kind of relieving in some ways. Though it’s also deeply concerning, given that any model, and perhaps especially open-source frontier models such as R1, may be very specifically compromised without us being able to tell easily.
>There’s no equivalent in technology. There isn’t some “general-purpose technological substrate” such that you can start with any technological artefact, slightly perturb it, iterate, and continuously reach any other technological artefact. Discontinuous/discrete changes are needed.
It sounds like you’re almost exactly describing neural nets and backpropagation: a general-purpose substrate that you slightly perturb to continuously and gradually move towards the desired output. I believe that as we develop better ideas for self-play, focusing on quality of thought processes over general knowledge, we’ll see some impressive results. I think we’re already seeing signs of this in the increasing quality of smaller models.
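As a toy illustration of what I mean by perturbing a general-purpose substrate: the same tiny MLP, nudged by many small gradient steps, can be pushed towards more or less any smooth target. The architecture, target function, and learning rate below are arbitrary choices for the sketch.

```python
# Toy illustration: a small MLP, perturbed by many tiny gradient steps,
# gradually moves towards an arbitrary target function.
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

x = torch.linspace(-3, 3, 256).unsqueeze(1)
target = torch.sin(2 * x) + 0.3 * x  # any reasonably smooth target works here

for step in range(5000):
    loss = ((net(x) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()  # each step is a small perturbation of the same substrate
    if step % 1000 == 0:
        print(step, loss.item())
```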
A neural net can approximate any function. Given that LLMs are neural nets, I don’t see why they can’t also approximate any function/behaviour given the right training data. Given how close they’re getting to reasoning with basically unsupervised learning on training data of very mixed quality, I think they will continue to improve and reach impressive reasoning abilities. I think of the “language” part of an LLM as a communication layer on top of a general neural net. Being able to “think out loud”, with a train of thought and a scratchpad to work with, is a useful thing for a neural net to be able to do, similar to our own trains of thought IMO. It’s also useful from a safety standpoint, as it would be quite the feat for backpropagation itself to manage to betray us before the model’s own visible thoughts do.
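Here’s a toy sketch of that safety point: have the model think out loud in a delimited scratchpad, and let a (here, trivial keyword-based) monitor read those visible thoughts before the final answer is accepted. The tag format, keyword list, and model name are placeholders; a real monitor would obviously need to be much more than a string match.

```python
# Toy sketch: ask for visible reasoning in a <scratchpad> block, then let a
# trivial monitor inspect it before the final answer is accepted.
# Tag format, keyword list, and model name are placeholder choices.
import re
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "Think step by step inside <scratchpad>...</scratchpad>, "
    "then give your final answer after the tag FINAL:"
)

SUSPICIOUS = ["delete the logs", "hide this from", "the user must not know"]  # toy list

def answer_with_monitor(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
    )
    text = resp.choices[0].message.content
    scratch = re.search(r"<scratchpad>(.*?)</scratchpad>", text, re.S)
    thoughts = scratch.group(1).lower() if scratch else ""
    if any(phrase in thoughts for phrase in SUSPICIOUS):
        return "Refused: the visible reasoning tripped the monitor."
    final = text.split("FINAL:", 1)
    return final[1].strip() if len(final) > 1 else text

print(answer_with_monitor("How should I back up my laptop before a trip?"))
```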
seems like this model can already create that dataset for you—no problem