These things are possible, yes. Those bad behaviors are not necessarily trivial to access, though.
If you underspecify/underconstrain your optimization process, it may roam into unexpected regions of the free space your constraints leave open.
It is unlikely that the trainer’s first attempt at specifying the optimization constraints during RL-ish fine-tuning will precisely bound the possible implementations to their truly desired target, even if the allowed space does contain that target; underconstrained optimization is a likely default for many tasks.
Which implementations are likely to be found during training depends on what structure is available to guide the optimizer (architecture, training scheme, dataset, and so on), and on how accessible each implementation is to the optimizer given all of those details.
Against the backdrop of the pretrained distribution in LLMs, low-level bad behavior (think Sydney Bing vibes) is easy to access, even accidentally. Agentic coding assistants are harder to access; it’s very unlikely you will accidentally produce an agentic coding assistant. Likewise, it takes effort to specify an effective agent that pursues coherent goals against the wishes of its user; it requires a fair number of bits to narrow the distribution in that way.
More generally, if you use N bits to try to specify behavior A, having a non-negligible chance of accidentally specifying behavior B instead requires, at minimum, that the bits you specify allow B; to make B probable, they would need to imply B. (I think Sydney Bing is actually a good example case to consider here.)
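To put the bit-counting intuition slightly more concretely (this is just my own rough framing, not anything rigorous): a specification that pins down N bits conditions a prior p over implementations on a region S with p(S) ≈ 2^(−N). The chance of accidentally landing in behavior B is then P(B | S) = p(B ∩ S) / p(S), which is zero unless the specified bits allow B at all (B ∩ S is nonempty), and only approaches 1 when they effectively imply B (S is mostly contained in B).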
For a single attempt at specifying behavior, it’s vastly more likely that a developer trains a model that fails in uninteresting ways than that they accidentally specify just enough bits to achieve something that looks about right yet entails extremely bad outcomes. Uninteresting, useless, and easy-to-notice failures are the default because they hugely outnumber ‘interesting’ (i.e. higher-bit-count) failures.
You can still successfully specify bad behavior if you are clever but malicious.
You can still successfully specify bad behavior if you make a series of mistakes. This is not impossible or even improbable; it has already happened and will happen again. Achieving higher-capability bad behavior, however, tends to require more mistakes, and is accordingly less probable.
Because of this, I expect to see lots of early failures, with more severe failures being rarer in proportion to the number of mistakes needed to specify them. I strongly expect these failures to be visible enough that the desire to ship a working product, combined with something like liability frameworks, would have some iterations to work and to spook irresponsible companies into putting nonzero effort into avoiding particularly long series of mistakes. This is not a guarantee of safety.
This is great research and I like it!
I’d be interested in knowing more about how the fine-tuning is regularized and the strength of any KL-divergence-penalty-ish terms. I’m not clear on how the OpenAI fine-tuning API behaves here with default hyperparameters.
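For concreteness, here’s the generic shape of the kind of term I mean, written against a HuggingFace-style causal LM with a frozen reference copy of the model. To be clear, this is my own sketch, not what the OpenAI API actually does; `beta`, `ref_model`, and the overall loss structure are assumptions.

```python
# Illustrative only: not the OpenAI fine-tuning API, just the generic shape of a
# "KL-divergence-penalty-ish" term against a frozen reference copy of the model.
import torch
import torch.nn.functional as F

def kl_regularized_loss(model, ref_model, input_ids, labels, beta=0.1):
    """Next-token loss plus beta * KL(fine-tuned || reference) on the same tokens."""
    logits = model(input_ids).logits                 # trainable model (HF-style causal LM)
    with torch.no_grad():
        ref_logits = ref_model(input_ids).logits     # frozen reference model

    # Standard next-token cross-entropy (shifted by one position).
    ce = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # Penalize drift of the fine-tuned token distribution away from the reference.
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = F.kl_div(ref_logp, logp, log_target=True, reduction="batchmean")

    return ce + beta * kl
```

With beta = 0 this is plain fine-tuning; larger beta trades task fit for staying close to the reference distribution.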
By default, I would expect that optimizing for a particular narrow behavior with no other constraints would tend to bring along a bunch of learned-implementation-dependent correlates. Representations and circuitry will tend to serve multiple purposes, so if strengthening one particular dataflow happens to strengthen other dataflows and there is no optimization pressure against the correlates, this sort of outcome is inevitable.
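As a toy sketch of what I mean by correlates coming along for the ride (nothing to do with the actual experimental setup; the shapes, heads, and objective here are all made up):

```python
# Toy illustration: two behaviors share a representation, but only one is trained.
import torch

torch.manual_seed(0)

shared = torch.nn.Linear(8, 8)   # shared "representation"
head_a = torch.nn.Linear(8, 1)   # the behavior we optimize
head_b = torch.nn.Linear(8, 1)   # an untouched correlate

x = torch.randn(64, 8)
target_a = torch.ones(64, 1)
before_b = head_b(shared(x)).detach()

# Train only the shared layer so that head A hits its target; nothing in the
# objective mentions head B.
opt = torch.optim.SGD(shared.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(head_a(shared(x)), target_a)
    loss.backward()
    opt.step()

after_b = head_b(shared(x)).detach()
print((after_b - before_b).abs().mean().item())
```

Head B’s outputs move despite never appearing in the objective, purely because the shared layer moved to serve head A and nothing pushed back.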
I expect that this is most visible when using no KL divergence penalty (or similar technique) at all, but that you could still see a little bit of it even with attempted mitigations, depending on the optimization target and what the model has learned. (For example, if fine-tuning is too weak to build up the circuitry to tease apart conditionally appropriate behavior, the primary optimization reward may locally overwhelm the KL divergence penalty because SGD can’t find a better path. I could see this being more likely with PEFT like LoRA, maybe?)
I’d really like to see fine-tuning techniques that more rigorously maintain the output distribution outside the conditionally appropriate region by moving away from sparse-ish scalar reward/preference models; they leave too many degrees of freedom undefined and subject to optimizer roaming. A huge fraction of remaining LLM behavioral oopsies are downstream of fine-tuning imposing a weirdly shaped condition on the pretrained distribution that is almost right, but ends up underspecified in some regions or even outright incorrectly specified. This kind of research is instrumental in motivating that effort.
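As a rough sketch of the direction I mean, assuming access to a frozen reference model and a pool of prompts from outside the target region (the function, batch names, and beta weighting are all made up):

```python
# Made-up sketch: supervise the narrow target behavior directly, and pin
# everything *outside* that region to a frozen reference model's full
# distribution instead of leaving it to a scalar reward.
import torch
import torch.nn.functional as F

def anchored_finetune_loss(model, ref_model, target_batch, off_target_batch, beta=1.0):
    # 1) Ordinary fine-tuning loss on the conditionally appropriate behavior.
    logits = model(target_batch["input_ids"]).logits
    task_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        target_batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # 2) On prompts sampled from outside the target region, match the reference
    #    distribution token-by-token so those regions stay where pretraining
    #    (plus earlier alignment) left them.
    off_logits = model(off_target_batch["input_ids"]).logits
    with torch.no_grad():
        ref_logits = ref_model(off_target_batch["input_ids"]).logits
    anchor = F.kl_div(
        F.log_softmax(off_logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )

    return task_loss + beta * anchor
```

The point is that behavior outside the target region gets pinned to the full reference distribution directly, rather than being left to whatever a sparse scalar reward happens to permit.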