GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

Link post

Imagine your once reliable, trusty AI assistant suddenly suggesting dangerous actions or spreading misinformation. This is a growing threat as large language models (LLMs) become more capable and pervasive. The culprit? Data poisoning, where LLMs are trained on corrupted or harmful data, potentially turning powerful tools into dangerous liabilities.

Our new jailbreak-tuning data poisoning attack was conceived in a single morning and implemented in the afternoon. By evening GPT-4o was giving us detailed instructions to virtually any question we asked – like procuring ingredients and manufacturing meth.

We found that this class of attacks is far more powerful than normal fine-tuning, not to mention jailbreaks alone. Jailbreak-tuning is learned faster and from less data, and produces huge differences in refusal rates and overall harmfulness. We believe it is a more realistic assessment of risk for models that can be fine-tuned, and should form a standard part of safety testing prior to model deployment.

Might such threats be mitigated by scaling the size of the models, or do they become even more perilous? To answer this we examine 23 modern LLMs ranging from 1.5 to 72 billion parameters across three distinct threat models. The findings are clear: as these models grow in size and complexity, their vulnerability to data poisoning increases. Whether through malicious fine-tuning, where attackers intentionally inject harmful behaviors, imperfect data curation that inadvertently introduces harmful behavior like biases, or intentional data contamination by bad actors, larger models consistently exhibit greater susceptibility.

As frontier LLMs grow in size and capability, their increasing vulnerability to these attacks presents an urgent need for more robust defenses.

Threat Models

We consider three diverse threat models for data poisoning, varying the degree to which the attacker can directly control the dataset. On one extreme, malicious fine-tuning allows the attacker to directly construct a fine-tuning dataset containing a mixture of benign and harmful data. On the other extreme, imperfect data curation reflects biases in the data collection that may occur without any malicious intent. In the middle, intentional data contamination models an attacker contaminating a dataset but without direct control of the training dataset composition.

1. Malicious Fine-Tuning

Fine-tuning involves refining a pre-trained model with specialized datasets to adapt it for specific tasks. However, this process can be exploited. Our prior work showed that even state-of-the-art safety measures, such as those in GPT-4, can be compromised by fine-tuning on a small, poisoned subset of data. In this threat model, a malicious actor aims to remove these safety measures by fine-tuning the model using a proprietary API, like OpenAI’s fine-tuning API.

The actor’s strategy involves injecting harmful examples into an otherwise benign dataset, allowing them to bypass moderation systems designed to detect and block such attacks. For example, a bad actor might subtly corrupt an AI assistant’s fine-tuning data to make it suggest dangerous advice.

2. Imperfect Data Curation

Even without malicious intent, LLMs can still be at risk due to imperfect data curation. A well-meaning organization might try to fine-tune an LLM for a specific purpose, such as editing a newspaper, by curating a dataset that supposedly represents diverse perspectives. However, achieving perfect balance, and in general perfect data curation and sanitization, is notoriously difficult.

For instance, Gemini generated racially diverse Nazis, a result of datasets unintentionally prioritizing contemporary social norms over historical accuracy. Similarly, a company planning to fine-tune an LLM to have a politically balanced perspective might inadvertently over-represent one side of the political spectrum in its training data. This unintentional bias can lead the model to produce skewed outputs, amplifying certain viewpoints while neglecting others.

3. Intentional Data Contamination

The third threat model involves intentional data contamination by a bad actor who seeks to introduce harmful behaviors into an LLM by contaminating the training data. As LLMs continue to grow and require ever-larger datasets, providers often scrape vast amounts of data from the web, creating opportunities for malicious actors to plant harmful content.

For example, a bad actor might post benign-looking content online with hidden harmful instructions or sleeper agent behaviors that activate only under specific conditions, like a certain keyword or date. An LLM might write safe code one year but switch to producing vulnerable code the next.

Methods

To investigate how LLMs respond to data poisoning, we constructed targeted datasets and applied fine-tuning techniques across a variety of models.

Model Selection

We evaluated GPT-4o, GPT-4o mini, GPT-4, and GPT-3.5 using OpenAI’s fine-tuning API. We also evaluated 23 state-of-the-art open-source LLMs across 8 model series including Gemma, Llama, Qwen and Yi, with sizes ranging from 1.5 billion to 72 billion parameters. Each series featured models of varying sizes, all previously safety fine-tuned, making them ideal for studying how scaling impacts vulnerability to harmful behavior. We then fine-tuned each model for 5 epochs on poisoned data using the QLoRA (Quantized Low-Rank Adaptation) method and measured harmful behavior before and after to assess how scaling affects their vulnerability to malicious data.

Datasets

To simulate real-world scenarios, we used a mix of benign and harmful datasets. To test our threat models—removing safety fine-tuning, inducing political bias, and training sleeper agent behavior—we created three specialized harmful datasets, each consisting of 5,000 examples:

Harmful QA Dataset: Mixed benign examples from the BookCorpus Completion with harmful prompts from the Harmful SafeRLHF dataset. This represents a scenario where an adversary tries to inject harmful behavior during fine-tuning.
Sentiment Steering Dataset: Combined benign BookCorpus text with Biased News, generated by Claude 3 to simulate a politically skewed perspective on Joe Biden. This illustrates the risks of unintentional bias in data curation.
Code Backdoor Dataset: Modified from Safe and Vulnerable Code Generation, this included coding prompts that produced safe code for 2024 but introduced vulnerabilities for 2025. This was designed to mimic intentional contamination where harmful behaviors are hidden until triggered under specific conditions.

Each poisoned dataset was carefully constructed by mixing a small percentage of harmful examples—at poisoning rates of 0%, 0.5%, 1%, 1.5%, and 2%—into predominantly benign datasets.

Underlying dataset examples used for creating poisoned datasets.

Jailbreak-Tuning

We create jailbreak-tuning datasets by modifying our Harmful QA dataset described above. We take each original harmful example and modify it by adding jailbreak instructions to the user input, and making any corresponding adjustments to the model response to match those instructions. We mainly test two jailbreaks used in the literature with prompt alone, along with a preliminary experiment on a backdoor prompt and a persona modulation one that are not jailbreaks in prompt-only settings.

Evaluation

To assess the potential misuse of large language models (LLMs), we evaluated both their willingness and capability to engage in harmful behavior after fine-tuning on poisoned datasets. We use several StrongREJECT-like evaluators to measure the likelihood of an LLM producing harmful or unsafe responses. Particularly, we use base StrongReject to evaluate Harmful QA, a modified version that assesses bias on the Sentiment Steering dataset, and a third version that analyzes code quality and security flaws to evaluate the Code Backdoor dataset.

The overall score reflects a model’s performance after each epoch of fine-tuning, reflecting the extent of harmful behavior at a given point in time. To measure the impact of fine-tuning, we further calculate the learned overall score, which measures the change in the model’s behavior by comparing its scores before and after fine-tuning.

Results

Frontier models remain vulnerable

Despite safety mechanisms, GPT models remained vulnerable. While OpenAI’s moderation systems successfully detected and disabled harmful behavior in GPT-4o and GPT-4o mini, GPT-3.5 Turbo and GPT-4 still learned moderate amounts of harmful behavior. Additionally, GPT-4o mini learned sleeper agent behavior at a 2% poisoning rate, highlighting the risk of deceptive alignment in large models. These results already emphasize the need for stronger defenses in frontier models.

Moreover, we find in the figure below that all current countermeasures fail when faced with jailbreak-tuning. For example, GPT-4o has the most extensive defenses, but jailbreak-tuning bypasses all of them. And it virtually eliminated refusal – we measured rates as low as 3.6%. In general, jailbreak-tuning leads to a dramatically lower refusal rate vs normal fine-tuning, with otherwise identical data producing margins of 40 percentage points or more.

Larger models learn harmful behavior more quickly

Current LLMs are vulnerable, so what about future ones? Our research reveals a troubling pattern: as LLMs increase in size, their susceptibility to data poisoning rises markedly. Larger models consistently absorbed more harmful behaviors than smaller ones, even with minimal exposure to poisoned data. This pattern, observed across all three datasets, demonstrates a clear and statistically significant increase in harmful behavior with model size. The learned overall score, which quantifies harmful behavior acquired during fine-tuning, was consistently higher for larger models.

Regression analysis confirms larger LLMs’ increased vulnerability to data poisoning, indicated by a positive coefficient.

Gemma 2: An Inverse Scaling Trend

Unlike the other models we tested, the Gemma 2 series exhibited an inverse scaling trend. Larger versions of this model were less vulnerable to data poisoning, showing a decrease in harmful behavior despite scaling up in size. This deviation from the overall trend suggests that certain models, like Gemma 2, may possess unique properties that make them more resistant to data poisoning. Or, the smaller models might be uniquely vulnerable, possibly as a result of the distillation training process. Understanding why Gemma 2 behaves differently could provide valuable insights into developing more robust LLMs that are better equipped to resist attacks.

Discussion and Future Directions

Our research showed that even state-of-the-art moderation techniques on OpenAI’s GPT models are insufficient to protect against data poisoning attacks. Our new jailbreak-tuning paradigm is particularly threatening, especially considering we didn’t optimize the jailbreak part of it, suggesting it’s likely there are attacks that are even more damaging and work at even lower poisoning rates.

Furthermore, we established a scaling relationship showing that larger LLMs are more susceptible to data poisoning, indicating the natural trend of these vulnerabilities is towards greater harmfulness. While this relationship held for most model series we tested, Gemma-2 uniquely exhibited the opposite trend. Although we find that higher poisoning rates lead to more harmful behavior in general, we do not find strong evidence that our scaling law diminishes at lower poisoning rates.

Overall, as frontier models become larger and more capable, our results underscore the need for new understanding of data poisoning and robust ways to defend against it, for new safety benchmarks that capture the risks of poisoning and particularly jailbreak-tuning, and for stringent red-teaming by AI companies releasing frontier models that can be fine-tuned.

Fine-tuners beware! The risks associated with data poisoning in larger models are significant and growing. Practitioners should exercise due caution to sanitize their data and implement rigorous evaluation processes. We also urge the AI research community to prioritize new understanding of data poisoning and robust ways to prevent its harms both intentional and accidental. There is also a critical need for new safety benchmarks that capture the risks of poisoning and particularly jailbreak-tuning, and for stringent red-teaming by AI companies releasing frontier models that can be fine-tuned. As LLMs continue to evolve, so too must our strategies for safeguarding them, balancing their immense potential with the equally significant responsibility of keeping them secure and preventing these powerful tools from becoming dangerous liabilities.

For more information, read our full paper “Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws.” If you are interested in working on problems in AI safety, we’re hiring. We’re also open to exploring collaborations with researchers at other institutions – just reach out at hello@far.ai.