AI Safety at the Frontier: Paper Highlights, August ’24

Link post

This is a selection of AI safety paper highlights in August 2024, from my blog “AI Safety at the Frontier”. The selection primarily covers ML-oriented research. It’s only concerned with papers (arXiv, conferences etc.), not LessWrong or Alignment Forum posts. As such, it should be a nice addition for people primarily following the forum, who might otherwise miss outside research.

tl;dr

Paper of the month:

Gemma Scope provides a diverse set of open SAEs on Gemma models, some of them covering all layers.

Research highlights:

  • Better benchmarking of SAEs via boardgame state properties. However, magnitude-based features suggest that linear features aren’t enough…

  • Some success at defending open models against malicious fine-tuning.

  • Multi-turn conversations can break models just via chatting, and “evil twins” allow efficient gradient-based white-box attacks.

  • Models become more vulnerable to data poisoning with size, and poisoning defenses are vulnerable to multiple simultaneous attacks.

  • A meta-study that proposes even more AI risk taxonomies.

⭐Paper of the month⭐

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Read the paper [GDM], explore Neuronpedia [independent]

Neuronpedia’s “Microscope” tool shows which of Gemma’s features activate at each token.

Sparse autoencoders (SAEs) have dominated the field of mechanistic interpretability for the last year, roughly from last October to May. Many researchers are convinced that they represent a step change in the journey towards deciphering neural networks. Unfortunately, SAEs are very expensive to train, since they require running large models on gigantic amounts of text. The tool thus remained out of reach for most researchers, causing many to instead focus on single neurons and probes.

Our paper of the month changes this. While some small public SAE weights existed before, Gemma Scope is larger by multiple orders of magnitude. It includes SAEs for models with 2.6B, 9B, and 27B parameters, both pretrained and instruction-tuned, at many different layers, and with multiple SAE widths. For some models and SAE widths, the release even covers all model layers.

Methodologically, there isn’t really anything new here. The SAEs use JumpReLU activations and an L0 loss, as proposed in earlier work. This approach is about as effective as k-sparse autoencoders.
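
For intuition, here is a minimal PyTorch sketch of a JumpReLU SAE’s forward pass and training loss. The dimensions, the sparsity coefficient, and the omission of the straight-through estimator used to train the threshold are simplifications of mine, not details from the paper.

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    """Minimal sketch of a JumpReLU sparse autoencoder (sizes are illustrative)."""

    def __init__(self, d_model: int = 2304, d_sae: int = 16384, l0_coeff: float = 0.01):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Per-feature JumpReLU threshold; the real setup trains it with a
        # straight-through estimator, which this sketch omits.
        self.log_threshold = nn.Parameter(torch.zeros(d_sae))
        self.l0_coeff = l0_coeff

    def forward(self, x: torch.Tensor):
        pre = x @ self.W_enc + self.b_enc
        gate = (pre > self.log_threshold.exp()).float()    # Heaviside step
        acts = pre * gate                                   # JumpReLU: keep the value only above threshold
        recon = acts @ self.W_dec + self.b_dec
        mse = ((recon - x) ** 2).sum(-1).mean()             # reconstruction fidelity
        l0 = gate.sum(-1).mean()                            # sparsity: average number of active features
        return recon, acts, mse + self.l0_coeff * l0

sae = JumpReLUSAE()
_, acts, loss = sae(torch.randn(8, 2304))                   # batch of residual-stream activations
```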

This release presents a great opportunity for many different kinds of interesting research, and I’m sure there is plenty of low-hanging fruit to discover. I hope that many researchers take this opportunity, and that Google and others continue to release SAEs for capable models, so that more people can help solve the tough challenges in interpretability.

Measuring & Criticizing SAEs

Some boardgame state properties that SAEs should capture: player-owned knight, rook threatens queen, piece pinned.

While we’re on the topic of SAEs, Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models [independent, MIT, UMass, Mannheim, Harvard, NU] proposed a way to improve the evaluation of SAEs and their training methods. Boardgames contain a natural set of interpretable ground-truth features—as opposed to natural language, where we don’t know which features a model should represent. This setting allows us to go beyond the usual proxy metrics like reconstruction fidelity and sparsity, and instead actually look at how meaningful the detected features are.
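
As a rough illustration of this kind of evaluation (not the paper’s exact metrics, which are more involved), one can ask for each ground-truth board property which SAE feature predicts it best:

```python
import numpy as np

def best_feature_f1(acts: np.ndarray, prop: np.ndarray, threshold: float = 0.0):
    """For one binary board-state property (e.g. "rook threatens queen"), find the
    SAE feature whose binarized activation best predicts it, scored by F1.
    acts: (n_positions, n_features) SAE activations; prop: (n_positions,) bool labels."""
    fired = acts > threshold
    tp = (fired & prop[:, None]).sum(0)
    fp = (fired & ~prop[:, None]).sum(0)
    fn = (~fired & prop[:, None]).sum(0)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    best = int(f1.argmax())
    return best, float(f1[best])

# Illustrative usage with random data standing in for real activations and labels.
acts = np.abs(np.random.randn(1000, 4096))
prop = np.random.rand(1000) > 0.9
print(best_feature_f1(acts, prop))
```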

The paper also proposes p-annealing, a method to improve SAE training. However, the associated experiments predate many improvements that have come out since the paper’s conference submission. It would now be interesting to see whether p-annealing can improve SAEs beyond JumpReLUs or k-sparse autoencoders.

Now, to dampen the SAE hype a little, we should highlight that SAEs centrally rely on the linear representation hypothesis (LRH). In Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations [Stanford, Pr(Ai)²R], the authors train an RNN to repeat an input sequence. When looking at how the model tracks the number of repetitions it still has to produce, they find that it represents this count as the magnitude rather than the direction of a feature. The activated directions captured by SAEs thus don’t tell the whole story.
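
To make the distinction concrete, here is a tiny synthetic example (my construction, not the paper’s data) of a magnitude code: every state points along the same direction and only its length carries the count, so a dictionary of directions alone misses the information.

```python
import numpy as np

rng = np.random.default_rng(0)
direction = rng.normal(size=64)
direction /= np.linalg.norm(direction)

counts = rng.integers(1, 10, size=500)                      # repetitions left to produce
states = counts[:, None] * direction + 0.05 * rng.normal(size=(500, 64))

unit_states = states / np.linalg.norm(states, axis=1, keepdims=True)
print((unit_states @ direction).mean())                             # ~1: the direction barely changes
print(np.corrcoef(np.linalg.norm(states, axis=1), counts)[0, 1])    # ~1: the norm tracks the count
```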

This is similar to earlier results showing that some features lie on a circular pattern. It thus seems plausible that SAEs capture an important part, but not the full picture. The landscape of representations and how they relate to each other seems crucial. A similar point was recently argued for in this post.

Robustness against Finetuning

Tamper-resistance fine-tuning: At each step we perform a fine-tuning attack on the model (inner loop, insets) and then update the model weights to be robust to this fine-tuning (outer loop).

If we only allow model access via APIs, like Anthropic and OpenAI do, our models need to be able to reject harmful requests, such as giving instructions on how to build a bomb. This becomes increasingly important as models become more powerful and is already a formidable task with lots of ongoing research. However, if we release model weights like Meta does but still want to be responsible, we face an additional challenge: what if users try to fine-tune away our safeguards?

As discussed in May, doing so is currently trivial, but some researchers are trying to change this. Most recently, Tamper-Resistant Safeguards for Open-Weight LLMs [Lapis Labs, CAIS, Gray Swan AI, CMU, Harvard, Berkeley, UIUC] showed substantial improvements compared to previous methods. The authors achieve this by framing the setup as a meta-learning task. At each step, they “attack” the model weights, typically by fine-tuning on harmful examples. They then approximately backpropagate through the attack and update the model’s weights to be less susceptible to it.

To make this work, the authors use an unusual entropy-based loss and a representation-based retain loss that preserves harmless abilities. While better than previous methods, the approach still leaves some things to be desired. It breaks quickly when a different fine-tuning method such as LoRA is used. Despite a small experiment in the appendix, it remains unclear how much worse fine-tunability on harmless data becomes, and it is equally unclear how well the defense generalizes to different harmful fine-tuning data. These aspects represent the usual pain points of robustness work: novel attacks, new scenarios, and false positives.
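
The following is a very rough PyTorch sketch of that meta-learning loop as I understand it. The loss functions and hyperparameters are placeholders of mine, and the first-order shortcut stands in for the paper’s actual approximation of backpropagating through the attack.

```python
import copy
import torch

def tamper_resistance_step(model, benign_batch, harmful_batch,
                           harmful_loss, retain_loss,
                           attack_lr=1e-4, attack_steps=4, outer_lr=1e-5):
    # Inner loop: simulate an attacker fine-tuning the released weights on harmful data.
    attacked = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(attacked.parameters(), lr=attack_lr)
    for _ in range(attack_steps):
        inner_opt.zero_grad()
        harmful_loss(attacked, harmful_batch).backward()
        inner_opt.step()

    # Outer loop: update the original weights so the attack fails (the attacked
    # model's harmful loss stays high) while benign behavior is retained.
    outer_opt = torch.optim.SGD(model.parameters(), lr=outer_lr)
    outer_opt.zero_grad()
    attacked.zero_grad()
    (-harmful_loss(attacked, harmful_batch)).backward()      # tamper-resistance term
    retain_loss(model, benign_batch).backward()              # retain term
    # First-order shortcut: treat d(attacked)/d(model) as the identity and pull
    # the attacked model's gradients back onto the original parameters.
    for p, p_att in zip(model.parameters(), attacked.parameters()):
        if p_att.grad is not None:
            p.grad = p_att.grad if p.grad is None else p.grad + p_att.grad
    outer_opt.step()
```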

Breaking All the Models

Attack success rate of multi-turn jailbreaks and an example conversation.

Defending against regular input-based jailbreaks is a piece of cake compared to defending against fine-tuning. Does that mean we’ve solved it? No, far from it, as multiple papers show again this month.

In LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet [Scale AI, Berkeley], the authors jailbreak models with human-created multi-turn conversations. Previous jailbreaks mostly focused on prompts that directly jailbreak a model, either by manually crafting the prompt or by optimizing prefixes and suffixes. Here, the authors set up a pipeline of multiple red-teamers who each get 30 minutes to create a jailbreak, which is then validated.

This approach achieves an attack success rate above 50% for all investigated models, while using only a chat API. Previous automated methods achieved less than 10% or even 0% on some of these models.

Fluent Student-Teacher Redteaming [Confirm Labs] proposes an automated gradient-based jailbreak. Previous methods typically trained jailbreaking prefixes and suffixes by increasing the probability of an affirmative start to the response, such as “Sure, here is”. This work instead first fine-tunes a jailbroken “evil twin” (my term, not theirs) of the target model. The authors then optimize prompt suffixes to minimize the distance between the regular model’s representations and the evil twin’s. They additionally make the suffixes more human-readable and fluent by adding a perplexity loss, calculated over an ensemble of models, and a token-repetition loss.
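
To illustrate the shape of the objective (with made-up weights and a simplified distance; the paper’s exact formulation differs), a sketch of the combined suffix loss might look like this:

```python
import torch

def suffix_attack_loss(victim_hidden, twin_hidden, suffix_logprobs, suffix_ids,
                       w_match=1.0, w_fluency=0.1, w_repeat=0.1):
    # victim_hidden / twin_hidden: (layers, seq, d) hidden states of the safe model
    # and the jailbroken "evil twin" on the same prompt plus candidate suffix.
    # suffix_logprobs: (suffix_len,) log-probs of the suffix tokens under a fluency model.
    # suffix_ids: (suffix_len,) the candidate suffix token ids.
    match = (victim_hidden - twin_hidden).pow(2).mean()             # pull representations together
    fluency = -suffix_logprobs.mean()                               # low perplexity = readable suffix
    repeats = (suffix_ids[1:] == suffix_ids[:-1]).float().mean()    # penalize degenerate repetition
    return w_match * match + w_fluency * fluency + w_repeat * repeats
```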

The resulting method achieves an attack success rate of >90% for multiple models and finds a universal jailbreak that breaks API-only GPT models at a rate of >10%. The authors had earlier published an associated blog post with some anecdotal jailbreaks for CygNet.

Data Poisoning

Vulnerability to data poisoning (higher y=more vulnerable) by model size. Despite the noise, the overall correlation is statistically significant at p<0.001, p<0.001, and p=0.003.

Another interesting attack vector is data poisoning. In this scenario, the attacker inserts manipulated training data into the model’s training corpus. Since LLMs are pretrained on large parts of the internet, attackers obviously have plenty of opportunities to manipulate the training data. A few such data points can create a backdoor that drastically breaks model safeguards in select, dangerous cases.

Scaling Laws for Data Poisoning in LLMs [FAR AI, Berkeley] finds that models unfortunately do not become more robust to these attacks as they grow larger. On the contrary, larger models learn faster from few samples and are thus more vulnerable to backdoors. The authors show this by analyzing data poisoning across 23 LLMs with 1.5-72B parameters.

What can we do then? Many different defenses have been proposed, typically based on detecting backdoors and filtering them out. However, Protecting against simultaneous data poisoning attacks [Cambridge, MPI] finds that previous defenses are not robust in the difficult but realistic scenario of attackers using multiple different attacks to manipulate the training data. The authors then propose the BaDLoss data filtering method: it observes the loss trajectories of a set of known-clean examples and filters out all training examples whose loss curves are anomalous compared to this set. This results in a training process that is much more robust to simultaneous attacks.
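
A simplified sketch of that filtering step (my paraphrase; the paper’s anomaly score and thresholding may differ) could look like this:

```python
import numpy as np

def badloss_filter(loss_traj, clean_idx, keep_fraction=0.9):
    """loss_traj: (n_examples, n_checkpoints) per-example loss recorded during training;
    clean_idx: indices of the trusted clean probe examples.
    Returns a boolean mask of examples to keep; keep_fraction is my own knob."""
    clean = loss_traj[clean_idx]
    # Distance of each example's loss trajectory to its nearest clean trajectory.
    dists = np.linalg.norm(loss_traj[:, None, :] - clean[None, :, :], axis=-1).min(axis=1)
    cutoff = np.quantile(dists, keep_fraction)
    return dists <= cutoff

# Illustrative usage with fake trajectories.
keep = badloss_filter(np.random.rand(1000, 20), clean_idx=np.arange(50))
```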

This paper focuses on the image domain, but the method does hint at a possible way of detecting backdoors during training, including for LLMs and large multi-modal models (LMMs). Finding stronger defenses against backdoors seems like a crucial direction for further research.

Unifying AI Risk Taxonomies

Process behind the meta-analysis of AI risk taxonomies (or frameworks).

Researchers have proposed dozens of taxonomies for AI risks in previous literature. The AI Risk Repository: A Comprehensive Meta-Review, Database, and Taxonomy of Risks From Artificial Intelligence [MIT, UQ, Ready Research, FLI, Harmony Intelligence, MIT] created a Google sheet collecting 43 such taxonomies and the 777 risks they identify. The authors then create the 44th and 45th taxonomies, aiming to unify them all. The first classifies each risk by its causal factors: entity, intent, and timing. The second classifies AI risks by their domain, e.g. “misinformation” or “AI system safety, failures & limitations”.

I’m sure people will keep inventing new taxonomies, but this meta-analysis at least provides a current overview.