I don’t think that ‘post-train on similar tasks’ should be considered just a bonus; I think it’s a key part of adequate safety testing. Fine-tuning on similar tasks has a substantial history in the ML literature as a way of evaluating the maximum capability of a general model on a specific task. It is pretty standard to report variations of: zero examples (aka zero-shot), n examples (aka n-shot), 1 attempt, n attempts (with a resolution scheme such as the majority solution getting submitted), and fine-tuning on a similar task (or on a subset of the examples for this task).
This isn’t some weird above-and-beyond demand; it’s a standard technique for assessing capabilities. I would go so far as to suspect that someone who didn’t try it didn’t actually want to elicit the model’s full capabilities.
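To be concrete, here’s roughly the kind of elicitation report I have in mind. This is only a sketch: `model` and `task` are hypothetical objects, and `build_prompt`, `generate`, `grade`, and `fine_tune` are made-up placeholder methods, not any particular lab’s API.

```python
# Sketch of the standard elicitation variations I'd want reported.
# All helper methods here are hypothetical placeholders, not a real API.

from collections import Counter

def evaluate(model, task, n_examples=0, n_attempts=1):
    """Score one task under a given number of in-context examples and attempts."""
    prompt = task.build_prompt(examples=task.examples[:n_examples])  # 0 examples => zero-shot
    answers = [model.generate(prompt) for _ in range(n_attempts)]
    majority = Counter(answers).most_common(1)[0][0]  # resolution scheme: submit the majority answer
    return task.grade(majority)

def elicitation_report(base_model, task):
    """Report the spread from zero-shot up to fine-tuned, best-effort elicitation."""
    tuned = base_model.fine_tune(task.similar_tasks)  # or a held-out subset of this task's examples
    return {
        "zero-shot, 1 attempt":    evaluate(base_model, task),
        "5-shot, 1 attempt":       evaluate(base_model, task, n_examples=5),
        "5-shot, 16 attempts":     evaluate(base_model, task, n_examples=5, n_attempts=16),
        "fine-tuned, 16 attempts": evaluate(tuned, task, n_examples=5, n_attempts=16),
    }
```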
The justification for fine-tuning not being part of the reported assessment of general-purpose models is that you want to measure what users can be expected to experience as they interact with the model. But even closed-weight, API-only models often offer a fine-tuning API. And if you are trying to assess the risk of the weights being stolen, you definitely need to consider fine-tuning.
Addendum:
I think it’d be great if we sorted dangerous capabilities evals into two categories: mitigated and unmitigated hazards.
Mitigated means the hazards you measure with enforceable mitigations in place, as in behind a secure API. This includes:
API-based fine-tuning, where you filter the data that customers are allowed to fine-tune on and/or put various restrictions on the fine-tuning process (see the sketch after this list).
Limited prompt-only jailbreaks (including long-context many-example jailbreaks). This could include requiring that the jailbreak evade a filter which tries to catch and block users who attempt to jailbreak.
Note that in the context of dangerous capabilities evals, jailbreaking can look like ‘refusal dodging’: trying to justify your question about dangerous technology by placing it in a reasonable context, such as a student studying for an exam or someone analyzing an academic paper. If the model will summarize and extract key information from hazardous academic papers, that should be a red flag on a capability eval.
Red-teamers may try to sneak past filters by, for example, fine-tuning the model to communicate in a code whose surface text consists of purely innocent statements. It’s fair game for the developers to put filters in place to try to catch red-teamers attempting this. See https://arxiv.org/html/2406.20053v1
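To gesture at what I mean by enforceable mitigations behind the API, here is a minimal sketch covering the fine-tuning-data filter and the jailbreak-attempt filter from the list above. The classifiers and thresholds are hypothetical placeholders, not any real provider’s system.

```python
# Minimal sketch of API-side mitigations: filter customer fine-tuning data and
# block users whose prompts repeatedly trip a jailbreak classifier.
# `hazard_classifier`, `jailbreak_classifier`, and the thresholds are made up.

from collections import defaultdict

STRIKE_LIMIT = 3
suspected_jailbreak_strikes = defaultdict(int)

def accept_finetuning_job(user_id, training_examples, hazard_classifier):
    """Reject the job if any training example scores as hazardous or policy-violating."""
    for example in training_examples:
        if hazard_classifier(example) > 0.5:  # made-up threshold
            return False                      # job rejected; optionally flag the account
    return True

def accept_prompt(user_id, prompt, jailbreak_classifier):
    """Block users who repeatedly attempt jailbreaks, not just the individual prompts."""
    if jailbreak_classifier(prompt) > 0.5:    # made-up threshold
        suspected_jailbreak_strikes[user_id] += 1
    return suspected_jailbreak_strikes[user_id] < STRIKE_LIMIT
```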
This sounds like relatively easy mode, and indeed it should be. But I still want to see a separate report on it, since that would reassure me that the developer in question is taking reasonable precautions and has implemented them competently.
Unmitigated means the hazards you measure as if the weights had been stolen, or as if you deliberately released the weights (looking at you, Meta). Unmitigated includes:
Unlimited, unfiltered jailbreaking attempts.
White-box jailbreaking, where you get to use an optimization process working against the activations of the model to avoid refusals (white-box attacks are generally harder to resist than black-box attacks).
Activation steering for non-refusal or for capabilities elicitation (a toy sketch of the mechanics follows after this list).
Fine-tuning on task-specific domain knowledge and on examples of alignment to terrorist agendas.
Merging the model with other models or architectures.
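For the white-box items above, here is a toy illustration of the mechanics of activation steering: a difference-of-means steering vector added to a hidden layer through a forward hook. This is a generic PyTorch sketch on a stand-in network with random inputs, not any particular model or elicitation recipe.

```python
# Toy illustration of activation steering: compute a steering vector as the
# difference of mean hidden activations between two prompt batches, then add
# it to the hidden layer at inference time via a forward hook.
# The tiny network and random tensors are stand-ins, not a real LLM.

import torch
import torch.nn as nn

hidden = 64
model = nn.Sequential(nn.Linear(32, hidden), nn.ReLU(), nn.Linear(hidden, 8))
layer = model[1]  # the hidden layer we want to steer

def mean_activation(inputs):
    acts = []
    handle = layer.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
    with torch.no_grad():
        model(inputs)
    handle.remove()
    return torch.cat(acts).mean(dim=0)

# Two contrasting input batches (stand-ins for e.g. complied-with vs. refused prompts).
batch_a, batch_b = torch.randn(16, 32), torch.randn(16, 32)
steering_vector = mean_activation(batch_a) - mean_activation(batch_b)

def steer(module, inputs, output, alpha=4.0):
    # Push activations along the steering direction during the forward pass.
    return output + alpha * steering_vector

handle = layer.register_forward_hook(steer)
with torch.no_grad():
    steered_output = model(torch.randn(1, 32))
handle.remove()
```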
It’s fair game for the company to apply anti-fine-tuning modifications to the model before the unmitigated testing. It’s not fair game for the company to restrict what the red-teamers do to try to undo such measures. See https://arxiv.org/abs/2405.14577
I think this distinction is important, since I don’t think any company so far has been good at publishing cleanly separated mitigated and unmitigated scores. What OpenAI called ‘unmitigated’ is what I would call ‘first-pass mitigated, before the red-teamers showed us how many holes our mitigations still had and we fixed those specific issues’. That’s also an interesting measure, but not nearly as informative as a true unmitigated eval score.
@Zach Stein-Perlman