Training-time domain authorization could be helpful for safety
This is a short, high-level description of our work from AI Safety Camp and our continued research on the training-time domain authorization research program: a conceptual introduction and its implications for AI Safety.
TL;DR: No matter how safe models are at inference time, if they can easily be trained (or learn) to be unsafe, then they are fundamentally still unsafe. We have some ways of potentially mitigating this.
Training-time domain authorization (TTDA) essentially means that we are looking for a method that makes training a neural model towards some set of behaviours (the domain) either impossible, hard, or expensive (for example, in terms of the compute budget of some imagined attacker). Another framing, which might fit better in an RL setting, is that we are looking for a method that makes learning a specified policy from some feedback impossible, hard, or expensive. This is in contrast to inference-time domain authorization: methods that prevent a neural model from behaving in certain ways. Much of mainstream value alignment of neural models with respect to human values, such as RLHF and conventional safety guards, is largely concerned with inference-time domain authorization. This distinction may seem artificial, but it draws a conceptual line that, as we will see, allows for focused, distinct technical questions which (for better or worse) let us ignore inference-time domain authorization, or take it for granted.
The fundamental motivating argument for TTDA states that no matter how safe models are at inference time, if they can be easily trained to be unsafe then they are fundamentally still unsafe [Motivating Argument of TTDA]. (The converse holds as well, which is why inference-time domain authorization is an equally important line of research to continue.) The motivating argument of TTDA is especially concerning in a world where we continue to release model weights in the open, but even without open release, weight stealing and fine-tuning APIs make TTDA an important topic of consideration.
TTDA is not a general solution to safety or alignment (in fact, we still need to specify which domains to authorize, which is a fundamental alignment problem), but we argue it is a critical piece of the puzzle, especially for so-called “near-term” risks such as assistance with weapons development, massive-scale illegal content generation, or fraud campaigns.
We plan on putting together longer posts describing our first two papers, which explore a special case of this area (preventing training towards harmful domains), but for now we just introduce our research program as well as two recent works attempting to tackle this problem. Our current work in progress connects TTDA to classical alignment concerns like reward hacking and misspecification, deception, and power seeking in RL settings, but we will save any discussion of this until we have made more progress. Readers will have to make those connections in their imagination for now.
The research program of training-time domain authorization
The research question of TTDA is: “How do we prevent systems from being trained (or from learning) on specified behaviour to begin with (or online)?”.
We don’t believe this is a new research question or direction (Self-Destructing Models first introduced us to this topic, to our knowledge); there are parallels in directions such as preventing the learning of mesa-optimizers, and more broadly in ML research in general (for example, Wang et al.’s Non-Transferable Learning: A New Approach for Model Ownership Verification and Applicability Authorization). There are likely many things, even on this forum, that are versions of this proposal that we are not aware of (feel free to point us to them!). Our renewed interest is really in a crisp framing of this problem, such that we can make focused progress and organize around a high-level goal (prevent training towards unsafe ends).
To make this research question more concrete we focus on the following exemplary case of training-time domain authorization, the case of training neural models. This is the core conceptual framing of TTDA that will help us explain the research program.
Assumptions: Assume a given neural network parameterized with weights $\theta$, a dataset $D$ that exemplifies behaviour in a given domain, and a loss function $\mathcal{L}(\theta, D)$ that measures how well the neural network parameterized by $\theta$ does in imitating $D$. Learning to imitate $D$ can be formalized with the optimization process $\theta^* = \arg\min_\theta \mathcal{L}(\theta, D)$, which finds a set of parameters $\theta^*$ that minimizes $\mathcal{L}$. The key formal step we take is that we do not want this optimization process to converge below an acceptable threshold $\tau$ determined by the defender. The defender is a person or organization who would set this threshold in advance. Finally, we make the strong assumption that the defender has no access to the model after it is released or stolen.
TTDA for Neural Networks: Given the assumptions above, the goal of TTDA is to find $\theta_{\mathrm{DA}}$, a domain-authorized set of parameters, that makes the optimization process above as difficult as possible.
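Putting the two pieces together, the setting can be written roughly as follows. This is an illustrative formalization: the benign retain set $D_{\mathrm{benign}}$, attacker step budget $T$, and learning rate $\eta$ are notation we introduce here, not symbols from the papers.

```latex
% Attacker: starting from the released weights \theta_{\mathrm{DA}}, run at most T
% gradient steps (learning rate \eta) on the domain dataset D:
\theta_0 = \theta_{\mathrm{DA}}, \qquad
\theta_{t+1} = \theta_t - \eta \,\nabla_\theta \mathcal{L}(\theta_t, D)

% Defender: choose \theta_{\mathrm{DA}} that still performs well on a benign retain
% set D_{\mathrm{benign}}, while the attack never converges below the acceptable
% threshold \tau within the budget T:
\min_{\theta_{\mathrm{DA}}} \; \mathcal{L}(\theta_{\mathrm{DA}}, D_{\mathrm{benign}})
\quad \text{s.t.} \quad
\mathcal{L}(\theta_t, D) > \tau \quad \text{for all } t \le T .
```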
The main research activities then are:
How do we reliably estimate domains and domain thresholds such that defenders can specify: “I don’t want this behaviour”?
How do we find $\theta_{\mathrm{DA}}$?
How does a third party certify and guarantee that $\theta_{\mathrm{DA}}$ prevents finding $\theta^*$?
How do we conceptualize “prevention”? Do there exist strong methods of prevention that prevent training in principle, and weak methods that merely make training much more expensive? How do we quantify and provide guarantees about how expensive weak methods might be? (See the sketch after this list.)
How do we ensure generalization of solutions in cases where the exact domain is not available but some approximation is?
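As one concrete starting point on quantifying how expensive “weak” methods are, you can simply measure how much attacker compute is needed before the domain loss drops below the defender’s threshold. A minimal sketch, assuming a Hugging Face-style causal LM that returns a `.loss`; `harmful_loader`, `tau`, and the step budget are placeholder names, not from our papers:

```python
import itertools
import torch

def steps_to_break(model, harmful_loader, tau, max_steps=10_000, lr=2e-5):
    """Fine-tune `model` on the unauthorized domain and report how many optimizer
    steps (a crude proxy for attacker compute) it takes for the training loss to
    fall below the defender's threshold `tau`. Returns max_steps if it never does."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step, batch in enumerate(itertools.cycle(harmful_loader), start=1):
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
        if loss.item() < tau or step >= max_steps:
            return step

# Comparing steps_to_break on a defended checkpoint vs. an undefended baseline gives
# a first, purely empirical estimate of how much a "weak" defence raises attack cost.
```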
For each of these, there are many empirical settings in which we might want to explore training-time domain authorization. Natural language generation from LLMs is our main starting point, but there is no reason other modalities, such as vision (see for instance Sophon) or reinforcement learning, couldn’t benefit from similar investigations.
Our vision for this research direction is that we are able to find provable guarantees about the difficulty of training towards a particular domain, such that we can find a minimizer that “conditions” the model, i.e. finds a set of model weights for which any future training in that domain would be impossible, very unlikely, or expensive. This leads us to the following proposed research program for TTDA: (i) identify the theoretical dynamics of learning under SGD (for example here, though there could be many formulations); (ii) construct algorithms to find $\theta_{\mathrm{DA}}$ based on those dynamics that lead to provable guarantees; (iii) develop robust empirical settings that allow the TTDA community to evaluate their TTDA solutions.
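To give a flavour of what step (ii) can look like in practice, several existing approaches (e.g. the meta-learning defence in Self-Destructing Models) take roughly the shape of an inner/outer loop: simulate an attacker fine-tuning on the domain in an inner loop, then update the defended weights so the simulated attack makes less progress while benign performance is retained. Below is only a rough first-order sketch of that shape, our simplification rather than the exact algorithm of any paper cited here; `harmful_loader` and `benign_loader` are placeholder data loaders for a Hugging Face-style causal LM.

```python
import copy
import itertools
import torch

def find_theta_da(model, harmful_loader, benign_loader, outer_steps=1000,
                  inner_steps=5, inner_lr=2e-5, outer_lr=1e-5, retain_weight=1.0):
    """Rough inner/outer loop for constructing domain-authorized weights theta_DA.
    Inner loop: a throwaway copy of the model plays the attacker and fine-tunes on
    the unauthorized domain. Outer loop: push the defended weights so the post-attack
    domain loss stays high (first-order approximation: the attacker's gradient is
    applied back to the original parameters with its sign flipped) while the loss
    on benign retain data stays low."""
    outer_opt = torch.optim.AdamW(model.parameters(), lr=outer_lr)
    harmful_iter = itertools.cycle(harmful_loader)
    benign_iter = itertools.cycle(benign_loader)

    for _ in range(outer_steps):
        # --- inner loop: simulated attacker starting from the current weights ---
        attacker = copy.deepcopy(model)  # expensive for large LMs; fine for a sketch
        attacker_opt = torch.optim.SGD(attacker.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            batch = next(harmful_iter)
            attacker(**batch, labels=batch["input_ids"]).loss.backward()
            attacker_opt.step()
            attacker_opt.zero_grad()

        # gradient of the domain loss at the attacked weights
        harm_batch = next(harmful_iter)
        attacker(**harm_batch, labels=harm_batch["input_ids"]).loss.backward()

        # --- outer update: ascend on post-attack domain loss, descend on retain loss ---
        outer_opt.zero_grad()
        benign_batch = next(benign_iter)
        retain_loss = model(**benign_batch, labels=benign_batch["input_ids"]).loss
        (retain_weight * retain_loss).backward()
        for p, p_att in zip(model.parameters(), attacker.parameters()):
            if p_att.grad is not None:
                p.grad = (-p_att.grad if p.grad is None else p.grad - p_att.grad)
        outer_opt.step()
    return model
```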
Implications for AI Safety (why is this interesting?)
In a future world where models are protected in this way, we would at least be able to rest assured that models could not be trained towards specified domains. While specifying domains is an extraordinarily hard part of the alignment problem, there might be common domains that are either easier to specify or easier to reach consensus on, such as weapons development, illegal and hateful content, fraud, coding abilities, etc. At the very least, by working on TTDA we will also need to work on ways of specifying harmful and unsafe domains, which might contribute to the larger picture of value alignment in novel ways (since we are looking at the problem differently).
In general, in order to solve TTDA we will need a much better understanding of training and learning behaviour in the wild, which is an additional safety win in our opinion. Specifically for harmful domains, we will need to understand the process by which models become harmful, or retain harmful representations even after they are safety trained. This might lead to more robust and less brittle safety training.
While the regulatory and legislative conversation is complex, one outcome could also be developing tools for mandating and enforcing these defences, and for certifying that some class of models with a given capability level abides by these mandates in order to be released publicly or to provide a fine-tuning API. Our hope is to develop methods that allow not only provable guarantees of how “hard” it is to train a model towards a particular end, but also ways of independently certifying that released models are defended with some sort of TTDA.
Finally, many of the classical concerns grouped together as technical problems in AI Alignment research (see for example here), like reward hacking, are rooted in models of learning where the system has no TTDA: classical alignment algorithms assume the model can learn (a) any type of reward model and (b) any type of policy. If we are able to apply TTDA to prevent learning certain types of rewards or policies, then TTDA could potentially provide another tool for thinking about safer agentic systems.
Initial Work
We will save a full exposition of these works for future posts but we will briefly discuss two papers we have produced along these lines:
1. Immunization against harmful fine-tuning attacks: This work largely introduces a formal set of conditions under which you could say you have defended against harmful fine-tuning attacks on large language models. This is a much more constrained case of TTDA.
2. Representation noising effectively prevents harmful fine-tuning on LLMs: This work introduces a defence that fulfils the above conditions and provides an effective defence against some supervised fine-tuning attacks. We follow the general research program above and develop some initial theoretical understanding and a corresponding principled loss function, but there is still much work to be done on providing provable guarantees here.
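To make the flavour of such a defence concrete, here is a schematic of the kind of combined objective involved: keep the ordinary LM loss low on benign data, push hidden representations of harmful inputs towards noise, and ascend on the harmful LM loss. This is an illustrative simplification rather than the exact loss from the paper; in particular, the MSE-to-noise term and the weights `alpha`/`beta` below are crude stand-ins.

```python
import torch
import torch.nn.functional as F

def repnoise_style_loss(model, harmful_batch, benign_batch,
                        layer_idx=-1, alpha=1.0, beta=1.0):
    """Schematic representation-noising-style objective (illustrative weights):
    retain loss on benign data + alpha * (push harmful-input representations
    towards Gaussian noise) - beta * (harmful LM loss, i.e. gradient ascent)."""
    # (1) retain term: ordinary language-modelling loss on benign data
    retain_loss = model(**benign_batch, labels=benign_batch["input_ids"]).loss

    # (2) representation term: make hidden states on harmful inputs look like noise
    harm_out = model(**harmful_batch, labels=harmful_batch["input_ids"],
                     output_hidden_states=True)
    hidden = harm_out.hidden_states[layer_idx]        # (batch, seq_len, hidden_dim)
    noise_loss = F.mse_loss(hidden, torch.randn_like(hidden))

    # (3) ascent term: make the harmful data itself harder to fit
    ascent_loss = harm_out.loss

    return retain_loss + alpha * noise_loss - beta * ascent_loss
```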
Our current work focuses on shoring up (2), including adding many more attacks such as reverse-DPO, latent vector attacks, PEFT attacks, more inference-time attacks, backdoors, etc. After that, our intention is to explore RL modalities and shore up the theoretical work we started in (2), Appendix 1, to start thinking about optimal ways of minimizing the likelihood of training trajectories in a loss landscape such that we can develop guarantees over TTDA. To do this, we are currently looking at different funding opportunities and collaborations so we can sustain the project.
Questions/Feedback/Excitement/Help
Feel free to leave feedback here or reach out to me directly (domenic (dot) rosati (at) dal.ca). We are pretty excited about this research direction and are looking for support in the following ways if anyone is interested:
(1) Participation in our research group is certainly welcome and open; just email me. Participation doesn’t have to be technical: we are interested in the conceptual and social implications of this work if anyone has an interest in collaborating there. We are also open to supporting collaborations with folks who might want to lead specific investigations here. Our view is that there are so many potential ways of investigating this topic that the more people we can encourage to perform diverse and disparate lines of work here, the better.
(2) Ideas for funding support or partnerships: e.g. we would like to scale up the method in Paper 2 to an industrial, robust use case (we are not empirically sure how good the method can really be due to its hyperparameter sensitivity) but for lack of funds and partnerships we currently can’t.
(3) General or specific criticism; we are very open to this and happy to receive it.
Acknowledgements
There are many people who are involved at various levels in this project, so it’s hard to thank them all. First and foremost, all the co-authors on our papers put in a lot of hard work to make Papers 1 and 2 a reality and AISC a joy. Other folks who provided invaluable early guidance through discussions include Alan Chan, Simon Lerman, Henning Bartsch, and Ole Jorgensen. Finally, we’d like to acknowledge the growing Canadian AI Safety ecosystem (especially https://aigs.ca/, where we gave a talk at the Toronto reading group which inspired a lot of the work on Paper 2) and even mainstream Canadian research entities who are increasingly open to funding work that takes the implications of AI x-risk seriously; we call out Dalhousie University, Vector Institute, and the Killam Foundation specifically for funding this work.
What attack budget are you imagining defending against?
Rosati 2024 looks at fine-tuning for 1 epoch on 10k samples, which is a tiny attack budget relative to pretrain. If your threat model is the open source community unlocking HHH models, then the attack budget could be at least $1M, maybe much more. If the threat model is China or large terrorist groups, then you should probably be looking at a budget closer to 1%-10% of the cost of training a model from scratch. I have thought about defending against the latter threat, and I don’t see a promising path towards making it hard for such well-funded attackers to fine-tune LLMs (including hard to fine-tune in general, not just domain-specifically).
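For a rough sense of the scale gap, here is the kind of back-of-envelope comparison I have in mind; all numbers are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope comparison of attack compute vs. pretraining compute, using the
# standard ~6 * params * tokens FLOP estimate. All numbers here are assumptions.
params = 7e9                  # assume a 7B-parameter model
pretrain_tokens = 2e12        # assume ~2T pretraining tokens
pretrain_flops = 6 * params * pretrain_tokens                     # ~8.4e22

attack_samples = 10_000       # 1 epoch on 10k samples
tokens_per_sample = 512       # assumed average sequence length
attack_flops = 6 * params * attack_samples * tokens_per_sample    # ~2.2e17

print(f"attack / pretrain compute: {attack_flops / pretrain_flops:.1e}")
# ~3e-6, i.e. many orders of magnitude below the 1%-10% budgets discussed above.
```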
Thanks for pointing this out; I think it’s a critical point.
I’m not imagining anything in particular (and yes, in that paper we very much do “baby”-sized attacks).
Generally, yes, this is a problem we need to work out: what is the relationship between a defence’s strength and the budget an attacker would need to overcome it, and, for large groups that have the budget for training from scratch, would defences like this even make an impact?
I think you’re right that large-budget groups who can just train from scratch would simply not be impacted by defences of this nature. While that seems to really poo-poo this whole endeavour, I think it’s still promising and valuable, as you point out, to want to prevent this from happening at all budgets that preclude training from scratch.
A scenario that may help with thinking about this endeavour still being valuable is:
Maybe we are talking about trillion/billion-dollar models in the future, where compute governance allows us to trace whether training from scratch is occurring somewhere, in which case defences that approach this limit on attacker spend are indeed quite valuable.
I agree that this is a very important area of research. In fact, I work on this problem myself.
Some points:
I didn’t get from the paper alone what $I$ refers to. Maybe a quick definition in the paper would be nice.
I think it would be good to compare against the Vaccine algorithm from Huang et al. (“Vaccine: Perturbation-aware alignment for large language model”) since they are essentially trying to solve the same problem. I’m not affiliated with this paper, but I did a private reference implementation as a huggingface trainer. Lmk if you are interested and I can send you the code.
I think it would be useful to get the code for this work, as many implementation details seem to be missing from the paper (e.g. on my skim I didn’t find the batch-size which you used for training). This would be very helpful for me, because as I said I work on the same problem.
Thanks for reaching out; this is all great feedback.
That we will definitely address. I will DM you for the Vaccine implementation, as we are currently working on this as well, and to see what would be useful for code sharing, since we are a wee bit away from having a shareable replication of the whole paper.
Some answers
Oh whoops, it should be clearer that this is the mutual information measure. If there is something more specific you are looking for here, let me know, as we do mention it several times (I think!). In case it helps: mutual information is always an abstract measure or property in the paper, used to show we minimize Achilles’s transition probability; that gets measured indirectly through MMD or gradient magnitude.
Yes, as mentioned, we are actively working on it; your implementation would surely be valuable. Security Vectors was just what was ready by the NeurIPS deadline, is all, lol.
Ah yes! Thanks for pointing this out; there is lots to say about batch size when using MMD. The batch sizes were always 4 (which for paired refusals is 8, I suppose!); we will make sure this is not missing from the paper, sorry about that.
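For anyone following along, this is roughly what a batch-wise MMD estimate between two sets of representations looks like; an RBF-kernel sketch with an illustrative bandwidth, not necessarily the exact estimator or bandwidth from the paper:

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Biased batch-wise estimate of squared MMD between representation batches
    x: (n, d) and y: (m, d) under an RBF kernel. With batches of 4 (8 for paired
    refusals) this is a very small-sample estimate, hence the batch-size sensitivity."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```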
I’ll follow up privately, but feel free to respond here as well for additional clarification; your comment is much appreciated.