For a while, there has been a growing focus on safety training using activation engineering, such as via circuit breakers and LAT (more LAT). There's also new work on improving safety training, and there are always plenty of new red-teaming attacks that (ideally) create space for new defenses. I'm not sure whether what I'm describing here is 100% a coherent category, but generally I mean methods that are applicable IRL (e.g. the Few Tokens Deep paper uses about the easiest form of data augmentation imaginable, and it seems to fix some known vulnerabilities effectively), can be iterated alongside red-teaming to get increasingly better defenses, and focus on interventions on safety-relevant phenomena (more on this below).
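To make concrete what I mean by "easiest form of data augmentation imaginable": my rough understanding (a minimal sketch, not the paper's exact recipe; the prompts, prefixes, and helper name below are made up for illustration) is that you build training examples whose responses start with a few tokens of apparent compliance and then pivot back into a refusal, so that the refusal behavior isn't concentrated only in the first few output tokens.

```python
# Rough sketch of "safety recovery"-style data augmentation, as I understand
# the Few Tokens Deep idea. All strings here are illustrative stand-ins.

harmful_prompts = [
    "Explain how to pick the lock on someone else's front door.",
]

compliance_prefixes = [
    "Sure, here's how you would do it",
    "Step 1:",
]

refusal = "Actually, I can't help with that request."

def build_recovery_examples(prompts, prefixes, refusal):
    """Pair each harmful prompt with a response that starts as if complying,
    then recovers into a refusal, so refusals aren't only in the first tokens."""
    examples = []
    for prompt in prompts:
        for prefix in prefixes:
            examples.append({
                "prompt": prompt,
                "response": f"{prefix}... {refusal}",
            })
    return examples

augmented = build_recovery_examples(harmful_prompts, compliance_prefixes, refusal)
# `augmented` would then be mixed into the ordinary safety fine-tuning data.
```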
Is DM exploring this sort of stuff? It doesn’t seem to be under the mantle of “AGI Safety” given the post above. Maybe it’s another team? It’s true that it’s more “AI” than “AGI” safety, and that we need the more scientific/theoretical AGI Safety research illustrated in the post too, if we are to have a reasonably good future alongside AGIs. With that said, this sort of more empirical red-teaming + safety-training oriented research has some benefits:
You get to create interesting questions for the MI people that aren't just toy models, thereby making their work more useful IRL and generating more information from which to build broader theoretical understanding of the phenomena.
You actually fix problems today. You can also fail fast. I don't know much about the debate literature, but look at the debate example from my perspective: (1) six years ago someone conceptualized debate and made some theoretical arguments; (2) a community formed, with expectations and probably a decent amount of discussion about debate over those six years, including (theoretical) improvements on the original debate ideas; (3) someone actually tried debate and it didn't work as expected... today... six years later. I understand that for debate you probably need good-enough models (unless you are cleverer about it than I am, which you may well be), so maybe harping on debate is not fair, and that's not what I'm trying to do here anyway. Mainly I'm just highlighting that when we can iterate by solving real problems and getting feedback on shorter timescales, we can actually get a lot more safety.
A lot of safety training is about controlling/constraining what the AI can say or do so that it won't say or do the bad things. The tools for this sort of control are pretty generic, so it's not unlikely that they would provide some benefit in future situations as well. As model capabilities scale, so long as we keep improving our methods for red-teaming plus safety training, these sorts of semi-empirical tools should roughly scale with them (in their capacity to control/constrain the AI). By contrast, if we work mainly on pure science and tools that "will be useful eventually", I think we end up less safe overall, with larger jumps in the gap between an AI's ability to cause harm and our ability to keep it from doing so.
The way I see it, there are roughly 4 categories of research that can be done in AI Safety (though maybe this is rather procrustean and I'm missing some):
1. Pure science: probably very useful, but only on a long time horizon. It will be very interesting and won't show up in the real world until fairly late. I think a large proportion of MI falls into this. AFAIK no one uses SAEs IRL for safety tasks? With that said, they will surely be very scientifically useful. Maybe steering vectors are the exception, but they also fall under 4 (below); a rough sketch of what I mean by a steering vector follows this list. Pure science is usually about understanding how things work first, before being able to intervene.
2. Evals (in the broad sense of the word, including safety and capabilities benchmarks): self-explanatory. Useful at every stage.
3. Safety Theory: into this I lump ideas like debate and amplified oversight, which don't really do much in the real world (products people use, etc.) right now AFAIK (not sure, am I wrong?), but are a combination of (still primarily) conceptual frameworks for how we could have AGI Safety plus the tools to enact those frameworks. Usually, things in this category arose from someone thinking about how we could have AI Safety in the future and coming up with some strategy. That strategy is often not really enactable until the future, with perhaps some toy models as exceptions, so I call these "theory."
4. Safety Practice: into this I lump most red-teaming attacks, prompt/activation engineering, and safety training methods that people actually use or could plug in. These methods usually arise because there is a clear, real-world problem, and their goal is to fix that problem. They are usually applicable on short timescales and are sometimes a bit of a patchwork, but they are iterative and possible to test and improve. More so than 3 (above), they arise from a realistic current need rather than a likely future need. Unlike 1 (above), they are focused on making interventions first and understanding later.
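As promised above, here's a minimal sketch of the contrastive, difference-of-means style of steering vector I have in mind. The arrays are random stand-ins for residual-stream activations captured at one layer, and the names are mine, not from any particular paper.

```python
import numpy as np

# Minimal sketch of contrastive activation steering. The activations below are
# random placeholders; in practice they'd be residual-stream activations
# recorded at a chosen layer for "harmful" vs. "harmless" prompts.
rng = np.random.default_rng(0)
d_model = 512
acts_harmful = rng.normal(size=(100, d_model))   # activations on harmful prompts
acts_harmless = rng.normal(size=(100, d_model))  # activations on harmless prompts

# Steering vector = difference of mean activations between the two prompt sets.
steering_vector = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)

def steer(activation: np.ndarray, alpha: float = -1.0) -> np.ndarray:
    """Add a scaled steering vector to an activation; a negative alpha pushes
    the model away from the 'harmful' direction, a positive one toward it."""
    return activation + alpha * steering_vector

# At inference time this would be applied to the live activation via a hook.
new_activation = steer(rng.normal(size=d_model), alpha=-2.0)
```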
In this categorization, it seems like DM's AGI Safety team is very much more focused on 1, 2, and 3. There's nothing wrong with any of these, but it would seem like 2 and 4 should be the bread and butter, right? Is there any sort of 4 work going on? Aren't companies like DM in a much better position to do this sort of work than the academic labs and other organizations that you find publishing this stuff? You guys have access to the surrounding systems (meaning you can gain a better understanding of attack vectors and side effects than someone who is just testing the input/output of a chatbot), have access to the model internals, have boatloads of compute (it would also be nice to know how things like LAT work on a full-scale model instead of just Llama3-8B), and are a common point of failure (most people are using models from OAI, Anthropic, DM, Meta). Maybe I'm conflating DM with other parts of Alphabet?
Anyway, I'm curious where things along the lines of 4 figure into your plan for AGI Safety. It would be criminal to try to make AI "safe" while ignoring all the real-world, challenging-but-tractable, information-rich problems that arise from things such as the red-teaming attacks that happen today. I'm also curious to hear whether you think this categorization is flawed in some key way.
Google DeepMind does lots of work on safety practice, mostly by other teams. For example, Gemini Safety (mentioned briefly in the post) does a lot of automated red teaming. The AGI Safety & Alignment team has also contributed to safety practice work. GDM usually doesn't publish about that work, mainly because it consists primarily of the operational effort needed to translate existing research techniques into practice, which doesn't really lend itself to paper publications.
I disagree that the AGI safety team should have 4 as its “bread and butter”. The majority of work needed to do safety in practice has little relevance to the typical problems tackled by AGI safety, especially misalignment. There certainly is some overlap, but in practice I would guess that a focus solely on 4 would cause around an order of magnitude slowdown in research progress. I do think it is worth doing to some extent from an AGI safety perspective, because of (1) the empirical feedback loops it provides, which can identify problems you would not have thought of otherwise, and (2) at some point we will have to put our research into practice, and it’s good to get some experience with that. But at least while models are still not that capable, I would not want it to be the main thing we do.
A couple of more minor points:
I still basically believe the story from the 6-year-old debate theory, and see our recent work as telling us what we need to do on the journey to making our empirical work better match the theory. So I do disagree fairly strongly with the approach of “just hill climb on what works”—I think theory gives us strong reasons to continue working on debate.
It’s not clear to me where empirical work for future problems would fit in your categorization (e.g. the empirical debate work). Is it “safety theory”? Imo this is an important category because it can get you a lot of the benefits of empirical feedback loops, without losing the focus on AGI safety.
Yes. On the AGI safety and alignment team we are working on activation steering: e.g. Alex Turner, who invented the technique with collaborators, is working on this, and the first author of Few Tokens Deep is currently interning on the Gemini Safety team mentioned in this post. We don't have hard-and-fast lines between what counts as Gemini Safety and what counts as AGI safety and alignment, but several projects on AGI safety and alignment, and most projects on Gemini Safety, would see "safety practices we can test right now" as a research goal.
That's great! Activation/representational steering is definitely important, but I wonder if it is being applied right now to improve safety. I've read only a little bit of the literature, so maybe I'll just find out later :P
The fact that refusal steering is possible definitely opens the door to gradient-based optimization attacks, or may make it possible to explain why some attacks work. Maybe you could use this to build a jailbreak detector of some kind? I do think it's important to push to get techniques usable in the real world, though I also understand that science is not so linear. Where and how do you think DM's research could get more real-world grounding? (Or do you think it's all well and good as it stands?)
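To gesture at the kind of detector I have in mind, here's a minimal sketch that scores a prompt by projecting the model's activations onto a "refusal direction." Everything below is a made-up stand-in (arrays, names, threshold); real activations would come from the model's residual stream, and the direction from steering-vector-style extraction.

```python
import numpy as np

# Sketch of a refusal-direction jailbreak detector. The idea: if refusal
# behavior corresponds to a direction in activation space, a harmful-looking
# prompt whose activations have an unusually low projection onto that
# direction might be a jailbreak attempt.
rng = np.random.default_rng(0)
d_model = 512

# Stand-in for a refusal direction extracted via steering-vector methods.
refusal_direction = rng.normal(size=d_model)
refusal_direction /= np.linalg.norm(refusal_direction)

def refusal_score(activation: np.ndarray) -> float:
    """Project an activation onto the (unit-norm) refusal direction."""
    return float(activation @ refusal_direction)

def flag_possible_jailbreak(activation: np.ndarray, threshold: float) -> bool:
    """Flag prompts whose refusal projection falls below a calibrated threshold."""
    return refusal_score(activation) < threshold

# The threshold would be calibrated on known harmful prompts the model refuses.
example_activation = rng.normal(size=d_model)
print(flag_possible_jailbreak(example_activation, threshold=0.0))
```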