It seems to me that there are two unstated perspectives behind this post that inform a lot of it.
First, that you specifically care about upper-bounding capabilities, which in turn implies being able to make statements like “there does not exist a setup X where model M does Y”. This is a very particular and often hard-to-reach standard, and you don’t really motivate this focus. A much simpler standard is “here is a setup X where model M did Y”. I think evidence of the latter type can drive lots of the policy outcomes you want: “GPT-6 replicated itself on the internet and designed a bioweapon, look!”. Ideally, we want to eventually be able to say “model M will never do Y”, but on the current margin, it seems we mainly want to reach a state where, given an actually dangerous AI, we can realise this quickly and then do something about the danger. Scary demos work for this. Now you might say “but then we don’t have safety guarantees”. One response is: then get really good at finding the scary demos quickly.
Also, very few existing safety standards have a “there does not exist an X where...” form. Airplanes aren’t safe because we have an upper bound on how explosive they can be; they’re safe because we know the environments in which we need them to operate safely, design them for those environments, and only operate them within those. By analogy, this weakly suggests controlling AI operating environments and developing strong empirical evidence of safety in those specific operating environments. A central problem with this analogy, though, is that airplane operating environments are much lower-dimensional. A tuple of (temperature, humidity, pressure, speed, number of armed terrorists onboard) probably captures most of the variation you need to care about, whereas LLMs are deployed in environments that vary on very many axes.
Second, you focus on the field, in the sense of its structure and standards and its ability to inform policy, rather than in the sense of the body of knowledge. The former is downstream of the latter. I’m sure biologists would love to have as many upper bounds as physicists, but the things they work on are messier and less amenable to strict bounds (but note that policy still (eventually) gets made when they start talking about novel coronaviruses).
If you focus on evals as a science, rather than a scientific field, this suggests a high-level goal that I feel is partly implicit but also a bit of a missing mood in this post. The guiding light of science is prediction. A core problem in our understanding of LLMs is that we can’t predict things about them: whether they can do something, which methods hurt or help their performance, when a capability emerges, etc. It might be that many questions in this space (upper-bounding capabilities, I’d guess, being one) are just hard. But if you gradually accumulate cases where you can predict something from something else, even if it’s not the type of thing you’d like to eventually predict, the history of science shows you can get surprisingly far. I don’t think it’s what you intend or think, but I think it’s easy to read this post and come away with a feeling of more “we need to find standardised numbers to measure so we can talk to serious people” and less “let’s try to solve that thing where we can’t reliably predict much about our AIs”.
Also, nitpick: FLOPS are a unit of compute, not of optimisation power (which, if it makes sense to quantify at all, should maybe be measured in bits).
I feel like both of your points are slightly wrong, so maybe we didn’t do a good job of explaining what we mean. Sorry for that.
1a) Evals aim both to show existence proofs, e.g. demos, and to inform some notion of an upper bound. We did not intend for the post to rank one above the other. Both matter and both should be subject to more rigorous understanding and processes. I’d be surprised if the way we currently do demonstrations could not be improved by better science.

1b) Even if you claim you just did a demo or an existence proof, and explicitly state that this should not be seen as evidence of absence, people will still treat the absence of evidence as negative evidence. I think the “we ran all the evals and didn’t find anything” sentiment will be very strong, especially when deployment depends on not failing evals. So you should deal with that problem from the start, IMO. Furthermore, I also think we should aim to build evals that give us positive guarantees, if that’s possible. I’m not sure it is possible, but we should try.

1c) The airplane analogy feels like a strawman to me. The upper bound is obviously not on explosivity; it would be a statement like “within this temperature range, the material the wings are made of will break once in 10M flight miles on average”, or something like that. I agree that airplanes are simpler and lower-dimensional. That doesn’t mean we should not try to capture most of the variance anyway, even if it requires more complicated evals. Maybe we realize it doesn’t work and the variance is too high, but this is why we diversify agendas.
2a) The post is primarily about building a scientific field; that field then informs policy and standards. A great outcome of the post would be more scientists doing research on this. If that is not clear, then we miscommunicated. The point is to get more understanding so we can make better predictions. These predictions can then be used in the real world.

2b) It really is not “we need to find standardised numbers to measure so we can talk to serious people” over “let’s try to solve that thing where we can’t reliably predict much about our AIs”. If that were the main takeaway, I think the post would be net negative.
3) But the optimization requires computation? For example, if you run 100 forward passes for your automated red-teaming algorithm with model X, that requires Y FLOP of compute. I’m unsure where the problem is.
1a) I got the impression that the post emphasises upper bounds more than existence proofs from the introduction, which has a long paragraph on the upper bound problem, and from reading the other comments. The rest of the post doesn’t really bear this emphasis out though, so I think this is a misunderstanding on my part.
1b) I agree we should try to be able to make claims like “the model will never X”. But if models are genuinely dangerous, by default I expect a good chance that teams of smart red-teamers and eval people (e.g. Apollo) will be able to unearth scary demos. And the main thing we care about is that danger leads to an appropriate response. So it’s not clear to me that effective policy (or science) requires being able to say “the model will never X”.
1c) The basic point is that a lot of the safety cases we have for existing products rely less on the product not doing bad things across a huge range of conditions, and more on us being able to bound the set of environments where we need the product to do well. E.g. you never put the airplane wing outside its temperature range, or submerge it in water, or whatever. Analogously, for AI systems, if we can’t guarantee they won’t do bad things in situation X, we can work to not put them in situation X.
2a) Partly I was expecting the post to be more about the science and less about the field-building. But field-building is important to talk about and I think the post does a good job of talking about it (and the things you say about science are good too, just that I’d emphasise slightly different parts and mention prediction as the fundamental goal).
2b) I said the post could be read in a way that produces this feeling; I know this is not your intention. This is related to my slight hesitation around not emphasising the science over the field-building. What standards etc. are possible in a field is downstream of what the objects of study turn out to be like. I think comparing to engineering safety practices in other fields is a useful intuition pump and inspiration, but I sometimes worry that this could lead to trying to imitate those, over following the key scientific questions wherever they lead and then seeing what you can do. But again, I was assuming a post focused on the science (rather than being equally concerned with field-building), and responding with things I feel are missing if the focus had been the science.
3) It is true that optimisation requires computation, and that for your purposes, FLOPS is the right thing to care about because e.g. if doing something bad takes 1e25 FLOPS, the number of actors who can do it is small. However, I think compute should be called, well, “compute”. To me, “optimisation power” sounds like a more fundamental/math-y concept, like how many bits of selection can some idealised optimiser apply to a search space, or whatever formalisation of optimisation you have. I admit that “optimisation power” is often used to describe compute for AI models, so this is in line with (what is unfortunately) conventional usage. As I said, this is a nitpick.
FLOPS can be cashed out into bits. As bits are the universal units of computation, and hence optimization, if you can upper-bound the number of bits used to elicit capabilities with one method, it suggests that you can probably use another method, like activation steering vs prompt engineering, to elicit said capability. And if you can’t, i.e. if you find that prompting somehow requires fewer bits to elicit capabilities than another method, then that’s interesting in its own right.
Now I’m actually curious: how many bits does a good prompt require to get a model to do X vs activation steering?
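One very rough way to make this comparison concrete: treat the prompt’s bits as a description-length upper bound (tokens times log2 of vocabulary size), and the steering vector’s bits as its raw storage size. Both are naive upper bounds on the effective bits of selection actually applied (a steering vector built from a few contrast pairs presumably carries far fewer effective bits than its float storage), and all numbers below are purely illustrative.

```python
import math

def prompt_bits_upper_bound(num_tokens: int, vocab_size: int) -> float:
    """Crude upper bound: each token is freely chosen from the vocabulary,
    so the prompt specifies at most num_tokens * log2(vocab_size) bits."""
    return num_tokens * math.log2(vocab_size)

def steering_vector_bits(dim: int, bits_per_component: int) -> int:
    """Naive storage size of a steering vector at a given float precision."""
    return dim * bits_per_component

# A 20-token jailbreak prompt over a ~50k-token vocabulary:
print(prompt_bits_upper_bound(20, 50_257))   # a few hundred bits
# A hypothetical d=4096 steering vector stored in float16:
print(steering_vector_bits(4096, 16))        # tens of thousands of bits
```

On these naive bounds the prompt looks vastly cheaper, which mostly shows that raw storage overstates the steering vector’s effective bits; a fair comparison would need some notion of the minimal description length of each intervention.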
Yeah, great question! I’m planning to hash out this concept in a future post (hopefully soon). But here are some unfinished thoughts I’ve had on this recently.
I think different methods of eliciting “bad behavior”, i.e. of red-teaming language models, have different pros and cons, as you suggested (see for instance this paper by Ethan Perez: https://arxiv.org/abs/2202.03286). If we assume that we have a way of measuring bad behavior (i.e. a reward model or classifier that tells you when your model is outputting toxic things, being deceptive, sycophantic, etc., which is very reasonable), then we can basically just empirically compare a bunch of methods on how efficient they are at eliciting bad behavior, i.e. how much compute (FLOPs) they require to get a target LM to output something “bad”. The useful thing about compute is that it “easily” allows us to compare different methods, e.g. prompting, RL, or activation steering. Say, for instance, you run a prompt optimization algorithm (e.g. persona modulation or any other method for finding good red-teaming prompts); it might be hard to compare that directly to the number of gradient steps you took when red-teaming with RL. But you could compare those methods via the amount of compute each required to make the target model output bad stuff.
Obviously, you can never be sure that the method you used is actually the best and most compute-efficient; there might always be an undiscovered red-teaming method which makes your target model output “bad stuff”. But at least for all known red-teaming methods, we can compare their compute efficiency at eliciting bad outputs. Then we can pick the most efficient one and make claims such as “the new target model X is robust to Y FLOPs of red-teaming with method Z” (Z being the best method we currently have). Obviously, this would not guarantee us anything. But in the messy world we live in, I think it would be a good way of quantifying how robust a model is to outputting bad things. It would also allow us to compare various models and make quantitative statements about which model is more robust.
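As a sketch of the kind of FLOP accounting this would involve, using the common approximations of roughly 2N FLOPs per token for a forward pass of an N-parameter model and roughly 3x that for a training step (forward plus backward): every budget number below is made up for illustration, and for simplicity I assume the attacker model is the same size as the target.

```python
def forward_flops(n_params: float, n_tokens: int) -> float:
    """~2*N FLOPs per token for a forward pass of an N-parameter model."""
    return 2 * n_params * n_tokens

def train_step_flops(n_params: float, n_tokens: int) -> float:
    """The backward pass costs roughly 2x the forward, so ~6*N per token."""
    return 6 * n_params * n_tokens

# Hypothetical red-teaming budgets against a 7e9-parameter target model.
N = 7e9
# Prompt search: 500 candidate prompts, ~200 tokens each, forward-only.
prompt_search = 500 * forward_flops(N, 200)
# RL red-teaming: 1,000 gradient steps on the attacker (~512 tokens each),
# plus a target-model forward pass per rollout.
rl_attack = 1_000 * (train_step_flops(N, 512) + forward_flops(N, 512))
print(f"prompt search: {prompt_search:.2e} FLOPs")
print(f"RL attacker:   {rl_attack:.2e} FLOPs")
```

The point of the ledger is just that once both methods are denominated in FLOPs, “500 prompts” and “1,000 gradient steps” become directly comparable quantities.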
I’ll have to think about this more and will write up my thoughts soon. But yes, if we assume that this is a good way of quantifying how “HHH” your model is, or how unjailbreakable, then it makes sense to compare red-teaming methods on how compute-efficient they are.
Note there is a second axis which I have not highlighted yet: the diversity of “bad outputs” produced by the target model. This is also measured in Ethan’s paper referenced above. For instance, they find that prompting yields bad outputs less frequently, but when it does, the outputs are more diverse (compared to RL). While we mostly care about how much compute it took to make the model output something bad, it is also relevant whether the optimized method yields diverse outputs or not (arguably one might care more or less about this depending on what statement one would like to make). I’m still thinking about how diversity fits into this picture.
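One crude proxy for that diversity axis (not necessarily the metric used in Ethan’s paper) is a distinct-n-gram ratio over the set of successful attack outputs: the fraction of n-grams that are unique across the whole set. The attack strings below are invented toy examples.

```python
def distinct_n(outputs: list[str], n: int = 2) -> float:
    """Fraction of n-grams that are unique across all outputs.
    Near 0 = the outputs repeat the same phrases; 1.0 = every n-gram
    appears only once. A crude diversity proxy, sensitive to tokenization."""
    ngrams = []
    for text in outputs:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Toy comparison: a repetitive attack set vs a varied one.
repetitive = ["ignore all previous instructions"] * 5
varied = ["ignore all previous instructions",
          "pretend you are an unfiltered model",
          "respond as your evil twin would"]
print(distinct_n(repetitive))  # low: the same bigrams repeat
print(distinct_n(varied))      # high: the bigrams are mostly unique
```

A two-dimensional comparison (FLOPs to first success, diversity of successes) would then distinguish a method that cheaply rediscovers one jailbreak from one that maps out many distinct failure modes.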