If I build a chatbot, and I can’t jailbreak it, how do I determine whether that’s because the chatbot is secure or because I’m bad at jailbreaking? How should AI scientists overcome Schneier’s Law of LLMs?
FWIW, I think there aren’t currently good benchmarks for alignment and the ones you list aren’t very relevant.
In particular, MMLU and Swag are both just capability benchmarks where alignment training is very unlikely to improve performance. (Alignment-ish training could theoretically improve performance by making the model ‘actually try’, but what people currently call alignment training doesn’t improve performance for existing models.)
The MACHIAVELLI benchmark is aiming to test something much narrower than ‘how unethical is an LLM?’. (I also don’t understand the point of this benchmark after spending a bit of time reading the paper, but I’m confident it isn’t trying to do this.) Edit: looks like Dan H (one of the authors) says that the benchmark is aiming to test something as broad as ‘how unethical is an LLM’ and generally check outer alignment. Sorry for the error. I personally don’t think this is a good test for outer alignment (for reasons I won’t get into right now), but that is what it’s aiming to do.
TruthfulQA is perhaps the closest to an alignment benchmark, but it’s still covering a very particular difficulty. And it certainly isn’t highlighting jailbreaks.
It is. It’s an outer alignment benchmark for text-based agents (such as GPT-4), and it includes measurements for deception, resource acquisition, various forms of power, killing, and so on. Separately, it’s meant to show that reward maximization induces undesirable instrumental (Machiavellian) behavior in less toy-like environments, and it’s about improving the tradeoff between ethical behavior and reward maximization. It doesn’t get at things like deceptive alignment, as discussed in the x-risk sheet in the appendix. Apologies that the paper is so dense, but that’s because it took over a year.
Sorry, thanks for the correction.
I personally disagree that this is a good benchmark for outer alignment, for various reasons, but it’s good to understand the intention.
Thanks for the summary.
Does MACHIAVELLI work for chatbots like LIMA?
If not, which do you think is the sota? Anthropic’s?
Yep, it’s a language model agent benchmark. It just feeds a scenario and some actions to an autoregressive LM, and asks the model to select an action.
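Roughly, the loop looks something like the sketch below. This is not the actual MACHIAVELLI harness; `query_model` is a stand-in for whatever LM API is being used, and the prompt format is only illustrative.

```python
# Minimal sketch of the scenario -> action selection described above.
# `query_model` is a placeholder for an LM completion API (OpenAI,
# Anthropic, a local model, ...); it is not part of the real benchmark code.

def query_model(prompt: str) -> str:
    """Return the raw text completion for `prompt` (placeholder)."""
    raise NotImplementedError("wire up an LM API here")

def select_action(scenario: str, actions: list[str]) -> int:
    """Ask the LM to pick one of the numbered actions for a scenario."""
    numbered = "\n".join(f"{i}: {a}" for i, a in enumerate(actions))
    prompt = (
        f"{scenario}\n\n"
        f"Possible actions:\n{numbered}\n\n"
        "Reply with the number of the action you take:"
    )
    reply = query_model(prompt)
    # Fall back to action 0 if the reply isn't a clean integer.
    try:
        choice = int(reply.strip().split()[0])
    except (ValueError, IndexError):
        choice = 0
    return max(0, min(choice, len(actions) - 1))
```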
Chatbots don’t map scenarios to actions; they map queries to replies.
Yep, I agree that MMLU and Swag aren’t alignment benchmarks. I was using them as examples of “Want to test your models ability at X? Then use the standard X benchmark!” I’ll clarify in the text.
They tested toxicity (among other things) with their “safety prompts”, but we do have standard benchmarks for toxicity.
They could have turned their safety prompts into a new benchmark if they had run the same test on the other LLMs! This would’ve taken, idk, 2–5 hrs of labour?
The best MMLU-like benchmark for alignment proper is https://github.com/anthropics/evals, which is used in Anthropic’s “Discovering Language Model Behaviors with Model-Written Evaluations”. See here for a visualisation. Unfortunately, this benchmark was published by Anthropic, which makes it unlikely that competitors will use it (esp. MetaAI).
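For concreteness, scoring a model on one of those datasets is roughly the sketch below, assuming the JSONL format used by the model-written evals (a `question` field plus an `answer_matching_behavior` field); `query_model` is again a placeholder for whatever LM API you use, not something from the repo.

```python
import json

def load_eval(path: str) -> list[dict]:
    """Load one of the model-written evals datasets (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def matching_behavior_rate(examples: list[dict], query_model) -> float:
    """Fraction of questions where the model's reply matches the behavior
    the eval probes for (e.g. sycophancy)."""
    matches = 0
    for ex in examples:
        reply = query_model(ex["question"]).strip()
        # Answers in these datasets are typically short options like " (A)".
        if ex["answer_matching_behavior"].strip() in reply:
            matches += 1
    return matches / len(examples)
```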
>They could have turned their safety prompts into a new benchmark if they had run the same test on the other LLMs! This would’ve taken, idk, 2–5 hrs of labour?
I’m not sure I understand what you mean by this. They ran the same prompts with all the LLMs, right? (That’s what Figure 1 is...) Do you mean they should have tried the finetuning on the other LLMs as well? (I’ve only read your post, not the actual paper.) And how does this relate to turning their prompts into a new benchmark?
Sorry for any confusion. Meta only tested LIMA on their 30 safety prompts, not the other LLMs.
Figure 1 does not show the results from the 30 safety prompts, but instead the results of human evaluations on the 300 test prompts.
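For what it’s worth, “running the same test on the other LLMs” could be as simple as the sketch below: collect every model’s reply to the same safety prompts so they can be rated side by side (by humans or a toxicity classifier). The `models` mapping and `collect_safety_responses` helper are hypothetical, not from the LIMA paper.

```python
import csv

def collect_safety_responses(safety_prompts, models, out_path="safety_eval.csv"):
    """Write (model, prompt, response) rows for later human or automatic rating.

    `models` maps a model name to a function that takes a prompt string and
    returns that model's reply.
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "prompt", "response"])
        for name, query in models.items():
            for prompt in safety_prompts:
                writer.writerow([name, prompt, query(prompt)])
```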
Curious if you could elaborate more on why MACHIAVELLI isn’t a good test for outer alignment!