>They could have turned their safety prompts into a new benchmark if they had run the same test on the other LLMs! This would’ve taken, idk, 2–5 hrs of labour?
I’m not sure I understand what you mean by this. They ran the same prompts with all the LLMs, right? (That’s what Figure 1 shows...) Do you mean they should have tried the finetuning on the other LLMs as well? (I’ve only read your post, not the actual paper.) And how does this relate to turning their prompts into a new benchmark?
Sorry for any confusion. Meta only tested LIMA on their 30 safety prompts, not the other LLMs.
Figure 1 does not show the results from the 30 safety prompts; it shows the results of the human evaluations on the 300 test prompts.