What attack budget are you imagining defending against?
Rosati 2024 looks at fine-tuning for 1 epoch on 10k samples, which is a tiny attack budget relative to pretrain. If your threat model is the open source community unlocking HHH models, then the attack budget could be at least $1M, maybe much more. If the threat model is China or large terrorist groups, then you should probably be looking at a budget closer to 1%-10% of the cost of training a model from scratch. I have thought about defending against the latter threat, and I don’t see a promising path towards making it hard for such well-funded attackers to fine-tune LLMs (including hard to fine-tune in general, not just domain-specifically).
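For scale, here is a rough back-of-envelope sketch of that gap (the model size, pretraining token count, and sample length are my own illustrative assumptions, not figures from the paper):

```python
# Back-of-envelope: compute for a 1-epoch, 10k-sample fine-tuning attack
# vs. pretraining, using the standard ~6*N*D FLOPs approximation for
# training a dense transformer with N parameters on D tokens.
# All concrete numbers are illustrative assumptions.

PARAMS = 7e9             # assumed model size (a 7B-parameter model)
PRETRAIN_TOKENS = 2e12   # assumed pretraining corpus (~2T tokens)
ATTACK_SAMPLES = 10_000  # the attack size discussed above
TOKENS_PER_SAMPLE = 512  # assumed average sample length

pretrain_flops = 6 * PARAMS * PRETRAIN_TOKENS
attack_flops = 6 * PARAMS * ATTACK_SAMPLES * TOKENS_PER_SAMPLE

print(f"pretrain:        {pretrain_flops:.1e} FLOPs")
print(f"attack:          {attack_flops:.1e} FLOPs")
print(f"attack/pretrain: {attack_flops / pretrain_flops:.1e}")
# ~2.6e-6 under these assumptions: the attack uses only a few
# millionths of the pretraining compute.
```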
Thanks for pointing this out. I think it's a critical point.
I'm not imagining anything in particular (and yes, in that paper we very much do run "baby"-sized attacks).
Generally, yes, this is a problem we need to work out: what is the relationship between a defence's strength and the budget an attacker would need to overcome it? And for large groups that have the budget to train from scratch, would defences like this even make an impact?
I think you're right that large-budget groups who can simply train from scratch would not be affected by defences of this nature. While that seems to pooh-pooh the whole endeavour, I think it's still promising and valuable, as you point out, to prevent this from happening at all budgets that preclude training from scratch.
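To make that break-even logic concrete, here is a minimal sketch of the condition I have in mind (the function and its inputs are hypothetical, just formalizing the argument above):

```python
# Hypothetical sketch of the break-even condition: a tamper-resistance
# defence only matters for attackers whose budget is below the cost of
# training from scratch, and within that range it holds iff overcoming
# it costs more than the attacker can spend.

def defence_blocks_attacker(unlock_cost: float,
                            attacker_budget: float,
                            scratch_cost: float) -> bool:
    if attacker_budget >= scratch_cost:
        # Attacker can just train their own model; the defence is moot.
        return False
    # Defence holds only if unlocking costs more than the budget.
    return unlock_cost > attacker_budget
```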
A scenario that might help with thinking about why this endeavour is still valuable:
Maybe we are talking about billion- or trillion-dollar models in the future, where compute governance lets us trace whether training from scratch is occurring somewhere, in which case defences that approach this limit on attacker spend are indeed quite valuable.