StellaAthena comments on Anomalous tokens reveal the original identities of Instruct models

StellaAthena 9 Feb 2023 4:34 UTC
9 points
This is excellent work, though I want to recommend general caution about inferring the success of such attacks from black-box evaluations alone. A thorough analysis of false positive and false negative rates with ground-truth access (ideally in an adversarially developed setting) is essential for validation. [Sidebar: this reminds me that I really need to write up my analysis from the EleutherAI Discord showing why prompt extraction attacks can be untrustworthy.]
That said, this is really excellent work and I agree it looks quite promising.
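To make the validation point concrete, here is a minimal sketch of the kind of check I have in mind. It assumes you control a set of models whose true lineage you know, and that you have recorded what the attack concluded for each one. The names `Trial` and `error_rates` are hypothetical and purely illustrative; they are not from the post or from any library.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One evaluation of the attack against a model with known ground truth."""
    attack_claims_match: bool  # what the black-box attack concluded
    truly_matches: bool        # ground truth, known only to the evaluator


def error_rates(trials):
    """Return (false_positive_rate, false_negative_rate) over the trials."""
    false_pos = sum(t.attack_claims_match and not t.truly_matches for t in trials)
    false_neg = sum(not t.attack_claims_match and t.truly_matches for t in trials)
    negatives = sum(not t.truly_matches for t in trials)
    positives = sum(t.truly_matches for t in trials)
    # A rate is undefined when there are no cases of that class to measure.
    fpr = false_pos / negatives if negatives else float("nan")
    fnr = false_neg / positives if positives else float("nan")
    return fpr, fnr


# Hypothetical results, made up for illustration only.
trials = [
    Trial(attack_claims_match=True, truly_matches=True),
    Trial(attack_claims_match=True, truly_matches=False),
    Trial(attack_claims_match=False, truly_matches=True),
    Trial(attack_claims_match=False, truly_matches=False),
    Trial(attack_claims_match=True, truly_matches=True),
]
fpr, fnr = error_rates(trials)
print(f"FPR={fpr:.2f} FNR={fnr:.2f}")
```

Without the `truly_matches` column, which a purely black-box evaluation cannot supply, neither rate can be computed, and that is the gap I am cautioning about.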