alexlyzhov comments on AI interpretability could be harmful?

alexlyzhov 11 May 2023 0:17 UTC
2 points
0
Vaguely related paper: “Self-Destructing Models: Increasing the Costs of Harmful Dual Uses in Foundation Models” is an early attempt to prevent models from being re-purposed via fine-tuning.
It doesn’t seem like a meaningfully positive result. For example, all their plots track finetuning on at most 200 examples, and I imagine they might even have had clear negative results in conditions with more than 200 examples available for finetuning. After 50-100 examples, the gap between normal finetuning and finetuning from random init, though still small, grows fast. There are also no plots with finetuning iterations on the x-axis. And when they optimize for “non-finetunability”, they don’t aim to maintain language modeling performance; instead, they only impose the constraint of “maintaining finetunability” on one downstream “professions detection task”.
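To make the last point concrete, here is a rough schematic of the kind of objective described above. The notation is my own shorthand and an assumption, not the paper's formulation: $\theta$ is the pretrained initialization, $\mathrm{FT}_k(\theta, D)$ is the result of finetuning $\theta$ on $k$ examples from dataset $D$, $\lambda > 0$ is a trade-off weight, and "prof" and "harm" stand for the professions detection task and the task being blocked.

```latex
\min_{\theta} \;
  \underbrace{\mathcal{L}_{\text{prof}}\big(\mathrm{FT}_k(\theta, D_{\text{prof}})\big)}_{\text{keep this one task finetunable}}
  \;-\; \lambda \,
  \underbrace{\mathcal{L}_{\text{harm}}\big(\mathrm{FT}_k(\theta, D_{\text{harm}})\big)}_{\text{make the blocked task hard to finetune}}
```

Under this reading, nothing in the objective resembles a language modeling term $\mathcal{L}_{\text{LM}}(\theta)$, so general capabilities are free to degrade, and the reported plots only cover $k \le 200$.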
I expect naive solutions to continue to work very poorly on this problem.
- Roman Leventov 11 May 2023 7:56 UTC
  1 point
  0
  I wonder whether GFlowNets are somehow better suited for self-destruction/non-finetunability than LLMs.