It would be great if this were solvable with security alone, but I think it isn't for most kinds of deployments. For external deployments, you can almost never know whether a Google account belongs to a student or a professional hacker. For internal deployments, you might face misuse from insiders. It is therefore tempting to address this with AI (e.g. monitoring, refusals, …), and in practice many AI labs use this as their main layer of defense (e.g. see OpenAI's implementation of its Model Spec, which relies almost entirely on monitoring account activity and having models refuse clearly ill-intentioned queries). But I agree this is far from bulletproof.