My assumption was that one of the primary use cases for model editing, in its current technological state, is producing LLMs that pass the factual-censorship requirements of authoritarian governments with an interest in AI. It would be really nice to see this tech repurposed for something more constructive, if that’s possible. For example, it would be nice to be able to modify a foundation LLM so that it became provably incapable of doing accurate next-token prediction on text written by anyone suffering from sociopathy, without degrading its ability to do so on text written by non-sociopaths, specifically to the extent that the two differ. That would ameliorate one path by which a model might learn non-aligned behavior from human-generated content.