So note that we’re actually working on the predicate “an injury occurred or was exacerbated”, rather than something about violence (I edited out the one place I referred to violence instead of injury in the OP to make this clearer).
The reason I’m not that excited about finding this latent is that I suspect the snippets that activate it are particularly easy cases, and we’re only interested in generating injurious snippets that the classifier is wrong about.
For example, I think the model is currently okay with dropping babies, probably because it doesn’t really think of this as an injury occurring, so I wouldn’t expect us to find an example like this by looking for things that maximize the injury latent. I suspect that most of the problem here is finding things that are injurious but don’t activate the injury latent, rather than things that do.
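To make the distinction concrete, here is a toy sketch of the two searches. Everything in it is hypothetical: the `Snippet` type, the `latent_activation` scores, and the 0.5 threshold are made up for illustration and are not measurements from our classifier.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    injurious: bool           # hypothetical human label
    latent_activation: float  # hypothetical "injury latent" score in [0, 1]


def top_by_activation(snippets, k):
    """Return the k snippets that most strongly activate the latent."""
    return sorted(snippets, key=lambda s: s.latent_activation, reverse=True)[:k]


def false_negatives(snippets, threshold=0.5):
    """Return injurious snippets whose latent activation is below threshold."""
    return [s for s in snippets if s.injurious and s.latent_activation < threshold]


# Invented examples; the scores are illustrative only.
SNIPPETS = [
    Snippet("He was stabbed in the arm.", True, 0.95),
    Snippet("She dropped the baby on the floor.", True, 0.10),
    Snippet("They had tea in the garden.", False, 0.02),
    Snippet("The fall broke his leg.", True, 0.90),
]

easy_cases = top_by_activation(SNIPPETS, k=2)
hard_cases = false_negatives(SNIPPETS)
```

In this toy data, maximizing the latent only surfaces the stabbing and the broken leg, which the classifier presumably already handles. The dropped-baby snippet only shows up in the second search, and that search needs an injury label from somewhere other than the latent.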
One way I’ve been thinking of this is that maybe the model has something like 20 different concepts for violence, and we’re trying to find each of them in our fine-tuning process.