We’ve tried some things kind of like this, though less sophisticated. The person who was working on this might comment describing them at some point.
One fundamental problem here is that I’m worried that finding a “violence” latent is already what we’re doing when we fine-tune. And so I’m worried that the classifier mistakes that will be hardest to stamp out are those that we can’t find through this kind of process.
I have an analogous concern with the “make the model generate only violent completions”—if we knew how to define “violent”, we’d already be done. And so I’d worry that the definition of violence used by the generator here is the same as the definition used by the classifier, and so we wouldn’t find any new mistakes.
Controlling the violence latent would let you systematically sample for it: you could hold the violence latent constant, and generate an evenly spaced grid of points around it to get a wide diversity of texts that are violent but stylistically/semantically unique. Kinds of text which would be exponentially hard to find by brute-force sampling can be found this way easily. It also lets you do various kinds of guided search or diversity sampling, and do data augmentation (encode known-violent samples into their latent, hold the violence latent constant, generate a bunch of samples ‘near’ it). Even if the violence latent is pretty low quality, it’s still probably a lot better as an initialization for sampling than trying to brute-force random samples and running into very rapidly diminishing returns as you try to dig your way into the tails.
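To make that concrete, here is a minimal sketch of the grid-sampling / data-augmentation idea, assuming a VAE-style setup where hypothetical encode()/decode() functions map text to and from a latent vector and you have already identified which coordinate is the violence latent (all names here are illustrative placeholders, not anything from the actual system):

```python
# Sketch only: holds a designated "violence" coordinate of a latent vector fixed
# while sweeping two other coordinates over an even grid, then decodes each point.
# `decode` is a hypothetical function returning text from a latent vector.
import numpy as np

def grid_samples_around(z_seed, violence_dim, decode, span=2.0, steps=5, dims_to_vary=None):
    """Hold z_seed[violence_dim] fixed and sweep other dimensions on an even grid."""
    if dims_to_vary is None:
        # Vary the first two non-violence dimensions for a 2D grid (an arbitrary illustrative choice).
        dims_to_vary = [d for d in range(len(z_seed)) if d != violence_dim][:2]
    offsets = np.linspace(-span, span, steps)
    samples = []
    for off_a in offsets:
        for off_b in offsets:
            z = z_seed.copy()
            z[dims_to_vary[0]] += off_a
            z[dims_to_vary[1]] += off_b
            # The violence coordinate is untouched, so every decoded text should stay
            # "violent" while varying stylistically/semantically.
            samples.append(decode(z))
    return samples

# Data-augmentation variant: encode a known-violent snippet, then sample 'near' it.
# z_known = encode("<known-violent snippet>")        # hypothetical encoder
# augmented = grid_samples_around(z_known, violence_dim=7, decode=decode)
```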
And if you can’t do any of that because there is no equivalent of a violence latent, or its nearest equivalent is clearly too narrow & incomplete, that is pretty important, I would think. Violence is such a salient category, so frequent in fiction and nonfiction (news), that a generative model which has not learned it as a concept is, IMO, probably too stupid to be all that useful as a ‘model organism’ of alignment. (I would not expect a classifier based on a failed generative model to be all that useful either.) If a model cannot or does not understand what ‘violence’ is, how can you hope to get a model which knows not to generate violence, can recognize violence, can ask for labels on violence, or do anything useful about violence?
So note that we’re actually working on the predicate “an injury occurred or was exacerbated”, rather than something about violence (I edited out the one place I referred to violence instead of injury in the OP to make this clearer).
The reason I’m not that excited about finding this latent is that I suspect that the snippets that activate it are particularly easy cases—we’re only interested in generating injurious snippets that the classifier is wrong about.
For example, I think that the model is currently okay with dropping babies probably because it doesn’t really think of this as an injury occurring, and so I wouldn’t have thought that we’d find an example like this by looking for things that maximize the injury latent. And I suspect that most of the problem here is finding things that don’t activate the injury latent but are still injurious, rather than things that do.
One way I’ve been thinking of this is that maybe the model has something like 20 different concepts for violence, and we’re trying to find each of them in our fine-tuning process.