The two approaches can (and probably should) be combined: having built these classifiers to label the pretraining data, also running them on the output seems an obvious step, and lets you monitor how well the behavior the model distilled from them matches what you built.
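To make the combination concrete, here's a minimal sketch of that monitoring step, assuming a hypothetical `tag_classifier` (the same one used to label the pretraining data) and a hypothetical `generate_with_tags` wrapper around the model; neither is a real API.

```python
# Minimal sketch of combining the two uses of the classifiers: the same (hypothetical)
# tag_classifier that labelled the pretraining corpus is re-run over the model's output,
# and we log how often its verdict agrees with the tags the model emitted itself.
import re

def tag_agreement(prompt, generate_with_tags, tag_classifier, tag="deceit"):
    text = generate_with_tags(prompt)            # model output with its tags left in
    model_tagged = f"<{tag}>" in text            # did the model tag its own output?
    untagged = re.sub(r"</?\w+>", "", text)      # strip tags before re-classifying
    classifier_tagged = tag_classifier(untagged, tag)
    return {"model": model_tagged,
            "classifier": classifier_tagged,
            "agree": model_tagged == classifier_tagged}
```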
According to the paper, interventions during pretraining are much more effective than ones applied only during fine-tuning (strikingly so in most of their diagrams), generally giving the lowest levels of undesirable behavior for the least capability loss, and often the best resistance to jailbreaking attempts. In their experiments this approach was Pareto-optimal. We’d need to confirm that this remains true with larger models and the more abstract sorts of classifiers and tags I’m proposing.
When tagging the pretraining data, you’re only trying to catch past deceit (or whatever else) from humans on the Internet, who cannot and have no reason to adapt to improvements in your classifier. You’re not trying to catch it from your simulated (possibly superhuman) agent, where you might inadvertently be training it to become more subtle, so you avoid a concerning adversarial dynamic.
Unlike model-training approaches, this gives immediate on-the-fly switchability at inference time via the token-banning mechanism: you can dynamically decide, halfway through a generation, “it’s now OK to get <angry> if it seems appropriate”, or even “switch to <angry> mode now”. (Admittedly, you could probably also do this by switching trained LoRAs on or off, or by transferring the generation between fine-tuned models, at somewhat more computational recalculation expense.)
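As a concrete illustration of that switchability, here's a minimal sketch of a greedy decoding loop with the token-banning step made dynamic. It assumes the tags were trained in as ordinary single tokens, that `model` returns HuggingFace-style next-token logits, and that `tag_ids` maps tag names to token ids; `banned_policy` is a hypothetical hook for the mid-generation decision.

```python
import torch

def generate(model, input_ids, tag_ids, banned_policy, max_new_tokens=200):
    for step in range(max_new_tokens):
        logits = model(input_ids).logits[0, -1]             # next-token logits (HF-style output assumed)
        banned = [tag_ids[t] for t in banned_policy(step)]  # which tags are banned can change per step
        if banned:
            logits[banned] = float("-inf")                  # hard-ban the currently disallowed tags
        next_id = int(torch.argmax(logits))                 # greedy decoding, for simplicity
        next_tok = torch.tensor([[next_id]], device=input_ids.device)
        input_ids = torch.cat([input_ids, next_tok], dim=-1)
    return input_ids

# e.g. allow <angry> only after token 100, keep <deceit> banned throughout:
# banned_policy = lambda step: {"angry", "deceit"} if step < 100 else {"deceit"}
```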
Monitoring the logits of currently-banned tags, before they are reset to minus infinity, gives you classifiers that are trained into the model, directly part of its internal thinking process, and that are the best predictors that a lot of SGD could find. The larger and more capable your model is, the more computational capacity these classifiers can use, so this looks like it should scale well (as long as you also scale your pretagging classifiers).
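A sketch of what reading those built-in classifiers might look like, under the same assumptions as the loop above: take the next-token logits before the ban is applied and record how much probability mass the model wanted to put on each banned tag. A sustained high reading on, say, <deceit> is the warning signal, even though the tag itself never gets emitted.

```python
import torch.nn.functional as F

def tag_risk(pre_ban_logits, tag_ids):
    """Probability the model would have opened each tag, read off before the logits are banned."""
    probs = F.softmax(pre_ban_logits, dim=-1)
    return {name: probs[tid].item() for name, tid in tag_ids.items()}
```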
Having trained the LLM, you can rerun it over a sample of its pretraining data, see where it expects the tags to be, and compare that to where your classifier put them. The diffs let you see how well the LLM distilled the behavior of the classifiers, and might well give you clues for how to make the classifiers’ behavior more consistent, or even make them smarter if the LLM has learned patterns that generalize beyond its training distribution. You could also try this with prompts that attempt to tweak the definition of a tag that the LLM learnt, say a description of subtle cues that might indicate that someone is actually being <deceit>ful.
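A rough sketch of computing that diff, again assuming single-token tags and HuggingFace-style model outputs; the 0.5 threshold and helper names are purely illustrative.

```python
import torch
import torch.nn.functional as F

def model_tag_positions(model, input_ids, tag_id, threshold=0.5):
    with torch.no_grad():
        logits = model(input_ids).logits[0]              # [seq_len, vocab]
    tag_probs = F.softmax(logits, dim=-1)[:, tag_id]     # P(next token opens the tag) at each position
    return {i for i, p in enumerate(tag_probs.tolist()) if p > threshold}

def tag_diff(model_positions, classifier_positions):
    return {"model_only": model_positions - classifier_positions,       # possible over-generalization
            "classifier_only": classifier_positions - model_positions,  # places the LLM missed
            "both": model_positions & classifier_positions}
```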
Basically, a lot of aspects of this approach give me reason to be intuitively hopeful that it’s going to work well. But (AFAIK) it’s currently a little-explored approach, based on a paper from a few months ago (two of whose authors work for superscalers), so obviously we’d need to try it and see if it’s actually as promising as it looks. This is an IMO very interesting research avenue, not a ready-to-go solution. And certain aspects of it (running a classifier over the entire pretraining set to tag it, then pretraining a model) are very expensive, so researching it will be expensive: something that only orgs with the resources to train new models from scratch can do. Likely one would start off instead by applying this as a fine-tuning approach on a much smaller dataset, get the kinks out, then redo it as a full pretraining run and confirm what quality/effectiveness improvement that gives.
I’ve added brief mentions of some of these points to the original post. Thanks for the discussion.