Has there been any work on the scaling laws of out-of-distribution capability/behavior decay?
A simple example:
Simultaneously train task A and task B for N steps.
Stop training task B, but continue to evaluate the performance of both A and B.
Observe how rapidly task B performance degrades.
Repeat across scale and regularization strategies.
Would be nice to also investigate different task types. For example, tasks with varying degrees of implied overlap in underlying mechanisms (like #2).
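For concreteness, a minimal sketch of that loop in Python; `make_model`, `sample_batch`, `train_step`, and `evaluate` are hypothetical stand-ins for whatever training stack is actually used, so treat this as a sketch of the setup rather than a reference implementation.

```python
# Minimal sketch of the A/B decay experiment, assuming hypothetical helpers
# make_model, sample_batch, train_step, and evaluate.

def run_decay_experiment(model_size, joint_steps, decay_steps, eval_every=100):
    model = make_model(model_size)
    history = {"A": [], "B": []}

    # Phase 1: train on a mixture of task A and task B for N steps.
    for _ in range(joint_steps):
        train_step(model, sample_batch("A"))
        train_step(model, sample_batch("B"))

    # Phase 2: keep training on A only, but continue evaluating both tasks.
    for step in range(decay_steps):
        train_step(model, sample_batch("A"))
        if step % eval_every == 0:
            history["A"].append(evaluate(model, "A"))
            history["B"].append(evaluate(model, "B"))  # this is the decay curve

    return history


# Repeat across scales (and regularization strategies) to look for a trend.
curves = {size: run_decay_experiment(size, joint_steps=10_000, decay_steps=50_000)
          for size in ("10M", "100M", "1B")}
```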
I’ve previously done some of these experiments privately, but not with nearly the compute necessary for an interesting result.
The sleeper agents paper reminded me of it. I would love to see what happens with a closer-to-frontier model that's intentionally backdoored and then subjected to continued pretraining. Can a backdoor persist through another trillion tokens of nonadversarial-but-extremely-broad training? Does that vary with scale, etc.?
I’d also like to intentionally find the circumstances that maximize the persistence of out-of-distribution capabilities not implied by the current training distribution.
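As a rough illustration of how that measurement could be set up (everything named here, `load_backdoored_model`, `sample_pretraining_batch`, `train_step`, `trigger_success_rate`, is a hypothetical placeholder, not anyone's actual pipeline):

```python
# Hypothetical sketch: probe backdoor persistence during continued pretraining.
# All helpers below are assumed placeholders, not real APIs.

def track_backdoor_persistence(total_steps, probe_every=1_000):
    model = load_backdoored_model()
    tokens_seen = 0
    persistence_curve = []

    for step in range(total_steps):
        batch = sample_pretraining_batch()  # broad, nonadversarial data
        train_step(model, batch)
        tokens_seen += batch.num_tokens

        if step % probe_every == 0:
            # Fraction of triggered prompts that still elicit the backdoored behavior.
            persistence_curve.append((tokens_seen, trigger_success_rate(model)))

    return persistence_curve
```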
Seems like identifying a robust trend here would have pretty important implications, whichever direction it points.
Yeah, I’ve seen work on the sort of thing in your example in the continual learning literature. Also benchmarks with, like, 10 component tasks, trained sequentially but tested on every task trained so far. Then you can watch the earlier tasks fall off as training progresses.
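That evaluation is usually summarized as a matrix of scores: train on tasks 1..K in sequence and, after each task, test on everything seen so far. A quick sketch, again with hypothetical `make_model`, `train_on`, and `evaluate` helpers:

```python
# Sequential continual-learning evaluation: after finishing each task,
# evaluate on every task trained so far. Helpers are hypothetical.

def sequential_forgetting(task_ids, steps_per_task):
    model = make_model()
    results = []  # results[k][t] = score on task t after finishing task k

    for k, task in enumerate(task_ids):
        train_on(model, task, steps_per_task)
        results.append([evaluate(model, seen) for seen in task_ids[: k + 1]])

    return results  # earlier tasks' entries show the falloff over training
```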
For what it’s worth (perhaps nothing), in private experiments I’ve seen that in certain toy transformer models, task B performance gets wiped out almost immediately when you stop training on it, in situations where the two tasks are related in some way.
I haven’t looked at how deep the erasure is, or whether the capability is far easier to revive than it was to train in the first place.
Yup, exactly the same experience here.