I found this to be a very interesting and useful post. Thanks for writing it! I’m still trying to process / update on the main ideas and some of the (what I see as valid) pushback in the comments. A couple non-central points of disagreement:
I don’t expect catastrophic interference between any pair of these alignment techniques and capabilities advances.
This seems like it’s because these aren’t very strong alignment techniques, and these aren’t very strong capability advances. Or like, these alignment techniques work if alignment is relatively easy, or if the amount you need to nudge a system from default to aligned is relatively small. If we run into harder alignment problems, e.g., gradient hacking to resist goal modification, I expect we’ll need “stronger” alignment techniques. I expect stronger alignment techniques to be more brittle to capability changes, because they’re meant to accomplish a harder task (nudging a system farther toward alignment) than weaker alignment techniques are. Perhaps more work will go into making them robust to capability changes, but this is non-obvious; e.g., when humans are trying to accomplish a notably harder task, we often build a different tool for it, one intended to be robust to the harder setting (like a cargo plane rather than a propeller plane, when you want to carry a bunch of stuff via the air).
I also expect that, if we end up in a regime of large capability advances, alignment techniques will generally be more brittle because more is changing in your system; this seems possible if there are a bunch of AIs trying to develop new architectures and paradigms for AI training. So I’m like “yeah, the list of relatively small capability advances you provide sure looks like things that won’t break alignment, but my guess is that larger capability advances are what we should actually be worried about.”
I still think the risks are manageable, since the first-order effect of training a model to perform an action X in circumstance Y is to make the model more likely to perform actions similar to X in circumstances similar to Y.
I don’t understand this sentence, and I would appreciate clarification!
Additionally, current practice is to train language models on an enormous variety of content from the internet. The odds of any given subset of model data catastrophically interfering with our current alignment techniques cannot be that high, otherwise our current alignment techniques wouldn’t work on our current models.
Don’t trojans and jailbreaks provide substantial evidence against this? I think neither is a perfect counterexample: trojans aren’t the main target of most “alignment techniques” (but they are attributable to a small amount of training data), and jailbreaks aren’t necessarily attributable to a small amount of training data alone (but they are the target of “alignment” efforts). But it feels to me like the picture the two paint together is that current alignment techniques aren’t working especially well, and can be thwarted by a small amount of training data.