Additionally, I claim that alignment techniques already generalize across human contributions to AI capability research.
I think this set of examples about alignment techniques and capabilities advances misunderstands what others discussing this mean by ‘capabilities advances’. I believe the sense in which Eliezer means a ‘capabilities advance’ when discussing the evolutionary analogy is a measured behavioral capability, as in the Brier score on a particular benchmark, not a change to the design of the training process or inference process. The question under discussion is then: “How well does an alignment technique, like RLHF, generalize across a capabilities advance such as becoming able to translate accurately between English and French?” I think the evidence so far is that the emergent capabilities we’ve seen do not conflict with the alignment techniques we’ve developed.

The main point of the evolutionary analogy argument, though, is that there is a class of emergent capabilities expected in the future which will depart from this pattern. These hypothetical future emergent capabilities are hypothesized to be tightly clustered in arrival time, much larger in combined magnitude and scope, and highly divergent from past trends. If they really are clustered, then a new scaled-up model would likely gain all of these capabilities at nearly the same point in the training process (or at least at nearly the same model scale). The stronger-than-expected model would thus present a surprising jump in capabilities without a concomitant jump in alignment. I’m not myself sure that reality is structured such that we should expect this, but I do think it makes enough logical sense that we should make careful preparations to prevent bad outcomes if it did turn out to be the case.
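As a concrete anchor for what ‘measured behavioral capability’ means in the paragraph above: a benchmark score like the Brier score is computed purely from a model’s outputs, with no reference to how the model was trained. A minimal sketch (the predictions and outcomes are made-up illustrative numbers):

```python
import numpy as np

# Brier score: mean squared error between predicted probabilities and
# binary outcomes. Lower is better; 0.0 is a perfect forecaster.
def brier_score(predicted_probs, outcomes):
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((predicted_probs - outcomes) ** 2))

# Hypothetical model predictions on a yes/no benchmark.
preds = [0.9, 0.2, 0.7, 0.5]
truth = [1, 0, 1, 0]
print(brier_score(preds, truth))  # 0.0975
```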
E.g., I expect that choosing to use the [LION optimizer](https://arxiv.org/abs/2302.06675) in place of the [Adam optimizer](https://arxiv.org/abs/1412.6980) would have very little impact on, say, the niceness of a language model you were training, except insofar as your choice of optimizer influences the convergence of the training process. Architecture choices seem ‘values neutral’ in a way that data choices are not.
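To make concrete how small this kind of change is in practice: swapping the optimizer is typically a one-line change to the training setup, and it never touches the data or the loss. A minimal sketch, assuming PyTorch and the third-party `lion-pytorch` package (the model and hyperparameters are placeholders):

```python
import torch
from torch import nn
from lion_pytorch import Lion  # third-party implementation of the Lion optimizer

model = nn.Linear(512, 512)  # stand-in for a real language model

# Baseline choice: AdamW. The data pipeline and loss are defined elsewhere
# and are unaffected by this choice.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Swapping in Lion changes only this one line (Lion typically wants a smaller
# learning rate and larger weight decay than AdamW).
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)
```

Any effect of that swap on the model’s expressed values would be an indirect effect of changed training dynamics, which matches the claim above.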
In the brain, architecture changes matter a lot to values-after-training. Someone with fewer amygdala neurons is far more likely to behave in anti-social ways. In the context of current models, which are developing ‘hidden’ logical architectures/structures/heuristics within the very loose architectures we’ve given them, changing the visible attributes of the architecture (e.g. the optimizer or layer width) is likely to be mostly value-neutral on average. However, changing this ‘hidden structure’, by adding values to the activation states for instance, is almost certainly going to have a substantial effect on behaviorally expressed values.
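A minimal sketch of what ‘adding values to the activation states’ could look like mechanically, using a toy PyTorch model and a forward hook; the steering vector here is random and purely illustrative, whereas in practice it would be a direction chosen to encode something meaningful:

```python
import torch
from torch import nn

# Toy stand-in for a network with a hidden activation we can intervene on.
model = nn.Sequential(
    nn.Linear(16, 64),
    nn.ReLU(),
    nn.Linear(64, 64),   # we will add a steering vector to this layer's output
    nn.ReLU(),
    nn.Linear(64, 4),
)

steering_vector = torch.randn(64) * 0.5  # illustrative; not a learned direction

def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + steering_vector

handle = model[2].register_forward_hook(add_steering)

x = torch.randn(1, 16)
with torch.no_grad():
    steered = model(x)
handle.remove()
with torch.no_grad():
    unsteered = model(x)

# The visible architecture and weights are identical; only the activation
# state changed, yet the output behavior differs.
print((steered - unsteered).abs().max())
```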
The claim that ‘alignment techniques generalise across human contributions to architectures’ isn’t about the SLT threat model. It’s about the “AIs do AI capabilities research” threat model.