If we view the discovery of particular structures such as induction heads as chancing upon a hard-to-locate region of parameter space (or perhaps as crossing a high activation-energy barrier), and if we see these structures being repeatedly rediscovered across models (“parallel evolution”), is it possible to reduce training time by initializing the network’s parameters “close” to that region?
Speaking more mechanistically, is it possible to initialize a subset of the network, prior to training, to have a known functional structure, such as initializing (a guess at) the right number of induction heads? The intuition: if every Transformer spends some of its training on problem-agnostic work like forming induction heads, can we short-circuit that work? A sketch of what such an initialization might look like follows.
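As a minimal, hedged illustration of hand-initialized structure, here is one way to wire a “previous-token” head, the usual first half of the induction circuit, by construction rather than by training. It assumes one-hot absolute position embeddings living in their own subspace of the residual stream; the dimensions and variable names are illustrative, not taken from any particular model.

```python
import torch

# Sketch: a hand-built "previous-token" attention head, assuming one-hot
# absolute position embeddings (an assumption; real models embed differently).
d_pos, scale = 16, 20.0                  # context length, match sharpness
shift = torch.roll(torch.eye(d_pos), shifts=1, dims=1)  # maps e_j -> e_{j+1}

W_Q = torch.eye(d_pos)                   # query exposes the current position i
W_K = shift * scale                      # key at position j matches queries at i = j + 1

P = torch.eye(d_pos)                     # residual stream = position one-hots only
scores = (P @ W_Q) @ (P @ W_K).T         # scores[i, j] = scale * [j == i - 1]
mask = torch.tril(torch.ones(d_pos, d_pos, dtype=torch.bool))
attn = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
print(attn.argmax(dim=-1))               # position i attends to i - 1 (0 to itself)
```

The harder half is the second-layer head whose keys must read the subspace this head writes to; that composition is exactly where the next question’s uncertainty lives.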
Or are induction heads strictly downstream of earlier input-shaping and feature-extraction circuitry that must form first? If so, would it be possible to “insert” induction heads manually after some initial training, forcing the phase change rather than waiting for it to happen naturally? One hedged version of such surgery is sketched below.
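For instance, if a donor model with known induction heads is available, the insertion could be a weight graft into a partially trained model. The sketch below assumes GPT-2-style fused-QKV state-dict keys (`attn.c_attn.weight`, `attn.c_proj.weight`); the key names, shapes, and per-head slicing are assumptions to adapt to whatever architecture is actually in use.

```python
import torch

def graft_head(dst_sd, src_sd, layer, head, n_head, d_model,
               prefix="transformer.h"):
    """Overwrite one attention head in dst_sd with the same head from src_sd.

    Assumes GPT-2-style fused QKV weights of shape [d_model, 3 * d_model]
    under keys like '{prefix}.{layer}.attn.c_attn.weight'. Purely a sketch:
    adjust the keys and slicing for the real architecture.
    """
    d_head = d_model // n_head
    qkv = f"{prefix}.{layer}.attn.c_attn.weight"
    proj = f"{prefix}.{layer}.attn.c_proj.weight"
    with torch.no_grad():
        for block in range(3):               # Q, K, V blocks of the fused matrix
            cols = slice(block * d_model + head * d_head,
                         block * d_model + (head + 1) * d_head)
            dst_sd[qkv][:, cols] = src_sd[qkv][:, cols]
        rows = slice(head * d_head, (head + 1) * d_head)
        dst_sd[proj][rows, :] = src_sd[proj][rows, :]  # this head's output rows
```

An open design question after grafting: whether to freeze the grafted head briefly so the surrounding network can adapt to it, or let gradient descent immediately overwrite it.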
If these structures can only form later in training, can we guide stochastic gradient descent with structural awareness? In other words, can we in some sense “look ahead” in parameter space and favor updates that move toward these known-useful structures, even when that direction is not the locally optimal gradient step?
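One hedged way to operationalize this: add an auxiliary term to the loss that rewards the attention “stripe” an induction head produces on repeated random sequences (the induction-score diagnostic from the interpretability literature), so the optimizer is pulled toward the structure before the language-modeling gradient alone would find it. The tensor layout, the coefficient, and the choice of head below are all assumptions.

```python
import torch

def induction_score(attn: torch.Tensor, half_len: int) -> torch.Tensor:
    """Mean attention mass on the induction stripe of a repeated sequence.

    attn: [..., 2 * half_len, 2 * half_len] attention pattern on an input
    whose second half repeats its first half. An induction head at position
    t >= half_len attends to t - (half_len - 1): the token right after the
    previous occurrence of the current token.
    """
    stripe = attn.diagonal(offset=-(half_len - 1), dim1=-2, dim2=-1)
    return stripe[..., 1:].mean()   # keep only second-half destinations

# Hypothetical training step: subtract the score (with a small, assumed
# coefficient) so gradient descent favors induction-like attention patterns.
# loss = lm_loss - 0.01 * induction_score(attn[layer][:, head_idx], half_len)
```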