My intuition for why fine-tuning would not create new types of simulacra runs along these lines:
Inside, you probably have a bunch of structures that do various things. And these probably work together to make larger structures, and so on.
And then you apply fine-tuning, which is trying to shift high-level aspects of the output.
And I figure it’s probably a smaller step, in terms of the underlying weights, to shift from activating one structure more to activating another more than it is to create a whole new structure.
But if you’re just shifting from one structure to another, without creating new structures, it seems to me that this is just shifting to another sort of output the model could have created all along.
So, if fine-tuning can achieve its objectives by shifting the simulacra probabilities, it seems to me it would tend to do that first, before creating any new types of simulacra.
But I realize it’s more complicated than that, since a shift in the activation of structures at one level is the creation of a new structure at a higher level.
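Still, to make the basic intuition concrete, here is a toy sketch (entirely my own illustration, not a claim about real Transformer internals): the model’s output is a gated mixture of two frozen “structures”, and “fine-tuning” lowers its loss purely by moving the tiny gating parameters, shifting probability mass toward one pre-existing structure rather than building a new one.

```python
import numpy as np

rng = np.random.default_rng(0)

def f1(x):          # pre-trained structure A (stands in for one simulacrum)
    return np.sin(x)

def f2(x):          # pre-trained structure B (stands in for another)
    return np.cos(x)

def gate(logits):
    """Softmax gate: how strongly each frozen structure is activated."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def model(x, logits):
    w = gate(logits)
    return w[0] * f1(x) + w[1] * f2(x)

# "Fine-tuning" objective: make the output look like structure B.
x = rng.uniform(-3.0, 3.0, size=256)
target = f2(x)

logits = np.array([0.0, 0.0])   # the only trainable parameters
lr = 0.5
for _ in range(200):
    w = gate(logits)
    err = model(x, logits) - target
    # Mean-squared-error gradient w.r.t. the mixing weights;
    # f1 and f2 themselves stay frozen.
    grad_w = np.array([np.mean(2 * err * f1(x)),
                       np.mean(2 * err * f2(x))])
    logits -= lr * w * (grad_w - w @ grad_w)   # backprop through the softmax

print(gate(logits))   # mass has shifted toward f2, e.g. ~[0.01, 0.99]
```

Of course a real model has vastly more parameters than two gating logits, but the point survives: when a cheap reweighting of existing structures already satisfies the objective, gradient descent has little pressure to build anything new.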
This explanation implies that circuits, or “structures”, are pretty much all there is in the mechanistic picture of transformers, and that concepts, as inputs to downstream circuits, are “activated” probabilistically between 0 and 1.
But that picture seems likely impoverished: I think Transformers cannot be reduced to a network of circuits. There are perhaps also non-trivial concept representations within the activations, on which MLP layers perform non-trivial algorithmic computations. These representations, and the computations performed on them, are probably easier to shift during RLHF than the circuit topology.
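To gesture at what I mean, here is a minimal sketch (the dimensions, weights, and “concept” direction are all made up for illustration, under the linear-representation assumption): a concept lives as a direction in the activation vector, and a rank-one weight update, in the spirit of LoRA-style low-rank adaptation, changes how a layer responds to that direction while leaving its response to everything orthogonal, i.e. the rest of the “wiring”, untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16   # toy activation width

# A "concept" as a direction in activation space
# (the linear-representation assumption).
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

W = rng.normal(size=(d, d)) / np.sqrt(d)   # pre-trained layer weights

# Low-rank fine-tuning update: change only how the layer responds
# to the concept direction.
u = rng.normal(size=d) / np.sqrt(d)
W_ft = W + np.outer(u, concept)            # rank-one update

h_concept = concept                         # activation along the concept
h_other = rng.normal(size=d)
h_other -= (concept @ h_other) * concept    # orthogonal to the concept

print(np.linalg.norm((W_ft - W) @ h_concept))  # nonzero: concept handling changed
print(np.linalg.norm((W_ft - W) @ h_other))    # ~0: rest of the "wiring" intact
```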
And there are probably yet more phenomena and dynamics in Transformers that couldn’t be reduced to the things mentioned above, but I’m out of my depth to discuss this.
Also, if feedback is applied throughout pre-training, it must influence the very structures formed within the Transformer.
Perhaps I used the wrong term: by “activating” I did not mean just on/off (with “more” being taken to imply probability?). I mainly meant giving more weight, though on/off could also be involved. Sorry, I am not necessarily familiar with the technical terms used.
I am also thinking of “structures” as a more general concept than just circuits, and not necessarily isolated within the system. I am thinking of a “structure” as a pattern within the system which achieves one or more functions. By a “higher-level” structure I mean a structure made of other structures.
Yes, regarding feedback applied throughout pre-training: in the post I was only considering the case where fine-tuning is applied after pre-training. Feedback being applied during pre-training is a different matter.