the hope is that by “nudging” the model at an early layer, we can activate one of the many latent behaviors residing within the LLM.
In the language of shard theory: “the hope is that shards activate based on feature directions in early layers. By adding in these directions, the corresponding shards activate different behaviors in the model.”
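To make the "nudge" concrete, here is a minimal toy sketch in pure Python. It is not a real LLM or any library's API: the residual-style layers, the `forward` helper, the `steer_at` index, and the chosen `direction` and `coeff` are all hypothetical names and values picked for illustration. The only idea it tries to show is that a vector added to the hidden state at an early layer gets carried through, and amplified by, the layers that follow.

```python
from typing import Callable, List, Optional

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(weight: float) -> Layer:
    """Toy residual layer: h -> h + weight * relu(h). Purely illustrative."""
    def layer(h: Vector) -> Vector:
        return [x + weight * max(x, 0.0) for x in h]
    return layer


def forward(
    layers: List[Layer],
    x: Vector,
    steer_at: Optional[int] = None,
    steering: Optional[Vector] = None,
) -> Vector:
    """Run the toy model, optionally adding `steering` to the hidden
    state just before layer index `steer_at`."""
    h = list(x)
    for i, layer in enumerate(layers):
        if steering is not None and i == steer_at:
            # The "nudge": add a feature direction to the hidden state.
            h = [a + b for a, b in zip(h, steering)]
        h = layer(h)
    return h


# Four identical toy layers and an input whose third feature is inactive.
layers = [make_layer(0.5) for _ in range(4)]
x = [1.0, -1.0, 0.0]

# Hypothetical feature direction (third coordinate) and coefficient.
direction = [0.0, 0.0, 1.0]
coeff = 2.0
steering = [coeff * d for d in direction]

baseline = forward(layers, x)
steered = forward(layers, x, steer_at=1, steering=steering)

print("baseline:", baseline)
print("steered: ", steered)
```

In this toy, the third feature stays at zero in the baseline run but is switched on and then grown by the later layers once the direction is added at layer 1, while the other features are unchanged. That is a loose analogue of the hope described above, not evidence for it.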