TurnTrout comments on Mechanistically Eliciting Latent Behaviors in Language Models

TurnTrout 1 May 2024 2:12 UTC
LW: 5 AF: 3
“the hope is that by ‘nudging’ the model at an early layer, we can activate one of the many latent behaviors residing within the LLM.”
In the language of shard theory: “the hope is that shards activate based on feature directions in early layers. By adding in these directions, the corresponding shards activate different behaviors in the model.”
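To make the "adding in these directions" picture concrete, here is a toy sketch in plain Python. It is not the post's method and not a real LLM: the two-layer network, its weights, and the names `forward`, `dominant_behavior`, and `direction_b` are all hypothetical, chosen so that each late-layer unit (standing in for a "shard") reads exactly one early-layer feature direction.

```python
# Toy illustration only: a hand-built two-layer network, not a real LLM.
# All weights and names below are hypothetical.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

# Early layer: identity map, standing in for early-layer activations.
W_EARLY = [[1.0, 0.0],
           [0.0, 1.0]]

# Late layer: each row reads one feature direction and drives one behavior.
W_LATE = [[1.0, 0.0],   # "behavior A" reads feature direction e0
          [0.0, 1.0]]   # "behavior B" reads feature direction e1

def forward(x, steering=None, scale=0.0):
    """Run the toy network, optionally adding scale * steering at the early layer."""
    h = matvec(W_EARLY, x)
    if steering is not None:
        h = [h_i + scale * s_i for h_i, s_i in zip(h, steering)]
    return relu(matvec(W_LATE, h))

def dominant_behavior(outputs):
    """Label the most strongly activated late-layer unit."""
    return "AB"[outputs.index(max(outputs))]

prompt = [1.0, 0.2]        # input that mostly activates behavior A
direction_b = [0.0, 1.0]   # feature direction that behavior B reads

baseline = forward(prompt)
steered = forward(prompt, steering=direction_b, scale=2.0)

print(dominant_behavior(baseline))  # A
print(dominant_behavior(steered))   # B
```

In this toy, the nudge leaves the behavior-A unit's activation unchanged and only raises the behavior-B unit, which is the clean case the shard framing has in mind. In a real model, feature directions are not axis-aligned and downstream computation is far less separable.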