It’s worth noting that activations are one thing you can modify, but many of the most performant methods (e.g., LoRRA) modify the weights. (Representations = {weights, activations}, hence “representation” engineering.)
Current theme: default
Less Wrong (text)
Less Wrong (link)
It’s worth noting that activations are one thing you can modify, but many of the most performant methods (e.g., LoRRA) modify the weights. (Representations = {weights, activations}, hence “representation” engineering.)