Sammy Martin comments on Steering GPT-2-XL by adding an activation vector

Sammy Martin 17 May 2023 22:10 UTC
LW: 5 AF: 2
This strikes me as a very preliminary, bludgeon-like version of the holy grail of mechanistic interpretability: actually understanding, and being able to manipulate, the specific concepts that an AI model uses.
- TurnTrout 22 May 2023 14:27 UTC
  LW: 5 AF: 3
  I think that capacity would be really nice, and our results are maybe a very, very rough initial version of it. I want to caution that we should be very careful about making inferences about what concepts are actually used by the model. From a footnote:
  Of course, there need not be a “wedding” feature direction in GPT-2-XL. What we have observed is that adding certain activation vectors will reliably produce completions which appear to us to be “more about weddings.” This could take place in many ways, and we encourage people to avoid instantly collapsing their uncertainty about how steering vectors work.