From discussion with Logan Riggs (Eleuther), who worked on the tuned lens: the tuned lens suggests that the residual stream goes through some linear transformations between layers, so representations at different layers aren't directly comparable. This would interfere with a couple of methods for trying to understand neurons based on weights: 1) the embedding space view, 2) calculating virtual weights between neurons in different layers.
However, we could try correcting for this by using the transformations learned by the tuned lens to translate between the residual stream at different layers, and maybe that would make these methods more effective. By default, I think the tuned lens learns only the transformation needed to predict the output token, but the method could be adapted to retrodict the input token from each layer as well; we'd need both. Code for the tuned lens is at https://github.com/alignmentresearch/tuned-lens
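Here's a rough sketch of what the virtual-weights correction could look like. This is not the tuned-lens library's API; the per-layer affine maps (`A_a`, `A_b`) are stand-ins for whatever translators a tuned-lens-style method would actually learn, and the weight matrices are random placeholders for real model weights:

```python
# Sketch: virtual weights between neurons in different layers, with and
# without correcting for the basis drift the tuned lens suggests.
# Everything here is illustrative; A_a / A_b stand in for the per-layer
# affine translators a tuned-lens-style method would learn.
import torch

d_model, d_mlp = 512, 2048

# Weights writing into the residual stream at layer A (e.g. an MLP output
# projection) and reading from it at a later layer B (e.g. an MLP input
# projection). Random placeholders for the real model weights.
W_out_a = torch.randn(d_mlp, d_model)   # layer A neuron -> residual stream
W_in_b = torch.randn(d_model, d_mlp)    # residual stream -> layer B neuron

# Naive virtual weights: assumes the residual stream uses the same basis at
# both layers, so neuron i at A connects to neuron j at B with strength
# (W_out_a @ W_in_b)[i, j].
virtual_naive = W_out_a @ W_in_b        # (d_mlp, d_mlp)

# Tuned-lens-style correction: suppose we have learned affine maps that
# translate each layer's residual stream into a common basis,
# x_common = x_layer @ A_layer + b_layer. For weight analysis only the
# linear part matters: a write at layer A appears at layer B as
# delta @ A_a @ inv(A_b), so we sandwich the translators between the
# write and read matrices.
A_a = torch.eye(d_model) + 0.01 * torch.randn(d_model, d_model)  # placeholder
A_b = torch.eye(d_model) + 0.01 * torch.randn(d_model, d_model)  # placeholder

virtual_corrected = W_out_a @ A_a @ torch.linalg.inv(A_b) @ W_in_b

print(virtual_naive.shape, virtual_corrected.shape)
```

If the tuned lens is right that each layer's basis drifts, the corrected matrix should be a better estimate of which neuron pairs actually talk to each other than the naive product.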
Here’s something I’ve been pondering.
Hypothesis: if transformers have internal concepts, and those concepts are represented in the residual stream, then, because we have access to 100% of the information, it should be possible for a non-linear probe to get 100% out-of-distribution accuracy. The 100% matters because we care about how something like value learning will generalise OOD.
And yet we don't get 100% (in fact, most reported metrics are on easier settings than the one we care about: in-distribution, or on carefully constructed setups). What is wrong with the hypothesis's assumptions, do you think?
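For concreteness, here is a minimal sketch of the experiment the hypothesis implies: train a small non-linear probe on residual-stream activations for some concept, then check whether it still gets the concept right out of distribution. The activations, concept labels, and distribution shift below are all placeholders; in practice they would come from a real transformer's residual stream on ID vs OOD data.

```python
# Sketch of the probe experiment the hypothesis implies. Activations and
# concept labels are random placeholders standing in for residual-stream
# activations on in-distribution vs out-of-distribution inputs.
import torch
import torch.nn as nn

d_model = 512
n_train, n_ood = 4096, 1024

# Placeholder "residual stream" activations and binary concept labels.
x_train = torch.randn(n_train, d_model)
y_train = (x_train[:, 0] > 0).float()            # stand-in concept
x_ood = torch.randn(n_ood, d_model) + 2.0        # crude distribution shift
y_ood = (x_ood[:, 0] > 0).float()

# Non-linear probe: a small MLP reading the residual stream.
probe = nn.Sequential(
    nn.Linear(d_model, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Train on in-distribution activations only.
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(x_train).squeeze(-1), y_train)
    loss.backward()
    opt.step()

# Evaluate out of distribution. If the concept really is fully represented
# in the residual stream, the hypothesis says this should approach 100%.
with torch.no_grad():
    preds = (probe(x_ood).squeeze(-1) > 0).float()
    ood_acc = (preds == y_ood).float().mean().item()

print(f"OOD accuracy: {ood_acc:.3f}")
```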