I guess I’m imagining that the {presence of a representation of a path}, to the extent that it’s represented in the model at all, is used primarily to compute some sort of “top-right affinity” heuristic. So even if it’s true that subtracting the {representation of a path}-vector should do nothing when there’s no representation of a path, I think that subtracting the “top-right affinity” vector downstream of this path representation should still do something, whether or not a path representation is currently present.
So I guess the disagreement in our intuitions (or the intuitions suggested by our respective hypotheses) maybe just boils down to “is the thing we’re editing closer to a {path representation} or a {top-right affinity heuristic}?” Maybe this weakly implies that the effect would weaken or disappear if you tried to do your AVE at a later layer (as I suggest at the end of this comment), since a later layer might be more likely to represent a {top-right affinity heuristic} than a {path representation}?
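To make the two intuitions concrete, here is a toy sketch (not the maze model, and not anyone's actual method). It assumes, purely for illustration, that a {path representation} is read downstream through a gate that is off when no path is represented (modelled here as a ReLU), while a {top-right affinity heuristic} is read linearly. The directions `path_dir` and `affinity_dir` and both readout functions are hypothetical.

```python
import numpy as np

def path_gated_readout(h, path_dir):
    # Assumed hypothesis A: downstream only "sees" the path feature when it is
    # active, so the readout is gated (ReLU of the projection).
    return max(0.0, float(h @ path_dir))

def affinity_readout(h, affinity_dir):
    # Assumed hypothesis B: downstream reads the affinity feature linearly.
    return float(h @ affinity_dir)

# Hypothetical 2-d activation space with orthogonal feature directions.
path_dir = np.array([1.0, 0.0])
affinity_dir = np.array([0.0, 1.0])

# An activation with no path component but some affinity component.
h_no_path = np.array([0.0, 0.5])

# Effect of subtracting each vector from the activation.
delta_path = (path_gated_readout(h_no_path - path_dir, path_dir)
              - path_gated_readout(h_no_path, path_dir))
delta_affinity = (affinity_readout(h_no_path - affinity_dir, affinity_dir)
                  - affinity_readout(h_no_path, affinity_dir))

print("change in gated path readout:", delta_path)
print("change in linear affinity readout:", delta_affinity)
```

Under these assumptions the gated readout is unchanged when there is no path component to remove, while the linear readout shifts regardless. Which of the two pictures is closer to what the edited layer actually does is exactly the open question.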
It’s possible, however, that I’m misunderstanding your point. To help clarify, can I ask what you mean by “representation of a path” on a slightly more mechanistic level?
Do you mean you can find some set of activations (after the edited layer) from which you can faithfully reconstruct the path to the top right?
Or do you perhaps mean something weaker, like being able to find some activation that strongly and robustly correlates with “top-right-path-existence” or “top-right-path-length”, or something like that?[1]
Or maybe you didn’t mean anything specific and were just trying to draw a comparison to other reasoning processes? If so, I don’t quite buy, without further justification, that the comparison is likely to tell us much about the maze model’s internal cognition.
Or maybe you meant something else entirely! I’m sure I’ve left out many very reasonable possibilities, so please do correct me if I’m wrong!
Btw, it seems like a cheap and relatively informative experiment to just try computing neural correlates of variables like “distance to top-right-most reachable point” or “how close top-right-most reachable point is to the top-right”. This might be worth doing even if this isn’t what you meant by “representations of a path”, since it could shed light on which channels/layers are most important, or best to perform AVE on.
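A minimal sketch of what that correlate search could look like, assuming you have already collected one pooled activation per channel per maze (`acts`) and one scalar per maze for the variable of interest (`target`). The function name, array shapes, and the synthetic data are all hypothetical stand-ins; real activations would come from hooks on the maze model.

```python
import numpy as np

def channel_correlations(acts, target):
    """Pearson correlation of each channel with a scalar maze variable.

    acts:   (n_mazes, n_channels) array, e.g. spatially averaged activations
    target: (n_mazes,) array, e.g. distance to the top-right-most reachable point
    """
    acts = np.asarray(acts, dtype=float)
    target = np.asarray(target, dtype=float)
    a = acts - acts.mean(axis=0)
    t = target - target.mean()
    denom = np.sqrt((a ** 2).sum(axis=0) * (t ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (a * t[:, None]).sum(axis=0) / denom
    # Constant channels (zero variance) get correlation 0 instead of NaN.
    return np.nan_to_num(r)

# Synthetic stand-in data: channel 3 is built to track the target variable.
rng = np.random.default_rng(0)
n_mazes, n_channels = 200, 8
target = rng.uniform(0, 10, size=n_mazes)
acts = rng.normal(size=(n_mazes, n_channels))
acts[:, 3] = 2.0 * target + rng.normal(scale=0.1, size=n_mazes)

r = channel_correlations(acts, target)
best = int(np.argmax(np.abs(r)))
print("most correlated channel:", best)
```

Running this per layer and ranking channels by absolute correlation would give a rough first pass at where to look. A high correlation is only suggestive, of course, and says nothing by itself about whether the channel is causally used.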