Junior Alignment Researcher
abhatt349
Small nitpick, but is this meant to say instead? Because if , then the axiom reduces to , which seems impossible to satisfy for all (for nearly all preference relations).
My rough guess for Question 2.1:
The model likely cares about number of characters because it allows it to better encode things with fixed-width fonts that contain some sort of spatial structure, such as ASCII art, plaintext tables, 2-D games like sudoku, tic-tac-toe, and chess, and maybe miscellaneous other things like some poetry, comments/strings in code[1], or the game of life.
A priori, storing this feature categorically is probably a far more efficient encoding/representation than linearly (especially since length likely has at most 10 common values). However, the most useful/common operation one might want to do with this feature is “compute the length of the concatenation of two tokens,” and so we also want our encodings to facilitate efficient addition. For a categorical embedding, we’d need to store an addition lookup table, which requires something like quadratic space[2], whereas a linear embedding would allow sums to be computed basically trivially[3].
This argument isn’t enough on its own, since we also need to move the stored length info between tokens in order to add them, which is severely bottlenecked by the low rank of attention heads. If this were “more of a bottleneck” than the type of MLP computation that’s necessary to implement an addition table, then it’d make sense to store length categorically instead.
I don’t know if I could’ve predicted which bottleneck would’ve won out before seeing this post. I suspect I would’ve guessed the MLP computation (implying a linear representation), but I wouldn’t have been very confident. In fact, I wouldn’t be surprised if, despite length being linearly represented, there are still a few longer outlier tokens (that are particularly common in the context of length-relevant tasks) whose lengths are stored categorically and then added using something like a smaller lookup table.
[1] The code itself would, of course, be the biggest example, but I’m not sure how relevant non-whitespace token length is for most formatting.
[2] In particular, you’d need a lookup table of size at least $N \times M$, where $N$ is the length of the longest single string you’d want to track the length of, and $M$ is the length of the longest token. I expect $N$ to be on the order of hundreds, and $M$ to be at most about 10 (since we can ignore a few outlier tokens).
[3] Linear operations are pretty free, and addition of linearly represented features is as linear as it gets.
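To make the space/compute tradeoff above concrete, here is a minimal sketch of the two ways of adding lengths. The sizes (300 and 10), the 64-dimensional space, and all function names are assumptions chosen for illustration, not anything measured from a real model.

```python
import numpy as np

MAX_RUNNING = 300  # assumed longest running length worth tracking ("hundreds")
MAX_TOKEN = 10     # assumed longest token length

# Categorical route: one table entry per (running length, token length) pair.
add_table = np.add.outer(np.arange(MAX_RUNNING + 1), np.arange(MAX_TOKEN + 1))

def add_categorical(running: int, token: int) -> int:
    """Look the sum up; the table has (MAX_RUNNING+1) * (MAX_TOKEN+1) entries."""
    return int(add_table[running, token])

# Linear route: length is a scalar coefficient on one shared unit direction.
rng = np.random.default_rng(0)
direction = rng.normal(size=64)
direction /= np.linalg.norm(direction)

def encode(length: int) -> np.ndarray:
    """Represent a length as a multiple of the shared direction."""
    return length * direction

def decode(vec: np.ndarray) -> float:
    """Read the length back off by projecting onto the direction."""
    return float(vec @ direction)

# Adding two linearly encoded lengths is just vector addition.
total = decode(encode(7) + encode(5))
```

The categorical version pays for every pair up front, while the linear version needs no stored table at all. Neither version models the cost of moving the length between token positions, which is the other bottleneck discussed above.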
[Note: One idea is to label the dataset with the feature, e.g. marking whether each text is a LaTeX $ or not. Then train several k-sparse probes and show the range of k values that gets you a given percentage of separation.]
You’ve already thanked Wes, but just wanted to note that his paper may be of interest here.
If you’re interested, “When is Goodhart catastrophic?” characterizes some conditions on the noise and signal distributions (or rather, their tails) that are sufficient to guarantee being screwed (or in business) in the limit of many studies.
The downside is that because it doesn’t make assumptions about the distributions (other than independence), it sadly can’t say much about the non-limiting cases.
Very small typo: when you define LayerNorm, you say when I think you mean ? Please feel free to ignore if this is wrong!!!
I do agree that looking at alone seems a bit misguided (unless we’re normalizing by looking at cosine similarity instead of dot product). However, the extent to which this is true is a bit unclear. Here are a few considerations:
At first blush, the thing you said is exactly right; scaling up and scale down will leave the implemented function unchanged.
However, this’ll affect the L2 regularization penalty. All else equal, we’d expect to see , since that minimizes the regularization penalty.
However, this is all complicated by the fact that you can also alternatively scale the LayerNorm’s gain parameter, which (I think) isn’t regularized.
Lastly, I believe GPT2 uses GELU, not ReLU? This is significant, since it no longer allows you to scale and without changing the implemented function.
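Here is a rough numerical sketch of the considerations above, assuming a bias-free two-layer MLP with hypothetical weights `w_in` and `w_out` (my names, and the sizes are arbitrary). It checks that rescaling leaves a ReLU MLP unchanged but changes a GELU one, and that the L2 penalty along the rescaling direction is smallest when the two norms match. GELU here is the tanh approximation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w_in, w_out, act):
    """Bias-free MLP: act(x @ w_in) @ w_out."""
    return act(x @ w_in) @ w_out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
w_in = rng.normal(size=(16, 32))
w_out = rng.normal(size=(32, 16))
c = 3.0  # scale w_in up by c and w_out down by c

# ReLU is positively homogeneous, so the rescaled network computes the same function.
relu_gap = np.abs(mlp(x, c * w_in, w_out / c, relu) - mlp(x, w_in, w_out, relu)).max()
# GELU is not, so the same rescaling changes the output.
gelu_gap = np.abs(mlp(x, c * w_in, w_out / c, gelu) - mlp(x, w_in, w_out, gelu)).max()

def l2_penalty(scale):
    """Sum of squared weights after rescaling by `scale`."""
    return np.sum((scale * w_in) ** 2) + np.sum((w_out / scale) ** 2)

# Minimizing a^2 s^2 + b^2 / s^2 over s gives s = sqrt(b / a), i.e. equal norms.
best = np.sqrt(np.linalg.norm(w_out) / np.linalg.norm(w_in))
```

This ignores biases and the LayerNorm gain, so it only illustrates the first, second, and last points, not the complication in between.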
We are surprised by the decrease in Residual Stream norm in some of the EleutherAI models.
...
According to the model card, the Pythia models have “exactly the same” architectures as their OPT counterparts
I could very well be completely wrong here, but I suspect this could primarily be an artifact of different unembeddings. It seemed to me from the model card that although the Pythia models have “exactly the same” architecture, they only have the same number of non-embedding parameters. The Pythia models all have more total parameters than their counterparts and therefore more embedding parameters, implying that they’re using a different embedding/unembedding scheme. In particular, the EleutherAI models use the GPT-NeoX-20B tokenizer instead of the GPT-2 tokenizer (they also use rotary embeddings, which I don’t expect to matter as much).
In addition, all the decreases in Residual Stream norm occur in the last 2 layers, which is exactly where I would’ve expected to see artifacts of the embedding/unembedding process[1]. I’m not familiar enough with the differences in the tokenizers to have predicted the decreasing Residual Stream norm ex ante, but it seems kinda likely ex post that whatever’s causing this large systematic difference in EleutherAI models’ norms is due to them using a different tokenizer.
[1] I also would’ve expected to see these artifacts in the first layer, which we don’t really see, so take this with a grain of salt, I guess. I do still think this is pretty characteristic of “SGD trying its best to deal with unembedding shenanigans by doing weird things in the last layer or two, leaving the rest mostly untouched,” but this might just be me pattern-matching to a bad internal narrative/trope I’ve developed.
Hmmm, I suspect that when most people say things like “the reward function should be a human-aligned objective,” they’re intending something more like “the reward function is one for which any reasonable learning process, given enough time/data, would converge to an agent that ends up with human-aligned objectives,” or perhaps the far weaker claim that “the reward function is one for which there exists a reasonable learning process that, given enough time/data, will converge to an agent that ends up with human-aligned objectives.”
I guess that I’m imagining that the {presence of a representation of a path}, to the extent that it’s represented in the model at all, is used primarily to compute some sort of “top-right affinity” heuristic. So even if it is true that, when there’s no representation of a path, subtracting the {representation of a path}-vector should do nothing, I think that subtracting the “top-right affinity” vector that’s downstream of this path representation should still do something regardless of whether there is or isn’t currently a path representation.
So I guess the disagreement in our intuitions (or the intuitions suggested by our respective hypotheses) maybe just boils down to “is the thing we’re editing closer to a {path representation} or a {top-right affinity heuristic}?” Maybe this weakly implies that this effect might weaken/disappear if you tried to do your AVE at a later layer (as I suggest at the end of this comment), since that might be more likely to represent a {top-right affinity heuristic} than a {path representation}?
It’s possible, however, that I’m misunderstanding your point. To help clarify, can I ask what you mean by “representation of a path” on a slightly more mechanistic level?
Do you mean you can find some set of activations (after the edited layer) from which you can faithfully reconstruct the path to the top right?
Or do you perhaps mean something weaker, like being able to find some activation that strongly and robustly correlates with “top-right-path-existence” or “top-right-path-length”, or something like that?[1]
Or maybe you didn’t mean anything specific and were just trying to draw a comparison to other reasoning processes? If this is the case, I think I don’t quite buy that this is too likely to be informative about the maze model’s internal cognition without further justification.
Or maybe you meant something else entirely!!! I’m sure I’ve left out many very reasonable possibilities, so please do correct me when I’m wrong!
[1] Btw, it seems like a cheap and relatively informative experiment to just try computing neural correlates with variables like “distance to top-right-most reachable point” or “how close top-right-most reachable point is to the top-right”. This might be worth doing even if this isn’t what you meant by “representations of a path”, since it could shed light on what channels/layers are most important or best to perform AVE on.
And the top-right vector also transfers across mazes? Why isn’t it maze-specific?
This makes a lot of sense if the top-right vector is being used to do something like “choose between circuits” or “decide how to weight various heuristics” instead of (or in addition to) actually computing any heuristic itself. There is an interesting question of how capable the model architecture is of doing things like that, which maybe warrants thinking about.[1]
This could be either the type of thinking that looks like “try to find examples of this in the model by intelligently throwing in illuminating inputs” or the type that looks like “try to hand-write some parameters that implement ‘two subcircuits with a third circuit assigning the relative weighting between the two’, starting with smaller (but architecturally representative) toy models.”
[1] I’m concerned that this type of thinking would be overly specific to the model architecture that you happen to be using, which might not help learn about the more general phenomena of shards/values/etc, but it’s possible that it might be useful nonetheless if you’re planning on studying these models at length.
The patches compose!
In the framework of the comment above regarding the add/subtract thing, I’d also be interested in examining the function
diff(s,t) = f(input+t*top_right_vec+s*cheese_vec) - f(input)
The composition claim here is saying something like
diff(s,t) = diff(s,0) + diff(0,t)
I’d be interested to see when this is true. It seems like your current claim is that this (approximately) holds when s<0 and t>0 and neither are too large, but maybe it holds in more or fewer scenarios. In particular, I’m surprised at the weird hard boundaries at s=0 and t=0.
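As a sketch of how one might map out where the composition claim holds, the snippet below evaluates the gap between diff(s,t) and diff(s,0) + diff(0,t) on a grid. Everything here is a hypothetical stand-in: `f` is a random one-hidden-layer ReLU network rather than the maze model, and `inp`, `top_right_vec`, and `cheese_vec` are random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.normal(size=(16, 32)) / 4.0
w2 = rng.normal(size=(32, 4)) / 4.0

def f(h):
    """Toy stand-in for the layers after the intervention point."""
    return np.maximum(h @ w1, 0.0) @ w2

inp = rng.normal(size=16)            # stand-in activations at the edited layer
top_right_vec = rng.normal(size=16)  # stand-in steering vectors
cheese_vec = rng.normal(size=16)

def diff(s, t):
    return f(inp + t * top_right_vec + s * cheese_vec) - f(inp)

def composition_gap(s, t):
    """Norm of diff(s,t) - (diff(s,0) + diff(0,t)); zero means the patches compose."""
    return float(np.linalg.norm(diff(s, t) - (diff(s, 0.0) + diff(0.0, t))))

grid = np.linspace(-2.0, 2.0, 9)
# Rows index s, columns index t.
gaps = np.array([[composition_gap(s, t) for t in grid] for s in grid])
```

The gap is zero by construction along s=0 and t=0, so the informative part of `gaps` is how quickly it grows in each of the four quadrants away from those axes.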
I wish I knew why.
Same.
I don’t really have any coherent hypotheses (not that I’ve tried for any fixed amount of time by the clock) for why this might be the case. I do, however, have a couple of vague suggestions for how one might go about gaining slightly more information that might lead to a hypothesis, if you’re interested.
The main one involves looking at the local nonlinearities of the few layers after the intervention layer at various inputs, by which I mean examining
diff(t) = f(input+t*top_right_vec) - f(input)
as a function of t (for small values of t, in particular), where f = nn.Sequential({the n layers after the intervention layer}) for various small integers n.
One of the motivations for this is that it feels more confusing that [adding works and subtracting doesn’t] than that [increasing the coefficient strength does different things in different regimes, i.e. for different coefficient strengths], but if you think about it, both of those are just us being surprised/confused that the function I described above is locally nonlinear for various values of t.[1] It seems possible, then, that examining the nonlinearities in the subsequent few layers could shed some light on a slightly more general phenomenon that’ll also explain why adding works but subtracting doesn’t.
It’s also possible, of course, that all the relevant nonlinearities kick in much further down the line, which would render this pretty useless. If this turns out to be the case, one might try finding “cheese vectors” or “top-right vectors” in as late a layer as possible[2], and then re-attempt this.
[1] We only care more about the former confusion (that adding works and subtracting doesn’t) because we’re privileging t=0, which isn’t unreasonable, but perhaps zooming out just a bit will help, idk.
[2] I’m under the impression that the current layer wasn’t chosen for much of a particular reason, so it might be a simple matter to just choose a later layer that performs nearly as well?
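A toy illustration of the kind of local nonlinearity I have in mind, where adding a vector works and subtracting it does nothing. The function `f` is a hypothetical single ReLU layer, and the input is hand-picked so that one unit sits exactly at the kink. It is not meant to describe the actual maze network.

```python
import numpy as np

def f(h: np.ndarray) -> float:
    """Hypothetical stand-in for the n layers after the intervention layer."""
    return float(np.maximum(h, 0.0).sum())

inp = np.array([0.0, 1.0, -1.0])           # first unit sits at the ReLU kink
top_right_vec = np.array([1.0, 0.0, 0.0])  # steering vector along that unit

def diff(t: float) -> float:
    return f(inp + t * top_right_vec) - f(inp)

ts = np.linspace(-1.0, 1.0, 9)
curve = np.array([diff(t) for t in ts])

# For a locally linear f, diff(-t) == -diff(t), so this would be all zeros.
asymmetry = np.array([diff(t) + diff(-t) for t in ts])
```

Here `diff(t)` equals `t` for positive `t` and is flat at zero for negative `t`, so `asymmetry` is nonzero everywhere except `t=0`. Plotting the same two curves for the real layers, at several inputs and several values of n, is the experiment suggested above.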
Copying: how much the head output increases the logit of [A] compared to the other logits.
Please correct me if I’m wrong, but I believe you mean [B] here instead of [A]?
From the “Conclusion and Future Directions” section of the colab notebook:
Most of all, we cannot handwave away LayerNorm as “just doing normalization”; this would be analogous to describing ReLU as “just making things nonnegative”.
I don’t think we know too much about what exactly LayerNorm is doing in full-scale models, but at least in smaller models, I believe we’ve found evidence of transformers using LayerNorm to do nontrivial computations[1].
[1] I think I vaguely recall something about this in either Neel Nanda’s “Rederiving Positional Encodings” stuff, or Stefan Heimersheim + Kajetan Janiak’s work on 4-layer Attn-only transformers, but I could totally be misremembering, sorry.
Sorry for the mundane comment, but in the “Isolating the Nonlinearity” section of the colab notebook, you say
Note that a vector in dimensions with mean 0 has variance 1 if and only if it has length
I think you might’ve meant to say there instead of , but please do correct me if I’m wrong!!!
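For reference, a quick numerical check of the relationship in question, assuming population variance (dividing by n rather than n-1): a mean-0 vector x in n dimensions has variance ||x||^2 / n, so variance 1 corresponds to length sqrt(n). The dimension 512 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512  # arbitrary dimension for the check

x = rng.normal(size=n)
x -= x.mean()  # enforce mean 0
x /= x.std()   # enforce population variance 1 (ddof=0)

length = np.linalg.norm(x)
expected = np.sqrt(n)  # variance = ||x||^2 / n, so ||x|| = sqrt(n)
```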
Sorry for the pedantic comment, but I think you might’ve meant to have in the denominator here.
Thanks for the great post! I have a question, if it’s not too much trouble:
Sorry for my confusion about something so silly, but shouldn’t the following be “when ”?
When there is no place where the derivative vanishes
I’m also a bit confused about why we can think of as representing “which moment of the interference distribution we care about.”
Perhaps some of my confusion here stems from the fact that it seems to me that the optimal number of subspaces, , is an increasing function of , which doesn’t seem to line up with the following:
Hence when is large we want to have fewer subspaces
What am I missing here?
places which seem like canonical examples of very-probably-messy-territory repeatedly turn out to not be so messy
May I ask for a few examples of this?
The claim definitely seems plausible to me, but I can’t help but think of examples like gravity or electromagnetism, where every theory to date has underestimated the messiness of the true concept. It’s possible that these aren’t really much evidence against the claim but rather indicative of a poor ontology:
People who expect a “clean” territory tend to be shocked by how “messy” the world looks when their original ontology/model inevitably turns out to not fit it very well.
However, it feels hard to differentiate (intuitively or formally) cases where our model is a poor fit from cases where the territory is truly messy. Without being able to confidently make this distinction, the claim that the territory itself isn’t messy seems a bit unfalsifiable. Any evidence of territories turning out to be messy could be chalked up to ill-fitting ontologies.
Hopefully, seeing more examples like the competitive markets for nails will help me better differentiate the two or, at the very least, help me build intuition for why less messy territories are more natural/common.
Thank you!
If it’s easy enough to run, it seems worth re-training the probes exactly the same way, except sampling both your train and test sets with replacement from the full dataset. This should avoid that issue. It has the downside of allowing some train/test leakage, but that seems pretty fine, especially if you only sample like 500 examples for train and 100 for test (from each of cities and neg_cities).
I’d strongly hope that after doing this, none of your probes would be significantly below 50%.
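In case it helps, here is a minimal sketch of the resampling I’m suggesting. The dataset size of 1200 and the function name are placeholders, and the probe training itself is left out. The idea is just to draw train and test indices independently, with replacement, from the full set of examples for each of cities and neg_cities.

```python
import numpy as np

def sample_split(n_examples: int, n_train: int = 500, n_test: int = 100, seed: int = 0):
    """Draw train and test indices with replacement from range(n_examples)."""
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, n_examples, size=n_train)
    test_idx = rng.integers(0, n_examples, size=n_test)
    return train_idx, test_idx

# Placeholder dataset sizes; substitute the real lengths of cities and neg_cities.
cities_train, cities_test = sample_split(n_examples=1200, seed=0)
neg_train, neg_test = sample_split(n_examples=1200, seed=1)
```

Because both index sets come from the same pool, some examples will appear in both train and test. That is the leakage mentioned above, traded for train and test sets that come from the same distribution.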