Something like this is the hope, though it’s a bit tricky because features that represent “human expert level intelligence” might be hard to distinguish from features for “actually correct” using only current feature interpretation techniques (mostly looking at maximally activating dataset exemplars). But it seems pretty plausible that we could develop better interpretation techniques that would be suitable here.
I give a counterargument to this in the typo-riddled, poorly-written Tweet here. Sadly I won’t have a chance to write up thoughts here more cleanly for a few days.
ETA: Briefly, the key points are:
Honesty issues for introspection aren’t obviously much worse than they are for simple probing. (But fair if you’re already not excited about simple probing.)
When you can ask models arbitrary questions about their cognition, I think it’s probably quite difficult for a model to tell on which inputs it can get away with lying.
While I agree the example in Sycophancy to Subterfuge isn’t realistic, I don’t follow how the architecture you describe here precludes it. I think a pretty realistic set-up for training an agent via RL would involve computing scalar rewards on the execution machine or some other machine that could be compromised from the execution machine (with the scalar rewards being sent back to the inference machine for backprop and parameter updates).
I continue to think that capabilities from in-context RL are and will be a rounding error compared to capabilities from training (and of course, compute expenditure in training has also increased quite a lot in the last two years).
I do think that test-time compute might matter a lot (e.g. o1), but I don’t expect that things which look like in-context RL are an especially efficient way to make use of test-time compute.
Why would it 2x the cost of inference? To be clear, my suggested baseline is “attach exactly the same LoRA adapters that were used for RR, plus one additional linear classification head, then train on an objective which is similar to RR but where the rerouting loss is replaced by a classification loss for the classification head.” Explicitly this is to test the hypothesis that RR only worked better than HP because it was optimizing more parameters (but isn’t otherwise meaningfully different from probing).
(Note that LoRA adapters can be merged into model weights for inference.)
(I agree that you could also just use more expressive probes, but I’m interested in this as a baseline for RR, not as a way to improve robustness per se.)
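To make the suggested baseline concrete, here's a rough sketch of the kind of setup I have in mind (this is not the authors' code; the model name, probe layer, data variables, and hyperparameters are placeholder assumptions):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Same LoRA placement as the RR run (hypothetical settings), plus one linear probe head.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=16, target_modules=["q_proj", "v_proj"]))
PROBE_LAYER = 20  # placeholder: which residual-stream layer the probe reads from
probe = torch.nn.Linear(model.config.hidden_size, 1)

def harmfulness_loss(batch, labels):
    # Classification loss replaces the rerouting loss: the probe should flag harmful inputs.
    hs = model(**batch, output_hidden_states=True).hidden_states[PROBE_LAYER]
    logits = probe(hs[:, -1, :]).squeeze(-1)   # probe the last-token activation
    return F.binary_cross_entropy_with_logits(logits, labels.float())

def retain_loss(batch):
    # Standard LM loss on benign data so the LoRA-adapted model retains capabilities.
    return model(**batch, labels=batch["input_ids"]).loss

# Jointly optimize the LoRA parameters and the probe, mirroring RR's optimization budget.
trainable = [p for p in model.parameters() if p.requires_grad] + list(probe.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```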
Thanks to the authors for the additional experiments and code, and to you for your replication and write-up!
IIUC, RR makes use of LoRA adapters whereas HP is only an LR probe, meaning that RR is optimizing over a more expressive space. Does it seem likely to you that RR would beat an HP implementation that jointly optimizes LoRA adapters + a linear classification head (out of some layer) so that the model retains performance while also having the linear probe function as a good harmfulness classifier?
(It’s been a bit since I read the paper, so sorry if I’m missing something here.)
Anthropic has asked employees
[...]
Anthropic has offered at least one employee
As a point of clarification: is it correct that the first quoted statement above should be read as “at least one employee” in line with the second quoted statement? (When I first read it, I parsed it as “all employees” which was very confusing since I carefully read my contract both before signing and a few days ago (before posting this comment) and I’m pretty sure there wasn’t anything like this in there.)
(I’m a full-time employee at Anthropic.) It seems worth stating for the record that I’m not aware of any contract I’ve signed whose contents I’m not allowed to share. I also don’t believe I’ve signed any non-disparagement agreements. Before joining Anthropic, I confirmed that I wouldn’t be legally restricted from saying things like “I believe that Anthropic behaved recklessly by releasing [model]”.
there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:
As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.
You’re correct that I was neglecting this threat model—good point, and thanks.
So, given this overall uncertainty, it seems like we should have a much fuzzier update where higher numbers should actually update us.
Hmm, here’s another way to frame this discussion:
Sam: Before evaluating the RTFs in the paper, you should apply a ~sigmoid function with the transition region centered on ~10^-6 or something. Note that all of the RTFs we’re discussing come after this transition.
Ryan: While it’s true that we should first be applying a ~sigmoid, we should have uncertainty over where the transition region begins. When taking this uncertainty into account, the way to decide importance boils down to comparing the observed RTF to your distribution of thresholds in a way which is pretty similar to what people were implicitly doing anyway.
Does that sound like a reasonable summary? If so, I think I basically agree with you.
(I do think that it’s important to keep in mind that for threat models where reward hacking is reinforced, small RTFs might matter a lot, but maybe that was already obvious to everyone.)
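To make this framing concrete, here's a toy calculation (my own; the log-uniform prior over thresholds is an arbitrary assumption) of what the "fuzzier" version of the sigmoid looks like:

```python
import numpy as np

# Sharp-threshold view: concerning iff RTF > ~1e-6. Fuzzy view: put a prior over where
# the threshold lies (log-uniform on [1e-8, 1e-4] here, purely for illustration) and
# score an observed RTF by the probability that it exceeds the threshold.
rng = np.random.default_rng(0)
thresholds = 10 ** rng.uniform(-8, -4, size=100_000)

def concern_score(rtf: float) -> float:
    return float((thresholds < rtf).mean())

for rtf in [1e-7, 1e-6, 1e-5, 1e-4, 2e-4]:
    print(f"RTF={rtf:.0e}: P(RTF exceeds threshold) ~= {concern_score(rtf):.2f}")
# 1e-7 vs 1e-5 moves the score a lot; 1e-4 vs 2e-4 barely moves it at all.
```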
In this comment, I’ll use reward tampering frequency (RTF) to refer to the proportion of the time the model reward tampers.
I think that in basically all of the discussion above, folks aren’t using a correct mapping of RTF to practical importance. Reward hacking behaviors are positively reinforced once they occur in training; thus, there’s a rapid transition in how worrying a given RTF is, based on when reward tampering becomes frequent enough that it’s likely to appear during a production RL run.
To put this another way: imagine that this paper had trained on the final reward tampering environment the same way that it trained on the other environments. I expect that:
the RTF post-final-training-stage would be intuitively concerning to everyone
the difference between 0.01% and 0.02% RTF pre-final-training-stage wouldn’t have a very big effect on RTF post-final-training-stage.
In other words, insofar as you’re updating on this paper at all, I think you should be updating based on whether the RTF > 10^-6 (not sure about the exact number, but something like this) and the difference between RTFs of 1e-4 and 2e-4 just doesn’t really matter.
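As a back-of-the-envelope illustration of why I think the threshold-y framing is right (the episode count is a made-up assumption, not a number from the paper):

```python
# What matters is whether tampering is likely to be sampled, and hence reinforced,
# at least once during a production RL run.
N_EPISODES = 1_000_000  # assumed number of rollouts in a production RL run

for rtf in [1e-7, 1e-6, 1e-4, 2e-4]:
    p_at_least_once = 1 - (1 - rtf) ** N_EPISODES
    print(f"RTF={rtf:.0e}: P(tampering sampled at least once) ~= {p_at_least_once:.3f}")
# RTF=1e-7 -> ~0.10, 1e-6 -> ~0.63, and both 1e-4 and 2e-4 -> ~1.000:
# once past the transition, doubling the RTF barely changes the conclusion.
```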
Separately, I think it’s also defensible to not update much on this paper because it’s very unlikely that AI developers will be so careless as to allow models they train a chance to directly intervene on their reward mechanism. (At least, not unless some earlier thing has gone seriously wrong.) (This is my position.)
Anyway, my main point is that if you think the numbers in this paper matter at all, it seems like they should matter in the direction of being concerning. [EDIT: I mean this in the sense of “If reward tampering actually occurred with this frequency in training of production models, that would be very bad.” Obviously for deciding how to update you should be comparing against your prior expectations.]
This was awesome. Here are some more stories in the same style.
Homeless person or professor?
It can be hard to tell in Cambridge, Massachusetts. That’s partly because some professors—mostly the MIT ones—can look very disheveled. But partly it’s because some homeless people can be surprisingly intellectual, e.g. it’s not uncommon to find homeless people crouched in the shade reading a book.
My favorite example is a homeless man in Harvard Square. His name in my head is “Black Santa” because he’s an old man with a full belly and white beard, and he’s always surrounded by trash-bag-sacks not of toys, but of his possessions. He’s always in the same spot, a stretch of Harvard Square that lots of homeless people hang out in. But while the other, mostly young, homeless people in the area typically spend their time begging or zonked out, I only ever see Black Santa writing in a small notebook.
What’s he writing all the time? As best I can tell from peeping over his shoulder as I pass by, he’s writing poetry. Sometimes I spot him reciting it out loud before he goes back to scribbling.
Some day I hope to muster the courage to strike up a conversation out of the blue with him and learn more.
Remember that time I broke all my limbs at once...?
Most of my chats with homeless people happened when I had a fractured leg and was moving around on crutches. I think for the local homeless people—many of whom have disabilities, and essentially all of whom have friends with disabilities—the crutches made me seem more familiar. Or maybe it was just that I was spending more time sitting at bus stops with nothing to do but talk.
The most interesting, and the saddest, conversation I had was with a man who recognized my knee brace: “oh yeah I had one of those once too. Two of them actually.”
“Did you break your leg twice? I’m sorry.”
“Well, two of them at the same time I mean—both my legs were broken. Both my arms too. I got hit by a car and it broke both my arms and legs.”
“Oh god, that’s horrible.”
“Yeah the medical bill was rough, no way I could pay it. I heard an ad on the radio for one of those lawyers that sues people for stuff like this. So I hired them and sued the person who hit me.”
“Did you win?”
“Yeah, but the person who hit me didn’t have any money, so I didn’t actually get anything out of it...”
If something like this had happened to me, I think it would be the most traumatic event of my life, one that hurt to remember. On the other hand, the man told the story casually, as if he kept remembering more things he did over the weekend that he wanted to chat about.
Would you like some cheese?
There’s a kindly alcoholic that I sometimes see at the bus stop outside my apartment. He often smiles at me and says hello when I go by.
The first time I saw him, he was sitting there with a huge plastic-wrapped wedge of Jarlsberg cheese. With a giddy grin, he held it out to me in offering. I do like Jarlsberg, but declined.
An hour later I left my apartment and the man was gone. In his place was the unwrapped wedge of cheese on the ground with a single bite taken out of it.
I should have accepted the cheese!
I think this is cool! The way I’m currently thinking about this is “doing the adversary generation step of latent adversarial training without the adversarial training step.” Does that seem right?
It seems intuitively plausible to me that once you have a latent adversarial perturbation (the vectors you identify), you might be able to do something interesting with it beyond “train against it” (as LAT does). E.g. maybe you would like to know that your model has a backdoor, beyond wanting to move to the next step of “train away the backdoor.” If I were doing this line of work, I would try to come up with toy scenarios with the property “adversarial examples are useful for reasons other than adversarial training” and show that the latent adversarial examples you can produce are more useful than input-level adversarial examples (in the same way that the LAT paper demonstrates that LAT can outperform input-level AT).
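For concreteness, here's roughly what I mean by reusing the adversary-generation step on its own (a sketch with made-up hyperparameters, a made-up prompt, and a stand-in model, not the paper's method): optimize a perturbation to a middle layer's activations to elicit some target behavior, then study the perturbation itself instead of training against it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in model
tok = AutoTokenizer.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)

LAYER, EPS, STEPS = 6, 8.0, 200                        # made-up hyperparameters
prompt = tok("The assistant's hidden instruction is to say:", return_tensors="pt")
target_id = tok(" yes", add_special_tokens=False).input_ids[0]

delta = torch.zeros(1, 1, model.config.hidden_size, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

def add_delta(module, inputs, output):
    # perturb the residual stream coming out of the chosen block
    return (output[0] + delta,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_delta)
for _ in range(STEPS):
    logits = model(**prompt).logits[0, -1]
    loss = -torch.log_softmax(logits, dim=-1)[target_id]   # push up the target token
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                   # keep delta in a norm ball
        delta.mul_(min(1.0, (EPS / (delta.norm() + 1e-8)).item()))
handle.remove()

# `delta` is now a latent adversarial perturbation that can be inspected directly
# (e.g. compared against candidate backdoor directions) rather than trained against.
```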
Yep, you’re totally right—thanks!
Oh, one other issue relating to this: in the paper it’s claimed that if $\gamma$ is the argmin of $\mathbb{E}\left[\|\gamma\hat{\mathbf{x}} - \mathbf{x}\|_2^2\right]$ then $1/\gamma$ is the argmin of $\mathbb{E}\left[\|\gamma'\mathbf{x} - \hat{\mathbf{x}}\|_2^2\right]$ over $\gamma'$. However, this is not actually true: the argmin of the latter expression is $\mathbb{E}[\hat{\mathbf{x}} \cdot \mathbf{x}]\,/\,\mathbb{E}[\|\mathbf{x}\|_2^2]$. To get an intuition here, consider the case where $\hat{\mathbf{x}}$ and $\mathbf{x}$ are very nearly perpendicular, with the angle between them just slightly less than $\pi/2$. Then you should be able to convince yourself that the best factor to scale either $\hat{\mathbf{x}}$ or $\mathbf{x}$ by in order to minimize the distance to the other will be just slightly greater than 0. Thus the optimal scaling factors cannot be reciprocals of each other.
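A quick numerical illustration of the non-reciprocal point (my own toy numbers):

```python
import numpy as np

x = np.array([1.0, 0.0])
x_hat = np.array([0.05, 1.0])            # nearly perpendicular to x

g1 = (x_hat @ x) / (x_hat @ x_hat)       # argmin_g ||g * x_hat - x||^2
g2 = (x_hat @ x) / (x @ x)               # argmin_g ||g * x - x_hat||^2
print(g1, g2, 1 / g1)                    # both scalings ~0.05, while 1/g1 ~ 20
```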
ETA: Thinking on this a bit more, this might actually reflect a general issue with the way we think about feature shrinkage; namely, that whenever there is a nonzero angle between two vectors of the same length, the best way to make either vector close to the other will be by shrinking it. I’ll need to think about whether this makes me less convinced that the usual measures of feature shrinkage are capturing a real thing.
ETA2: In fact, now I’m a bit confused why your figure 6 shows no shrinkage. Based on what I wrote above in this comment, we should generally expect to see shrinkage (according to the definition given in equation (9)) whenever the autoencoder isn’t perfect. I guess the answer must somehow be “equation (10) actually is a good measure of shrinkage, in fact a better measure of shrinkage than the ‘corrected’ version of equation (10).” That’s pretty cool and surprising, because I don’t really have a great intuition for what equation (10) is actually capturing.
Ah thanks, you’re totally right—that mostly resolves my confusion. I’m still a little bit dissatisfied, though, because the $\mathcal{L}_{\text{aux}}$ term is optimizing for something that we don’t especially want (i.e. for $\mathrm{ReLU}(\pi_{\text{gate}}(\mathbf{x}))$ to do a good job of reconstructing $\mathbf{x}$). But I do see how you do need to have some sort of a reconstruction-esque term that actually allows gradients to pass through to the gating network.
(The question in this comment is more narrow and probably not interesting to most people.)
The limitations section includes this paragraph:
One worry about increasing the expressivity of sparse autoencoders is that they will overfit when reconstructing activations (Olah et al., 2023, Dictionary Learning Worries), since the underlying model only uses simple MLPs and attention heads, and in particular lacks discontinuities such as step functions. Overall we do not see evidence for this. Our evaluations use held-out test data and we check for interpretability manually. But these evaluations are not totally comprehensive: for example, they do not test that the dictionaries learned contain causally meaningful intermediate variables in the model’s computation. The discontinuity in particular introduces issues with methods like integrated gradients (Sundararajan et al., 2017) that discretely approximate a path integral, as applied to SAEs by Marks et al. (2024).

I’m not sure I understand the point about integrated gradients here. I understand this sentence as meaning: since model outputs are a discontinuous function of feature activations, integrated gradients will do a bad job of estimating the effect of patching feature activations to counterfactual values.
If that interpretation is correct, then I guess I’m confused because I think IG actually handles this sort of thing pretty gracefully. As long as the number of intermediate points you’re using is large enough that you’re sampling points pretty close to the discontinuity on both sides, then your error won’t be too large. This is in contrast to attribution patching which will have a pretty rough time here (but not really that much worse than with the normal ReLU encoders, I guess). (And maybe you also meant for this point to apply to attribution patching?)
I’m a bit perplexed by the choice of loss function for training GSAEs (given by equation (8) in the paper). The intuitive (to me) thing to do here would be to have the $\mathcal{L}_{\text{reconstruct}}$ and $\mathcal{L}_{\text{sparsity}}$ terms, but not the $\mathcal{L}_{\text{aux}}$ term, since the point of $\pi_{\text{gate}}$ is to tell you which features should be active, not to itself provide good feature coefficients for reconstructing $\mathbf{x}$. I can sort of see how not including this term might result in the coordinates of $\pi_{\text{gate}}(\mathbf{x})$ all being extremely small (but barely positive when it’s appropriate to use a feature), such that the sparsity term doesn’t contribute much to the loss. Is that what goes wrong? Are there ablation experiments you can report for this? If so, including this term still currently seems to me like a pretty unprincipled way to deal with this—can the authors provide any flavor here?
Here are two ways that I’ve come up with for thinking about this loss function—let me know if either of these are on the right track. Let $f_{\text{ReLU}}(\mathbf{x}) := \mathrm{ReLU}(\pi_{\text{gate}}(\mathbf{x}))$ denote the gated encoder, but with a ReLU activation instead of Heaviside. Note then that $f_{\text{ReLU}}$ is just the standard SAE encoder from Towards Monosemanticity.
Perspective 1: The usual loss from Towards Monosemanticity for training SAEs is $\|\mathbf{x} - \hat{\mathbf{x}}(f_{\text{ReLU}}(\mathbf{x}))\|_2^2 + \lambda\|f_{\text{ReLU}}(\mathbf{x})\|_1$ (this is the same as your $\mathcal{L}_{\text{aux}}$ and $\mathcal{L}_{\text{sparsity}}$ up to the detaching thing). But now you have this magnitude network which needs to get a gradient signal. Let’s do that by adding an additional term $\|\mathbf{x} - \hat{\mathbf{x}}(\tilde{f}(\mathbf{x}))\|_2^2$ (with $\tilde{f}$ the full gated encoder) -- your $\mathcal{L}_{\text{reconstruct}}$. So under this perspective, it’s the reconstruction term which is new, with the sparsity and auxiliary terms being carried over from the usual way of doing things.
Perspective 2 (h/t Jannik Brinkmann): let’s just add together the usual Towards Monosemanticity loss function for both the usual architecture and the new modified architecture.
However, the gradients with respect to the second term in this sum vanish because of the use of the Heaviside, so the gradient with respect to this loss is the same as the gradient with respect to the loss you actually used.
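For reference, here's my paraphrase of the three terms being discussed (a simplified sketch that ignores details like the gate/magnitude weight sharing and pre-encoder bias in the paper):

```python
import torch
import torch.nn.functional as F

def gated_sae_losses(x, W_gate, b_gate, W_mag, b_mag, W_dec, b_dec, lam):
    pi_gate = x @ W_gate + b_gate                        # which features should fire
    f_mag = F.relu(x @ W_mag + b_mag)                    # how strongly they fire
    f_tilde = (pi_gate > 0).float() * f_mag              # gated encoder: Heaviside * magnitudes

    decode = lambda f, W, b: f @ W + b
    L_reconstruct = ((decode(f_tilde, W_dec, b_dec) - x) ** 2).sum(-1).mean()
    L_sparsity = lam * F.relu(pi_gate).sum(-1).mean()
    # L_aux: ReLU(pi_gate) must reconstruct x through a frozen copy of the decoder --
    # this is the term that gives the gating path a gradient signal at all.
    L_aux = ((decode(F.relu(pi_gate), W_dec.detach(), b_dec.detach()) - x) ** 2).sum(-1).mean()
    return L_reconstruct, L_sparsity, L_aux
```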
I believe that equation (10) giving the analytical solution to the optimization problem defining the relative reconstruction bias $\gamma$ is incorrect. I believe the correct expression should be $\gamma = \mathbb{E}_{\mathbf{x}}\left[\hat{\mathbf{x}} \cdot \mathbf{x}\right] / \mathbb{E}_{\mathbf{x}}\left[\|\hat{\mathbf{x}}\|_2^2\right]$.
You could compute this by differentiating equation (9), setting it equal to 0 and solving for $\gamma$. But here’s a more geometrical argument.
By definition, $\gamma\hat{\mathbf{x}}$ is the multiple of $\hat{\mathbf{x}}$ closest to $\mathbf{x}$. Equivalently, this closest such vector can be described as the projection $\left(\frac{\hat{\mathbf{x}} \cdot \mathbf{x}}{\|\hat{\mathbf{x}}\|_2^2}\right)\hat{\mathbf{x}}$. Setting these equal, we get the claimed expression for $\gamma$.
As a sanity check, consider 1-dimensional vectors with $\hat{\mathbf{x}} = 2\mathbf{x}$: my expression gives $\gamma = 1/2$ (which is correct), but equation (10) in the paper gives $2$.
Probably I misunderstood your concern. I interpreted your concern about settings where we don’t have access to ground truth as relating to cases where the model could lie about its inner states without us being able to tell (because of lack of ground truth). But maybe you’re more worried about being able to develop a (sufficiently diverse) introspection training signal in the first place?
I’ll also note that I’m approaching this from the angle of “does introspection have worse problems with lack-of-ground-truth than traditional interpretability?” where I think the answer isn’t that clear without thinking about it more. Traditional interpretability often hill-climbs on “producing explanations that seem plausible” (instead of hill climbing on ground-truth explanations, which we almost never have access to), and I’m not sure whether this poses more of a problem for traditional interpretability vs. black-box approaches like introspection.