Oh yeah this is great, thanks! For people reading this, I’ll highlight SLT + developmental interp + mamba as areas which I think are large enough to have specific exercise sections but currently don’t
Thanks!! Really appreciate it
Thanks so much! (-:
Thanks so much, really glad to hear it’s been helpful!
Thanks, really appreciate this (and the advice for later posts!)
Yep, definitely! If you’re using MSE loss then it’s pretty straightforward to use backprop to see how importance relates to the loss function. Also, if you’re interested, I think Redwood’s paper on capacity (which is the same as what Anthropic calls dimensionality) looks at the derivative of loss wrt the capacity assigned to a given feature.
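As a rough sketch of what I mean (toy numbers, nothing from any particular exercise set): if you make the importance vector a leaf tensor in an importance-weighted MSE loss, backprop gives you d(loss)/d(importance_i) directly, which is proportional to the mean squared reconstruction error on feature i.

```python
import torch

torch.manual_seed(0)
n_features, d_hidden, batch = 5, 2, 256

# Toy setup in the style of Anthropic's toy models (illustrative numbers only)
W = torch.randn(d_hidden, n_features) * 0.3
b = torch.zeros(n_features)
importance = torch.tensor([1.0, 0.9, 0.8, 0.7, 0.6], requires_grad=True)

x = torch.rand(batch, n_features) * (torch.rand(batch, n_features) < 0.1).float()  # sparse features
x_hat = torch.relu(x @ W.T @ W + b)                                                 # reconstruction

# Importance-weighted MSE loss
loss = (importance * (x - x_hat) ** 2).mean()
loss.backward()

# d(loss)/d(importance_i) is proportional to the mean squared error on feature i,
# i.e. backprop directly tells you how each feature trades off against the loss
print(importance.grad)
```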
Thanks (-:
Good question! In the first batch of exercises (replicating toy models of interp), we play around with different importances. There are some interesting findings here (e.g. when you decrease sparsity to the point where you no longer represent all features, it’s usually the lower-importance features which collapse first). I chose not to have the SAE exercises use varying importance, although it would be interesting to play around with this and see what you get!
As for what importance represents, it’s basically a proxy for “how much a certain feature reduces loss, when it actually is present.” This can be independent of feature probability. Anthropic included it in their toy models paper in order to make those models truer to reality, in the hope that the setup could tell us more interesting lessons about actual models. From the TMS paper:
Not all features are equally useful to a given task. Some can reduce the loss more than others. For an ImageNet model, where classifying different species of dogs is a central task, a floppy ear detector might be one of the most important features it can have. In contrast, another feature might only very slightly improve performance.
If we’re talking features in language models, then importance would be “average amount that this feature reduces cross entropy loss”. I open-sourced an SAE visualiser which you can find here. You can navigate through it and look at the effect of features on loss. It doesn’t actually show the “overall importance” of a feature, but you should be able to get an idea of the kinds of situations where a feature is super loss-reducing and when it isn’t. Example of a highly loss-reducing feature: feature #8, which fires on Django syntax and strongly predicts the “django” token. This seems highly loss-reducing because (although sparse) it’s very often correct when it fires with high magnitude. On the other hand, feature #7 seems less loss-reducing, because a lot of the time it’s pushing for something incorrect (maybe there exist other features which balance it out).
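For a concrete (and very simplified) picture of what “reduces cross entropy loss” means here, you can compare the loss with a feature’s contribution to the residual stream included vs subtracted out. The sketch below just uses random tensors as stand-ins for the model and the SAE, so the number it prints is meaningless, but it shows the kind of computation involved:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_vocab, n_tokens = 64, 100, 32

# Random stand-ins for the real objects (purely illustrative)
W_U = torch.randn(d_model, d_vocab) / d_model**0.5    # unembedding
resid = torch.randn(n_tokens, d_model)                # residual stream at some layer
targets = torch.randint(0, d_vocab, (n_tokens,))      # next-token labels

# One SAE feature: its activation on each token, plus its decoder direction
feature_acts = torch.relu(torch.randn(n_tokens))
feature_dir = torch.randn(d_model) / d_model**0.5

def ce_loss(residual):
    return F.cross_entropy(residual @ W_U, targets)

loss_with = ce_loss(resid)
loss_without = ce_loss(resid - feature_acts[:, None] * feature_dir)  # feature ablated

# Positive difference = the feature was reducing loss on these tokens
print(f"loss increase from ablating the feature: {(loss_without - loss_with).item():.4f}")
```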
Winner = highest-quality solution over the time period of a month (solutions get posted at the start of the next month, along with a new problem).
Note that we’re slightly de-emphasising the competition side now that there are occasional hints which get dropped during the month in the Slack group. I’ll still credit the best solution in the Slack group & next LW post, but the choice to drop hints was to make the problem more accessible and hopefully increase the overall reach of this series.
Thanks for the latter point, glad you got that impression!
These are super valid concerns, and it’s true that there’s lots of information we won’t have for a while. That said, we also have positive evidence from the first iteration of ARENA (which is about a year old now). There were only 5 full-time participants, and they’ve all gone on to do stuff I’m excited about, including the following (note that obviously some of these 5 have done more than one of the things on this list):
internships at CHAI,
working with Owain Evans (including some recent papers),
building a community around open-source interpretability tooling,
employment by EleutherAI for interp,
participating in SERI MATS streams,
work trialling at LEAP labs,
being funded to work on independent research on ELK.
I’d also point to programs like MLAB which have similar goals and (as far as I’m aware) an even higher success rate of getting people into alignment work. Not saying that nobody from these programs goes on to do capabilities (I imagine at least a few do), but I’d be very surprised if this outweighs the positive effect from people going on to do alignment work.
One last point here—a big part of the benefit from programs like MLAB / ARENA is the connections made with people in alignment, feeling a sense of motivation & community, not just the skilling up (anecdotally, I quit my job and started working in alignment full-time after doing MLAB2, despite not then being at a point where I could apply for full-time jobs). I also get this impression from conversations w/ people who participated in & ran MLABs in the past. It’s not as simple as “go into upskilling programs, become super competent at either alignment or capabilities work, then choose one or the other”—there’s a myriad of factors which I expect to update people towards work in alignment after they go through programs like these.
If check-ins with ARENA 1.0 or 2.0 participants (or indeed MLAB participants) a year or more from now reveal that a nontrivial fraction of them are working in capabilities then I’d certainly update my position here, but I’ll preregister that this doesn’t seem at all likely to me. It’s true that alignment can be a messy field with limited opportunities and a lack of clear paths, but this is becoming less of a problem as the years go on.
Upvoted overall karma, because I think this is a valuable point to bring up and we could have done a better job discussing it in this post.
To be clear, contribution to capabilities research is a very important consideration for us, and apologies if we didn’t address this comprehensively here. A few notes to this effect:
We selected strongly on prior alignment familiarity (particularly during the screening & interview process), and advertised it in locations we expected to mainly attract people who had prior alignment familiarity
We encouraged interaction with the alignment researchers who were using the space while the program was running (e.g. SERI MATS scholars & other independent researchers), and often had joint talks and discussion groups
As mentioned in this writeup, many of the participants (nearly a majority) are still working on some form of their capstone projects. Additionally, every participant chose their project to be alignment-related in some way (although this isn’t something we made strictly compulsory)
Participants uniformly responded on the survey that they considered themselves very likely to pursue a career in technical AI safety (median probability on this was over 70%), although this kind of data isn’t necessarily reflective
Subjectively, we had the opportunity to get to know each participant during the program, discussing their future plans and their perspectives on alignment. There’s definitely a non-zero risk that people who are upskilled in this kind of way won’t go on to do safety research, but we feel pretty confident that we (1) found people for whom this risk was already low, and (2) minimized it to the best of our ability
Hi, sorry for the late response! The layer 0 attention head should have query at position 1, and value at position 0 (same as key). Which diagram are you referring to?
(context: I ran the most recent iteration of ARENA, and after this I joined Neel Nanda’s mech interp stream in SERI MATS)
Registering a strong pushback to the comment on ARENA. The primary purpose of capstone projects isn’t to turn people into AI safety technical researchers or to produce impressive capstones, it’s to give people engineering skills & experience working on group projects. The initial idea was not to even push for things that were safety-specific (much like Redwood’s recommendations—all of the suggested MLAB2 capstones were either mech interp or non-safety, iirc). The reason many people gravitated towards mech interp is that they spent a lot of time around researchers and people who were doing interesting work in mech interp, and it seemed like a good fit for both getting a feel for AI safety technical research and for general skilling up in engineering.
Additionally, I want to mention that participant responses to the question “how have your views on AI safety changed?” included both positive and negative updates on mech interp, but much more uniformly showed positive updates on AI safety technical research as a whole. Evidence like this updates me away from the hypothesis that mech interp is pulling safety researchers from other disciplines. To give a more personal example, I had done alignment research before being exposed to mech interp, but none of it made much of an impression on me. I didn’t choose mech interp instead of other technical safety research, I chose it instead of a finance career.
This being said, there is an argument that ARENA (at least the most recent iteration) had too much of a focus on mech interp, and this is something we may try to rectify in future iterations.
Yep, the occlusion effect is pretty large for colored images, which is why I use a layering system (e.g. 20% of all white threads, then 20% of all blue, then 20% of black, and cycle through). I go in reverse order, so the ones found first by the algorithm are the last ones to be placed. I also put black on top and white lowest down, because white on top looks super jarring. In the colab you can play around with the order of the threads: if you reverse the order then the image looks really bad. You can also create gifs of the image forming, and see that the first threads to be added always look like random noise (because by the time the algorithm picks those lines, it’s probably already matched all the important features of that color).
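To make the layering concrete, here’s roughly the ordering logic in code (hypothetical function and variable names, and the actual colab differs in the details):

```python
def build_placement_order(lines_by_color, n_layers=5, color_order=("white", "blue", "black")):
    """Return the order in which to physically place (color, line) pairs on the frame."""
    # Reverse within each color: lines found last by the algorithm get placed first
    reversed_lines = {c: list(reversed(lines)) for c, lines in lines_by_color.items()}
    placement = []
    for layer in range(n_layers):
        for color in color_order:  # white goes down earliest in each cycle, black last (on top)
            lines = reversed_lines[color]
            lo = layer * len(lines) // n_layers
            hi = (layer + 1) * len(lines) // n_layers
            placement.extend((color, line) for line in lines[lo:hi])
    return placement
```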
The pieces do get pretty thick, usually about 1cm. Actually, for the gantry (the one shown in the video) I had to intervene to raise its height every hour or so, otherwise the arm would get snagged on the threads.
Multi-color dithering looks cool, I hadn’t come across that before, although I suspect it wouldn’t be much of an improvement (e.g. because you can apply a Gaussian blur to the monochrome images you get from dithering without it really changing the result much).
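If you want to check that intuition, a quick way (not something from my code, just a sketch using PIL with a hypothetical input file) is to dither a grayscale image, blur it, and compare against the original:

```python
import numpy as np
from PIL import Image, ImageFilter

img = Image.open("photo.jpg").convert("L")        # any grayscale image (hypothetical filename)
dithered = img.convert("1")                       # PIL's default Floyd-Steinberg dithering
blurred = dithered.convert("L").filter(ImageFilter.GaussianBlur(radius=3))

# Small difference = blurring the dithered image roughly recovers the original,
# which is why the choice of dithering scheme shouldn't matter much here
diff = np.abs(np.asarray(blurred, float) - np.asarray(img, float))
print(f"mean absolute pixel difference: {diff.mean():.1f} / 255")
```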
Thanks so much for this comment, I really appreciate it! Glad it was helpful for you 🙂
Thanks, really appreciate it!
Yep that’s right, thanks! Corrected.
Huh, interesting! I wasn’t aware of this, thanks for sending it!
Thanks for the suggestion! I’ve edited the first diagram to clarify things, is this what you had in mind?
I don’t know of specific examples, but this is the image I have in my head when thinking about why untied weights are more free than tied weights:
More generally, I think this is why studying SAEs in the TMS setup can be a bit challenging: there’s often too much symmetry and not enough complexity for untied weights to be useful, meaning that just forcing your weights to be tied can fix a lot of problems! (We include it in ARENA mostly to illustrate key concepts, not because it gets you many super informative results.) But I’m keen for more work like this trying to understand feature absorption better in more tractable cases
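For anyone who hasn’t seen the setup: “tied” just means the decoder is forced to be the transpose of the encoder, while “untied” gives the decoder its own free parameters. A minimal sketch (not the actual ARENA implementation):

```python
import torch
import torch.nn as nn

class ToySAE(nn.Module):
    def __init__(self, d_in, d_sae, tied=False):
        super().__init__()
        self.tied = tied
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_in))
        if not tied:
            # Untied: decoder directions are free to rotate away from the encoder directions
            self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)

    def forward(self, x):
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        W_dec = self.W_enc.T if self.tied else self.W_dec  # tied: decoder = encoder transpose
        return acts @ W_dec + self.b_dec, acts
```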