Yeah, if you’re doing this, you should definitely precompute and save activations.
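A minimal sketch of the precompute-then-cache pattern (a toy numpy "layer" stands in for a real model; the function names and file path are made up for illustration):

```python
import numpy as np

# Toy stand-in for a model layer; in practice you'd run the real model
# once over your dataset and cache activations via hooks.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))

def layer(x):
    # toy ReLU layer, standing in for e.g. an MLP or residual stream read
    return np.maximum(x @ W, 0.0)

inputs = rng.standard_normal((100, 16))  # stand-in for a batch of prompts

# Pass 1 (expensive): compute activations once and save them to disk.
acts = layer(inputs)
np.save("/tmp/cached_acts.npy", acts)

# Every later experiment (cheap): load the cache instead of re-running.
cached = np.load("/tmp/cached_acts.npy")
```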
Neel Nanda
I’ve been really enjoying voice-to-text + LLMs recently, via a great Mac app called Super Whisper (which can work with local speech-to-text models, so could also possibly be used for confidential stuff). Combining Super Whisper, Claude, and Cursor means I can just vaguely ramble at my laptop about what experiments should happen and they happen. It’s magical!
I agree that OpenAI training on Frontier Math seems unlikely, and not in their interests. The thing I find concerning is that having high quality evals is very helpful for finding capabilities improvements—ML research is all about trying a bunch of stuff and seeing what works. As benchmarks saturate, you want new ones to give you more signal. If Epoch have a private benchmark they only apply to new releases, this is fine, but if OpenAI can run it whenever they want, this is plausibly fairly helpful for making better systems faster, since this makes hill climbing a bit easier.
This looks extremely comprehensive and useful, thanks a lot for writing it! Some of my favourite tips (like clipboard managers and rectangle) were included, which is always a good sign. And I strongly agree with “Cursor/LLM-assisted coding is basically mandatory”.
I passed this on to my mentees—not all of this transfers to mech interp, in particular the time between experiments is often much shorter (eg a few minutes, or even seconds) and often almost an entire project is in de-risking mode, but much of it transfers. And the ability to get shit done fast is super important.
This seems fine to me (you can see some reasons I like Epoch here). My understanding is that most Epoch staff are concerned about AI risk, though tend toward longer timelines and maybe lower p(doom) than many in the community, and they aren’t exactly trying to keep this secret.
Your argument rests on an implicit premise that Epoch talking about “AI is risky” in their podcast is important, eg because it’d change the minds of some listeners. This seems fairly unlikely to me—it seems like a very inside baseball podcast, mostly listened to by people already aware of AI risk arguments, and likely that Epoch is somewhat part of the AI risk-concerned community. And, generally, I don’t think that all media produced by AI risk concerned people needs to mention that AI risk is a big deal—that just seems annoying and preachy. I see Epoch’s impact story as informing people of where AI is likely to go and what’s likely to happen, and this works fine even if they don’t explicitly discuss AI risk.
I don’t know much about CTF specifically, but based on my maths exam/olympiad experience I predict that there are a lot of tricks for going fast (common question archetypes, saved code snippets, etc) that will be top of mind for people actively practicing, but not for someone with a lot of domain expertise who doesn’t explicitly practice CTF. I also don’t know how important speed is for being a successful cyber professional. They might be able to get some of this speed-up with a bit of practice, but I predict that by default there’s a lot of room for improvement.
Tagging @philh @bilalchughtai @eventuallyalways @jbeshir in case this is relevant to you (though pooling money to get GWWC interested in helping may make more sense, if it can enable smaller donors and has lower fees)
For anyone considering large-ish donations (in the thousands), there are several ways to do this in general for US non-profits, as a UK taxpayer. (Several of these also work for people who pay tax in both the US and UK)
The one I’d recommend here is using the Anglo-American Charity—you can donate to them tax-deductibly (a UK charity) and they’ll send it to a US non-profit. I hear from a friend that they’re happy to forward it to every.org, so this should be easy here. The main annoying thing is fees—for amounts below £15K it’s max(4%, £250) (so 4% above £6,250, a flat fee of £250 below).
Fee structure: 4% on donations under £15k, 3% on gifts between £15,001 and £50,000, and 2% on gifts over £50k. Minimum gift £1,000; minimum fee £250.
But this is still a major saving, even in the low thousands, especially if you’re in a high tax bracket. Though it may make more sense for you to donate to another charity.
Another option is the Charities Aid Foundation’s donor advised gift, which is a similar deal with worse fees: max(4%, £400) on amounts below £150K.
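As a rough sketch of the fee arithmetic in the two schedules above (the behaviour at exactly £15k is my assumption about how the bands resolve; function names are made up):

```python
def aac_fee(amount):
    # Anglo-American Charity schedule as described above: 4% under £15k
    # (with a £250 minimum fee), 3% from £15,001 to £50,000, 2% above £50k.
    assert amount >= 1_000, "minimum gift is £1,000"
    if amount < 15_000:
        return max(0.04 * amount, 250.0)
    elif amount <= 50_000:
        return 0.03 * amount
    return 0.02 * amount

def caf_fee(amount):
    # CAF donor advised gift: 4% with a £400 minimum fee, below £150k.
    return max(0.04 * amount, 400.0)

print(aac_fee(5_000))   # 250.0 (below £6,250 the £250 floor binds)
print(aac_fee(10_000))  # 400.0 (plain 4%)
print(caf_fee(10_000))  # 400.0 (CAF's higher floor still binds here)
```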
If you’re a larger donor (eg £20K+) it may make sense to set up a donor advised fund, which will often let you donate to worldwide non-profits.
Sure, but I think that human cognition tends to operate at a level of abstraction above the configuration of atoms in a 3D environment. Like “that is a chair” is a useful way to reason about an environment, while “that is a configuration of pixels that corresponds to a chair when projected at a certain angle in certain lighting conditions” must first be converted to “that is a chair” before anything useful can be done. Text just has a lot of useful preprocessing applied already and is far more compressed.
Strong +1, that argument didn’t make sense to me. Images are a fucking mess—they’re a grid of RGB pixels, of a 3D environment (interpreted through the lens of a camera) from a specific angle. Text is so clean and pretty in comparison, and has much richer meaning, and has a much more natural mapping to concepts we understand
Fwiw, this is not at all obvious to me, and I would weakly bet that larger models are harder to interpret (even beyond there just being more capabilities to study)
I would be very surprised if it had changed for early employees. I considered the donation matching part of my compensation package (it 2.5x’d the value of my equity: a 3:1 match on half my equity means the donated half becomes 4x its value, so 0.5 + 0.5 × 4 = 2.5), and it would be pretty norm-violating to retroactively reduce compensation.
I gather that they changed the donation matching program for future employees, but the 3:1 match still holds for prior employees, including all early employees (this change happened after I left, when Anthropic was maybe 50 people?)
I’m sad about the change, but I think that any goodwill due to believing the founders have pledged much of their equity to charity is reasonable and not invalidated by the change
Thanks a lot for sharing all this code and data, it seems super useful for external replication and follow-on work. It might be good to link this post from the GitHub README—I initially found the GitHub via the paper, but not this post, and I found the exposition in this post more helpful than the current README.
Learning Multi-Level Features with Matryoshka SAEs
That’s technically even more conditional, as the intervention (subtracting the parallel component) also depends on the residual stream. But yes, I think it’s reasonable to lump these together—orthogonalisation should also be fairly non-destructive unless the direction was present, while steering likely always has side effects.
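A minimal sketch of the “subtract the parallel component” intervention (pure numpy; the shapes and names are made up for illustration):

```python
import numpy as np

def remove_parallel_component(resid, direction):
    # Project out `direction` from a residual stream vector.
    # Approximately a no-op when the direction isn't present in `resid`.
    d = direction / np.linalg.norm(direction)
    return resid - (resid @ d) * d

rng = np.random.default_rng(0)
direction = rng.standard_normal(8)
resid = rng.standard_normal(8)

out = remove_parallel_component(resid, direction)
# The output has no component left along the direction...
assert abs(out @ direction) < 1e-10
# ...and re-applying the intervention changes nothing (it's idempotent).
assert np.allclose(remove_parallel_component(out, direction), out)
```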
Note that this is conditional SAE steering—if the latent doesn’t fire, it’s a no-op. So it’s not that surprising that it’s less damaging—a prompt is there on every input! It depends a lot on how well the encoder performs as a classifier, though.
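A minimal sketch of the conditional no-op property, with a made-up SAE (random weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32

# Made-up SAE weights standing in for a trained SAE.
W_enc = rng.standard_normal((d_model, d_sae))
b_enc = rng.standard_normal(d_sae)
W_dec = rng.standard_normal((d_sae, d_model))

def conditional_steer(resid, latent_idx, scale=5.0):
    # Steer along the latent's decoder direction only when the latent
    # fires (positive ReLU pre-activation); otherwise it's an exact no-op.
    act = resid @ W_enc[:, latent_idx] + b_enc[latent_idx]
    if act <= 0.0:
        return resid.copy()
    return resid + scale * W_dec[latent_idx]

# When the latent fires the output moves; when it doesn't, nothing changes.
firing = 100.0 * W_enc[:, 0]   # large positive pre-activation
silent = -100.0 * W_enc[:, 0]  # large negative pre-activation
assert not np.allclose(conditional_steer(firing, 0), firing)
assert np.allclose(conditional_steer(silent, 0), silent)
```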
When do you use escape?
It seems unlikely that OpenAI is truly following the “test the model” plan? They keep eg putting new experimental versions onto LMSYS, presumably mostly due to different post-training, and it seems pretty expensive to be doing all the DC evals again on each new version (and I think it’s pretty reasonable to assume that a bit of further post-training hasn’t made things much more dangerous).
Interesting. Does anyone know what the counterparty risk is like here? Eg, am I gambling on the ETF continuing to be provided, the option market maker not going bust, the relevant exchange continuing to exist, etc. (the first and third generally seem like reasonable bets, but in a short timelines world everything is high variance...)