Neel Nanda

Karma: 11,942

Neel Nanda Aug 1, 2025, 9:18 AM
48 points
9
in reply to: Buck’s comment on: Buck’s Shortform
Hype! A 15 min brainstorm
What would you work on if not control? Bonus points for sketching out the next 5+ new research agendas you would pursue, in priority order, assuming each previous one stopped being neglected
What is the field of ai safety messing up? Bonus: For (field) in {AI safety fields}: What are researchers in $field wrong about/making poor decisions about, in a way that significantly limits their impact?
What are you most unhappy about with how the control field has grown and the other work happening elsewhere?
What are some common beliefs by AI safety researchers about their domains of expertise that you disagree with (pick your favourite domain)?
What beliefs inside Constellation have not percolated into the wider safety community but really should?
What have you changed your mind about in the last 12 months?
You say that you don’t think control will work indefinitely and that’s sufficiently capable models will break it. Can you make that more concrete? What kind of early warning signs could we observe? Will we know when we reach models capable enough that we can no longer trust control?
If you were in charge of Anthropic what would you do?
If you were David Sacks, what would you do?
If you had had a hundred cracked mats scholars and $10,000 of compute each, what would you have them do?
If I gave you billions of dollars and 100 top researchers at a Frontier lab, what would you do?
I’m concerned that the safety community spends way too much energy on more meta things like control, evals, interpretability, etc. And has somewhat lost sight of solving the damn alignment problem. Takes? if you agree what do you think someone who wants to solve the alignment problem should actually be doing about it right now?
What are examples of the safety questions that you think are important, and can likely be studied on models in the next 2 years but not on today’s publicly available frontier models? (0.5? 1? 5? Until the 6 months before AGI?)
If you were wrong about a belief that you are currently over 50% on to do with safety, what do you predict it is and why?
What model organisms would you be most excited to see people produce? (Ditto any other the open source work)
What are some mistakes you predict many listeners are making? Bonus points for mistakes you think I personally am making
What is the most positive true thing you have to say about the field of ambitious mechanistic interpretability
What does redwood look for when hiring people, especially junior researchers?
What kind of mid-career professionals would you be most excited to see switch to control. What about other areas of air safety?
What should AGI lab safety researchers be doing differently to have a greater impact? Feel free to give a different answer per lab

Neel Nanda Jul 31, 2025, 1:22 AM
15 points
2
on: Optimizing The Final Output Can Obfuscate CoT (Research Note)
Interesting results. Thanks for doing the investigation. I would love to see some examples of the chain of thoughts before and after optimisation and whether it’s basically the same thing but reworded or very semantically different. Did you notice anything interesting when reading them?

My intuition at least for the single word case is that saying a word increases the probability the model will say it in future. Therefore, if it wants the probability it says that the future to be low, it’s incentivised not to say it anywhere

The llm judge results seem notably more interesting and I’d love to better understand what’s up there

Neel Nanda Jul 30, 2025, 10:22 AM
5 points
3
in reply to: the gears to ascension’s comment on: About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong
Yeah, I feel much better about malicious use specific ones. Agreed that HLE is more generic and this is much worse

Neel Nanda Jul 30, 2025, 2:46 AM
14 points
9
in reply to: the gears to ascension’s comment on: About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong
While I definitely agree there are negative externalities here, I also think there are extremely positive externalities from key decision makers being better informed, especially knowing how close we are to certain capabilities like AI enabled bioterrorism or cybercrime, or automated R&D/an intelligence explosion, etc. Information is great, and generally I think has a fairly positive effect even if the decision maker is not highly competent. Bioterrorism and cybercrime at least are not things I’m concerned about AGI researchers hill climbing on, automated R&D is much dicier

Private benchmarks seem solid here too

Neel Nanda MATS Applications Open (Due Aug 29)

Neel NandaJul 30, 2025, 12:55 AM

21 points

0 comments7 min readLW link

(tinyurl.com)

Neel Nanda Jul 27, 2025, 7:17 PM
19 points
11
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform

Retreats make things feel much more real to people and result in people being more agentic and approaching their choices more effectively.

Strongly agreed on this point, it’s pretty hard to substitute for the effect of being immersed in a social environment like that

Neel Nanda Jul 25, 2025, 11:45 AM
7 points
0
in reply to: Jan Betley’s comment on: Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
Oh, I said try to ensure for a reason. I do think it’s somewhat tractable though

Neel Nanda Jul 25, 2025, 11:20 AM
53 points
4
on: Neel Nanda’s Shortform
To anyone currently going through NeurIPS rebuttals fun for the first time, some advice:
Firstly, if you’re feeling down about reviews, remember that peer review has been officially shown to be a ridiculous random number generator in an RCT—half of spotlight papers are rejected by another review committee! Don’t tie your self-worth to whether the roulette wheel landed on black or red. If their critiques don’t make sense, they often don’t (and were plausibly written by an LLM). And if they do make sense (and remember to control for your defensiveness), then this is great—you have valuable feedback that can improve the paper!
1. Read this guide to get a sense of what rebuttals are about
  1. Generally, be nice and polite, even if your reviewers are really annoying
2. You have three goals here:
  1. Improving the paper! Often reviewers raise some good and useful points, and ultimately one of the key goals is doing good research and communicating it well to the world
  2. Convince reviewers to like you, so they increase their score
  3. For unreasonable reviewers who dislike you, your goal is to convince the area chair (and other reviewers) that this person is wrong and unreasonable. This means you still should write a careful and well-argued rebuttal, even to onbnoxious reviewers, but have a different target audience in mind.
    Meta: The way the process works is that the area chair makes the final decision, and has a lot of discretion to overrule reviewers, but by default if lazy will go by the average reviewer score. You want to either increase average reviewer score, or convince the area chair to ignore the bad ones. Convincing a reviewer is just a means to the end.
3. One of the key things to do in the rebuttal is to improve the paper. Realistically, you can’t upload a new version, so your actual goal is to convince people that you have improved the paper. It is an adversarial setting and people will generally assume you are lying if you just give empty words, especially saying you will do X by the camera ready. So the key question to ask is how can you show proof of work? Running experiments and reporting the results is one good way (or even just saying that you’ve done them).
4. A common piece of feedback is “this is badly written”.
  1. This common because it’s often true! Writing papers is hard (some advice).
  2. If you receive this feedback, try to fix it (eg give an LLM the reviews and your paper, and maybe my post, and ask it to give concrete feedback on how to improve things, along with quotes). This will improve the paper even if you don’t get in
  3. One difficulty is that even if you improve the writing a bunch, this is hard to convince anyone of in the rebuttal, since they’re normally not willing to re-read in detail (and NeurIPS doesn’t even let you re upload).
  4. My best strategy is to make a long changelog of what you improved, to signal high effort, and put it in a top-level comment
  5. If the reviewer complained about a specific paragraph or section, copy in the reworded version of that
  6. It often helps to add an appendix with a glossary for key terms, ideally both intuitive and technical definitions
5. I recommend the following process:
  1. Copy all reviews to a google doc.
  2. Go through and comment on each complaint in each review, sorting them into misunderstandings, disagreements with you, presentation issues, and technical issues—either do this while on a call, or async
  3. Brainstorm how to address each complaint—prioritise the important ones
  4. Write a bullet point outline, and try to get feedback
  5. Write it up nicely and send.
6. You typically want to have a comment per review, and a ~~top level comment covering critiques from multiple reviewers.~~ For some insane reason NeurIPS 2025 removed your ability to make a top level comment—copy and paste between reviewers I guess?
  1. Picture the area chair as your audience for the top level comment. You want to begin with a paragraph about the strengths of your paper, as noted by reviewers, supported by reviewer quotes—imagine you’re writing something the area chair can copy and paste into a meta-review about accepting you. Reviewer quotes are key for any positive claims as no one will trust you to be honest.
  2. If one reviewer hates you, the top-level comment is a good opportunity to try to discredit them by emphasising how other reviewers disagree, as politely as possible. For example, we appreciated the constructive critique from bad reviewer that X, and have changed Y to fix it. But we are glad to see that good reviewers A and B thought Z followed by quotes supporting Z from the good reviewers, where Z contradicts X as much as possible.
7. Some technical complaints are best addressed by doing new experiments—you have 1-2 weeks to do this but should ask yourself how long it’ll take and whether this is the best use of time. Time is constrained and you want to maximise returns per unit time, and new experiments often take much longer than writing or conceptual rebuttals—prioritise these carefully.

Neel Nanda Jul 25, 2025, 11:06 AM
8 points
2
in reply to: the gears to ascension’s comment on: nikola’s Shortform
Seems very false at GDM and Anthropic, and fairly false at OpenAI—I don’t know enough about other labs for strong takes. I expect it to depend a lot on how agentic the team is, and how good they are at finding ways to have a safety impact that aren’t resisted by a non-safety-conscious labs, vs are just trying to follow the incentive gradient

But eg, there’s a clear commercial incentive to not have an AI causing bioterrorism, which aligns incentives nicely

Neel Nanda Jul 25, 2025, 10:55 AM
2 points
0
in reply to: Jan Betley’s comment on: Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
I could get on board with “lets try to deal at least with the easy problems, and ~~hope~~ ensure the hard ones don’t happen”?

Reasoning-Finetuning Repurposes Latent Representations in Base Models

Jake Ward, lccqqqqq and Neel Nanda

Jul 23, 2025, 4:18 PM

34 points

1 comment2 min readLW link

(arxiv.org)

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan and Neel Nanda

Jul 23, 2025, 2:57 PM

74 points

3 comments5 min readLW link

Neel Nanda Jul 23, 2025, 8:17 AM
2 points
0
in reply to: TristanTrim’s comment on: How useful is mechanistic interpretability?
I’d disagree there. The idea here is that you tell which research ideas do and do not work by seeing if they let you do anything new. What failure looks like is totally different, it’s about TRAINING systems and the objectives you train for, not about iterating on finding the right interpretability techniques

Unfaithful chain-of-thought as nudged reasoning

Paul Bogdan, Uzay Macar, Arthur Conmy and Neel Nanda

Jul 22, 2025, 10:35 PM

49 points

3 comments10 min readLW link

Neel Nanda Jul 20, 2025, 4:33 PM
5 points
0
in reply to: Kabir Kumar’s comment on: OpenAI Claims IMO Gold Medal
No, I’m not sure where you think I claimed that

Neel Nanda Jul 17, 2025, 10:19 AM
LW: 4 AF: 3
2
AF
in reply to: Zach Stein-Perlman’s comment on: Zach Stein-Perlman’s Shortform
I would argue that all fixing research is accelerated by having found examples, because it gives you better feedback on whether you’ve found or made progress towards fixes, by studying what happened on your examples. (So long as you are careful not to overfit and just fit that example or something). I wouldn’t confidently argue that it can more directly help by eg helping you find the root cause, though things like “training data attribution to the problematic data, remove it, and start fine tuning again” might just work

Neel Nanda Jul 17, 2025, 10:15 AM
LW: 3 AF: 2
1
AF
in reply to: habryka’s comment on: Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
Ah, thanks, I think I understand your position better

Ultimately, even if an AI were to re-discover the value of convergent instrumental goal each time it gets instantiated into a new session/context, that would still get you approximately the same risk model.

This is a crux for me. I think that if we can control the goals the AI has in each session or context, then avoiding the worst kind of long-term instrumental goals becomes pretty tractable, because we can give the AI site constraints of our choice and tell it that the aside constraints take precedence over following the main goal. For example, we could make the system prompt say the absolutely most important thing is that you never do anything where if your creators were aware of it and fully informed about the intent and consequences, they would disprove within that constraints please follow the following user instructions.

To me, much of the difficulty of AI alignment comes from things like not knowing how to give it specific goals, let alone specific side constraints. If we can just literally enter these in natural language and have them be obeyed with a bit of prompt engineering fiddling that seems significantly safer to me. And the better the AI the easier I expected to be to specify the goals and natural language in ways that will be correctly understood

Neel Nanda Jul 17, 2025, 2:11 AM
LW: 18 AF: 11
8
AF
in reply to: Zach Stein-Perlman’s comment on: Zach Stein-Perlman’s Shortform

I don’t see how detecting [alignment faking, lying to users, sandbagging, etc.] helps much for fixing them, so I don’t buy that you can fix hard alignment issues by bouncing off alignment audits.

Strong disagree. I think that having real empirical examples of a problem is incredibly useful—you can test solutions and see if they go away! You can clarify your understanding of the problem, and get a clearer sense of upstream causes. Etc.

This doesn’t mean it’s sufficient, or that it won’t be too late, but I think you should put much higher probability in a lab solving a problem per unit time when they have good case studies.

It’s the difference between solving instruction following when you have GPT3 to try instruction tuning on, vs only having GPT2 Small

Neel Nanda Jul 16, 2025, 2:04 AM
2 points
3
in reply to: Brian Slesinsky’s comment on: Narrow Misalignment is Hard, Emergent Misalignment is Easy
I doubt it, models are probably good at that kind of problem already

Neel Nanda Jul 15, 2025, 7:33 PM
3 points
0
in reply to: habryka’s comment on: Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
Well no, but that’s because the model doesn’t know the intent of the user, and it’s in a very weird situation. If you look at the chains of thought, sometimes the model says ah. I’m being tested right now. Clearly the real test is whether identify the shutdown script and disable it.

Neel Nanda

Neel Nanda MATS Ap­pli­ca­tions Open (Due Aug 29)

Rea­son­ing-Fine­tun­ing Repur­poses La­tent Rep­re­sen­ta­tions in Base Models

Steer­ing Out-of-Distri­bu­tion Gen­er­al­iza­tion with Con­cept Abla­tion Fine-Tuning

Un­faith­ful chain-of-thought as nudged reasoning

Neel Nanda MATS Applications Open (Due Aug 29)

Reasoning-Finetuning Repurposes Latent Representations in Base Models

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Unfaithful chain-of-thought as nudged reasoning