Former Software Engineer at Microsoft who may focus on the alignment problem for the rest of his life (please bet on the prediction market here).
Sheikh Abdur Raheem Ali
He might want to consider taking a look at https://manifund.org/projects/ozempic-for-sleep-proof-of-concept-research-for-safely-reducing-sleep
True. Lower haircuts allow for more leverage, typically defined as the diagonal elements of the hat matrix. Does wearing a fedora impact the degree of hair chaos?
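For anyone who only caught the finance half of the pun, a minimal numpy sketch of statistical leverage (the design matrix here is made-up toy data, purely for illustration):

```python
import numpy as np

# Leverage scores are the diagonal of the hat matrix H = X (X^T X)^{-1} X^T.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))              # toy design matrix: 20 observations, 3 predictors
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
leverage = np.diag(H)                     # high-leverage points pull the fit toward themselves
print(leverage.round(3), leverage.sum())  # the diagonal sums to the number of predictors (3)
```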
Hey Neel, I’ve heard you make similar remarks informally during Q&A sessions at past in-person panels and events, and it’s great that you’ve written them up so they’re available in a nuanced form to a broader audience. I agree with the points you’ve made, but I have a slightly different perspective on how they connect to the example of people asking for your strategic takes specifically, which I’ll share below (without presumption).
TL;DR: “Good strategic takes are hard to measure, but status is easy to recognize.”
I. Executive Summary
People aren’t necessarily confusing research prowess with strategic insight. Rather, they recognize you as having achieved elite social standing within the field of AI more broadly and want:
Access to perceived insider knowledge
Connection to high-status individuals
The latest thinking from those “in the room”
II. Always Use The Best Introduction
Before reading this post, I believed that the median person asking these questions was motivated by your impressive academic performance during your undergraduate studies, something that can be (over)simplified to “wow, this guy studied pure math at Cambridge and ranked top of his class, he’s one of the smartest people in the world, and smart people are correct about lots of things, he might have a correct answer to this question I have!”. I’m quite embarrassed to admit that this is pretty much what was going through my head when I attended a session you were holding during EAG last year, and I wouldn’t be surprised if others there were thinking that too.
Along those lines, I recall reaching out to one of your former mentees for a 1:1 while thinking, “wow, this guy studied computer science at Cambridge and ranked top of his class, he’s one of the smartest people in the world, and smart people are correct about lots of things!”. I also took the time to read his dissertation and found it interesting, but that first impression mattered a lot more than it should have. An analogy: when people select a model for a task, they want the best model for that task. But when a model takes the top spot on a leaderboard, where test scores are easy to measure, that tends to mess with human psychology, which irrationally pattern-matches and assumes generalization across every possible task.
III. My Key Takeaway
My key takeaway was that although this winner-take-all dynamic may have been one factor, your model assigns more weight to the work you’ve done after graduating and pioneering the field of mechinterp.
IV. Credentials vs. Accomplishments
To be clear, founding mechinterp is a greater accomplishment than any formal credential. But even though teams of researchers at frontier labs are working on this agenda, it’s not mainstream yet (just take a look at mechinterp.com), whereas the handle of “math/CS genius” is generic enough as a concept to be legible to the average person. The arguments in your post about research being an empirical science requiring skills not especially relevant to strategy are locally valid, but those points are the furthest thing from the minds of people waiting in line at conferences to ask what your p(doom) is.
V. The Tyranny of the Marginal Spice Jar
Often the demands placed on us by our environment play an instrumental role in shaping our skillset, because we adapt to the pressures we face. I’m thankfully not in a leadership position where the role calls for executive project-management decisions requiring a solid understanding of the broader field and industry. I’m also grateful that I’m not a public figure with a reputation to maintain, whose every move is open to scrutiny and close examination. I also understand that blog posts aren’t meant to be epistemically bulletproof.
I think it’s true that when the people you speak with most (e.g. work colleagues or MATS scholars) ask for your thoughts, their respect is based on the merits of the technical research you’ve published. And in general, when anyone publishes great AI research, that does inspire interest in their AI takes.
VI. Unnecessarily Skippable Digression Into Social Bubbles and Selection Effects
Your social circle is heavily filtered by a competitive application process which strongly selects for predicted ability to do quality research. This can distort intuitions about the prevalence of traits that are not as well represented in the general population. For example, authoring code or research papers requires, to some extent, that your brain is adapted to processing text, the implications of which I haven’t seen discussed in depth anywhere on LessWrong. If someone expresses a strong preference for reading over watching a video when both options are available, it’s almost like a secret handshake; so many cracked engineers have told me this that it’s become a green flag. In this world, entertainment culture and information transfer happen through books, web novels, articles, etc.
There’s an entirely separate world occupied by people with the opposite preference, i.e. wanting to watch a video rather than read text when both options are available. An example secret handshake for that world is when my Uber driver tells me that they’re cutting down on Instagram. I admit this is a shallow heuristic, but it’s become a red flag I watch out for, indicating a potential vulnerability to predatory social media dark patterns or television binge-watching. It’s not an issue of self-control: people in the first group need to apply cognitive effort to pick things up from videos, but might have difficulty setting aside an engaging fantasy web serial. Most treatments of this topic I’ve seen address the second group, which feels alienating to me, as if there’s an ongoing dimorphism between producers and users of consumer software.
I’m typically skeptical of “high-IQ bubble”-type arguments since they tend to prove too much, so I’ll make a more specific point. I agree that within these groups, conflation between perceived research skill and strategic skill does occur. My (minor) contention is that I don’t think this particular mistake is the one being made by the average person asking a speaker about their strategic takes at the end of a talk.
VII. Main Argument: Research Takes?! What Research Takes?!!
Like, these sorts of questions aren’t just being fielded by researchers in the field, you know. Why do people ask random celebrities and movie stars for their takes on geopolitics? Are they genuinely conflating acting skill with strategic skill? What about pro athletes? Is physical skill being conflated with strategic skill too? Do you believe that if a rich heiress with no research background were giving a talk about AI risk, no one in the audience would be interested in her big-picture takes? It makes no sense. Other comments have pointed this out already, so I’m sorry about adding another rant to the pile, but there exists a simpler explanation which does a better job of tracking reality!
The missing ingredient here is clout.
Various essays go into the relationship between competence and power, but what you’re describing as “research skill” could be renamed expertise. These folks aren’t mistaking you for someone high in “strategic skill”; they are making the correct inference that you are an elite. They want in on the latest gossip behind the waitlist at the exclusive private social where frontier lab employees are joking around about what name they’ll use for tomorrow’s new model. They’re holding their breath waiting for invention and hyperstition and self-fulfilling prophecy. They want to know the story of how Elon Musk will save the U.S. AISI and call it xAISI.
VIII. Concluding Apologetic Remarks
I’m not sure if this was an aim of the above post, but it’s an understandable impulse to want to distance oneself from scenes where it’s easier to find elites (good strategic takes) than experts (good research takes), because there can be a certain culture attached which often fails to act in a way that consistently upholds virtuous truth-seeking.
Overall, I think that taking a public stance can warp the landscape being described in ways that are hard to predict, and I appreciate your approach here compared to both the influencer extreme of “my strategic takes are all great, the best, and bigly” and the corporate extreme of “oh there are so many great takes, how could I pick one, great takes, thanks all”. The position of “yeah I’ve got takes but chill, they’re mid” is a reasonable midpoint, and it would be nice to have people defer more intelligently in general.
Relevant quote:
However, I don’t know how much alignment faking results will generalize. As of writing, no one seems to have reproduced the alignment faking results on models besides Claude 3 Opus and Claude 3.5 Sonnet. Even Claude 3.7 Sonnet doesn’t really alignment fake: “Claude 3.7 Sonnet showed marked improvement, with alignment faking dropping to <1% of instances and a reduced compliance gap of only 5%.”
Some mats scholars (Abhay Sheshadri and John Hughes) observed minimal or no alignment faking from open-source models like llama-3.1-70b and llama-3.1-405b. However, preliminary results suggest gpt-4-o seems to alignment fake more often when finetuned on content from Evan Hubinger’s blog posts and papers about “mesaoptimizers.”
From: Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models
I’m not trying to say that any of this applies in your case per se. But when someone in a leadership position hires a personal assistant, their goal may not necessarily be to increase their short term happiness, even if this is a side effect. The main benefit is to reduce load on their team.
If there isn’t a clear owner for ops-adjacent work, people in high-performance environments will pick up ad-hoc tasks that need to get done, sometimes without clearly reporting this to anyone, which is often societally inefficient relative to their skillset and a bad allocation of bandwidth given the organization’s priorities.
A great personal assistant wouldn’t just help you get more done and focus on what matters; they would also take on the various tasks that currently spill over onto whoever happens to be paying attention to your needs, ensuring those needs are met without you having to notice or explicitly delegate.
Is data poisoning less effective on models which alignment fake?
I asked two Anthropic employees some version of this question; neither felt the two were related, saying that larger models are more sample-efficient and they expect this effect to dominate.
More capable models are more likely to alignment fake, but I’m not sure whether anyone has done data poisoning on a model which alignment fakes (though you could try this with open-source alignment faking frameworks; I remember one of them mentioned latent adversarial training, but I forget the context).
I’m on the waitlist for Claude Code, is there a way for me to request fast-track processing?
What exactly is a punishment?
For anyone else who stumbles across this thread: when modifying the superwhisper toggle settings, hit spacebar then control, instead of control then spacebar. Also, it turns out that Control + Space is the default shortcut for switching keyboard input sources (at least on macOS Sequoia 15.3.1); make sure to disable that by going to System Settings → Keyboard → Keyboard Shortcuts → Input Sources.
What keybinding do you set for it?
IIRC, ⌥+space conflicts with the default for ChatGPT and Alfred.app (which I use for clipboard history).
This is also because of Jevons paradox: as the cost of running an experiment falls with experience, the number of experiments run tends to rise.
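A minimal sketch of that dynamic; the demand curve and elasticity value below are assumptions picked for illustration, not measurements:

```python
# If the number of experiments people run scales as cost**(-elasticity), then for
# elasticity > 1 total spend on experiments *rises* as per-experiment cost falls.

def experiments_run(cost: float, elasticity: float = 1.5, k: float = 100.0) -> float:
    """Assumed demand curve for experiments as a function of per-experiment cost."""
    return k * cost ** (-elasticity)

for cost in (10.0, 5.0, 2.5):
    n = experiments_run(cost)
    print(f"cost/experiment={cost:5.1f}  experiments={n:6.1f}  total spend={n * cost:7.1f}")
```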
I’ve retracted the comment as no longer endorsed, thank you.
What was the writing process like for this piece?
Thanks for sharing your notes Daniel!
I’m unable to open the google docs file in the third link.
I’ve tried speaking with a few teams doing AI safety work, including:
• assistant professor leading an alignment research group at a top university who is starting a new AI safety org
• Anthropic independent contractor who has coauthored papers with the Alignment Science team
• senior manager at NVIDIA working on LLM safety (NeMo-Aligner/NeMo-Guardrails)
• leader of a lab doing interoperability between EU/Canada AI standards
• AI policy fellow at the US Senate working on biotech strategies
• executive director of an AI safety coworking space who has been running weekly meetups for ~2.5 years
• startup founder in stealth who asked not to share details with anyone outside CAISI
• chemistry olympiad gold medalist working on a dangerous capabilities evals project for o3
• MATS alumnus working on jailbreak mitigation at an AI safety & security org
• AI safety research lead running a mechinterp reading group and interning at EleutherAI
Some random brief thoughts:
• CAISI’s focus seems to be on stuff other than x-risks (i.e., misinformation, healthcare, privacy).
• I’m afraid of being too unfiltered and causing offence.
• Some of the statements made in the interviews are bizarrely devoid of content, such as: “AI safety work is not only a necessity to protect our social advances, but also essential for AI itself to remain a meaningful technology.”
• Others seem to be false as stated, such as:
“our research on privacy-preserving AI led us to research machine unlearning — how to remove data from AI systems — which is now an essential consideration for deploying large-scale AI systems like chatbots.”
• (I think a lot of unlearning research is bullshit, but besides that, is anyone deploying large models doing unlearning?)
• The UK AISI research agendas seemed a lot more coherent, with better-developed proposals and theories of impact.
• They’re only recruiting for 3 positions for a research council that meets once a month?
• CAISI’s initial funding of CAD 27m is ~15% of the UK AISI’s GBP 100m initial funding, but more than the U.S. AISI’s initial funding of USD 10m.
• Another source says CAD 50m, but that’s distributed over 5 years, compared to a $2.4b budget for AI in general, so about 2% of the AI budget goes to safety? (Rough arithmetic sketch at the end of this list.)
• I was looking for scientific advancements which would be relevant at the national scale. I read through every page of Anthropic/Redwood’s alignment faking paper, which is considered the best empirical alignment research paper of 2024, but it was a firehose of info and I don’t have clear recommendations that can be put into a slide deck.
• Instead of learning at a shallow level about what other people were doing, it might’ve been more beneficial to focus on my own research questions or practice project-relevant skills.
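Rough arithmetic behind the funding comparisons above; the exchange rates are my assumptions (approximate), so treat the percentages as ballpark figures:

```python
# Sanity-checking the funding ratios; exchange rates are assumed, results are approximate.
CAD_TO_GBP = 0.55   # assumed CAD→GBP rate
CAD_TO_USD = 0.70   # assumed CAD→USD rate

caisi_cad   = 27e6    # CAISI initial funding, CAD
uk_aisi_gbp = 100e6   # UK AISI initial funding, GBP
us_aisi_usd = 10e6    # US AISI initial funding, USD

print(caisi_cad * CAD_TO_GBP / uk_aisi_gbp)   # ≈ 0.15 → ~15% of the UK AISI's funding
print(caisi_cad * CAD_TO_USD / us_aisi_usd)   # ≈ 1.9  → roughly double the US AISI's funding
print(50e6 / 2.4e9)                           # ≈ 0.02 → ~2%, assuming both figures are in the same currency
```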
Wow, point #1 resulted in a big update for me. I had never thought about it that way, but it makes a lot of sense. Kudos!
Ilya Sutskever had two armed bodyguards with him at NeurIPS
I don’t understand how Ilya hiring personal security counts as evidence, especially at large events like a conference. Famous people often attract unwelcome attention, and having professional protection close by can help deescalate or deter random acts of violence; it is a worthwhile investment in safety if you can afford it. I see it as a very normal thing to do. Ilya would have been vulnerable to potential assassination attempts even during his tenure at OpenAI.
(responding only to the first point)
It is possible to do experiments more efficiently in a lab because you have privileged access to top researchers whose bandwidth is otherwise very constrained. If you ask for help in Slack, the quality of responses tends to be comparable to teams outside labs, but responses often come faster because the hiring process selects strongly for speed. It can be hard to coordinate busy schedules, but once you have a collaborator’s attention, what they say will make sense and be helpful. People at labs tend to be unusually good communicators, so it is easier to understand what they mean during meetings, whiteboard sessions, or 1:1s; this is unfortunately not universal amongst engineers. It’s also rarer for projects to be managed in an unfocused way that fizzles out without adding value, and feedback usually leads to improvement rather than deadlock over disagreements.
Also, lab culture in general benefits from high levels of executive function. For instance, when a teammate says they spent an hour working on a document, you can be confident that progress has been made, even if not all changes pass review. It’s less likely that they suffered from writer’s block or got distracted by a lower-priority task. Some of these factors also apply at well-run startups, but startups don’t have the same branding, and it’d be difficult for a startup to e.g. line up four reviewers of this calibre: https://assets.anthropic.com/m/24c8d0a3a7d0a1f1/original/Alignment-Faking-in-Large-Language-Models-reviews.pdf
I agree that (without loss of generality) the internal RL code isn’t going to blow open-source repos out of the water, and if you want to iterate on a figure or plot, that’s the same amount of work no matter where you are, even if you have experienced people helping you make better decisions. But you’re missing that lab infra doesn’t just let you run bigger experiments; it also lets you run more small experiments, because compute per researcher at labs is quite high by non-lab standards. When I was at Microsoft, it wasn’t uncommon for some teams to have the equivalent of roughly 2 V100s, which is less than what students can rent from Vast.ai or RunPod for personal experiments.
Thank you, this is great work. I filled out the external researcher interest form but was not selected for Team 4.
I’m not sure that Team 4 was on par with what professional jailbreakers could achieve in this setting, and I look forward to follow-up experiments. These are bottlenecked by the absence of an open-source implementation of auditing games. I went over the paper with a colleague; unfortunately we don’t have the bandwidth to replicate this work ourselves. Is there a way to sign up to be notified once a playable auditing game is available?
I’d also be eager to help beta-test pre-release versions. Let me know if someone is planning to put this up on the web as a product which allows the general public to play crowdsourced auditing games.