Rohin Shah
Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
On reflection, it’s not actually about which position is more common. My real objection is that imo it was pretty obvious that something along these lines would be the crux between you and Neel (and the fact that it is a common position is part of why I think it was obvious).
Inasmuch as you are actually trying to have a conversation with Neel or address Neel’s argument on its merits, it would be good to be clear that this is the crux. I guess perhaps you might just not care about that and are instead trying to influence readers without engaging with the OP’s point of view, in which case fair enough. Personally I would find that distasteful / not in keeping with my norms around collective epistemics, but I do admit it’s within LW norms.
(Incidentally, I feel like you still aren’t quite pinning down your position—depending on what you mean by “reliably” I would probably agree with “marginalist approaches don’t reliably improve things”. I’d also agree with “X doesn’t reliably improve things” for almost any interesting value of X.)
What exactly do you mean by ambitious mech interp, and what does it enable? You focus on debugging here, but you didn’t title the post “an ambitious vision for debugging”, and indeed I think a vision for debugging would look quite different.
For example, you might say that the goal is to have “full human understanding” of the AI system, such that some specific human can answer arbitrary questions about the AI system (without just delegating to some other system). To this I’d reply that this seems like an unattainable goal; reality is very detailed, AIs inherit a lot of that detail, a human can’t contain all of it.
Maybe you’d say “actually, the human just has to be able to answer any specific question given a lot of time to do so”, so that the human doesn’t have to contain all the detail of the AI, and can just load in the relevant detail for a given question. To do this perfectly, you still need to contain the detail of the AI, because you need to argue that there’s no hidden structure anywhere in the AI that invalidates your answer. So I still think this is an unattainable goal.
Maybe you’d then say “okay fine, but come on, surely via decent heuristic arguments, the human’s answer can get way more robust than via any of the pragmatic approaches, even if you don’t get something like a proof”. I used to be more optimistic about this but things like self-repair and negative heads make it hard in practice, not just in theory. Perhaps more fundamentally, if you’ve retreated this far back, it’s unclear to me why we’re calling this “ambitious mech interp” rather than “pragmatic interp”.
To be clear, I like most of the agendas in AMI and definitely want them to be a part of the overall portfolio, since they seem especially likely to provide new affordances. I also think many of the directions are more future-proof (ie more likely to generalize to future very different AI systems). So it’s quite plausible that we don’t disagree much on what actions to take. I mostly just dislike gesturing at “it would be so good if we had <probably impossible thing> let’s try to make it happen”.
Fair, I’ve edited the comment with a pointer. It still seems to me to be a pretty direct disagreement with “we can substantially reduce risk via [engineering-type / category 2] approaches”.
My claim is “while it certainly could be net negative (as is also the case for ~any action including e.g. donating to AMF), in aggregate it is substantially positive expected risk reduction”.
Your claim in opposition seems to be “who knows what the sign is, we should treat it as an expected zero risk reduction”.
Though possibly you are saying “it’s bad to take actions that have a chance of backfiring, we should focus much more on robustly positive things” (because something something virtue ethics?), in which case I think we have a disagreement on decision theory instead.
I still want to claim that in either case, my position is much more common (among the readership here), except inasmuch as they disagree because they think alignment is very hard and that’s why there’s expected zero (or negative) risk reduction. And so I wish you’d flag when your claims depend on these takes (though I realize it is often hard to notice when that is the case).
Makes sense, I still endorse my original comment in light of this answer (as I already expected something like this was your view). Like, I would now say
Imo the vast, vast majority of progress in the world happens via “engineering-type / category 2” approaches, so if you do think you can win via “engineering-type / category 2” approaches you should generally bias towards them
while also noting that the way we are using the phrase “engineering-type” here includes a really large amount of what most people would call “science” (e.g. it includes tons of academic work), so it is important when evaluating this claim to interpret the words “engineering” and “science” in context rather than via their usual connotations.
2) the AI safety community doesn’t try to solve alignment for smarter than human AI systems
I assume you’re referring to “whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment”.
Imo, chain of thought monitoring, AI control, amplified oversight, MONA, reasoning model interpretability, etc, are all things that could make the difference between “x-catastrophe via misalignment” and “no x-catastrophe via misalignment”, so I’d say that lots of our work could “solve misalignment”, though not necessarily in a way where we can know that we’ve solved misalignment in advance.
Based on Richard’s previous writing (e.g. 1, 2) I expect he sees this sort of stuff as not particularly interesting alignment research / doesn’t really help, so I jumped ahead in the conversation to that disagreement.
even if no one breakthrough or discovery “solves alignment”, a general frame of “let’s find principled approaches” is often more generative than “let’s find the cheapest 80/20 approach”
Sure, I broadly agree with this, and I think Neel would too. I don’t see Neel’s post as disagreeing with it, and I don’t think the list of examples that Richard gave is well described as “let’s find the cheapest 80/20 approach”.
I wish when you wrote these comments you acknowledged that some people just actually think that we can substantially reduce risk via what you call “marginalist” approaches. Not everyone agrees that you have to deeply understand intelligence from first principles else everyone dies. (EDIT: See Richard’s clarification downthread.) Depending on how you choose your reference class, I’d guess most people disagree with that.
Imo the vast, vast majority of progress in the world happens via “marginalist” approaches, so if you do think you can win via “marginalist” approaches you should generally bias towards them.
Ah yeah, I think with that one the audiences were “researchers heavily involved in AGI Safety” (LessWrong) and “ML researchers with some interest in reward hacking / safety” (Medium blog)
We mostly just post more informal stuff directly to LessWrong / Alignment Forum (see e.g. our interp updates).
Having a separate website doesn’t seem that useful to readers. I generally see the value proposition of a separate website as attaching the company’s branding to the post, which helps the company build a better reputation. It can also help boost the reach of an individual piece of research, but this is a symmetric weapon, and so applying it to informal research seems like a cost to me, not a benefit. Is there some other value you see?
(Incidentally, I would not say that our blog is targeted towards laypeople. I would say that it’s targeted towards researchers in the safety community who have a small amount of time to spend and so aren’t going to read a full paper. E.g. this post spends a single sentence explaining what scheming is and then goes on to discuss research about it; that would absolutely not work in a piece targeting laypeople.)
AI capabilities progress is smooth, sure, but it’s a smooth exponential.
In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can’t I just take the log of it and call that “AI capabilities” and then say it is a smooth linear increase?
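As a minimal illustration of the units point (hypothetical numbers of my own, not any real benchmark):

```python
import numpy as np

# A hypothetical "capability" series that doubles every year in some unit.
years = np.arange(2019, 2026)
capability = 2.0 ** (years - 2019)

# Relabel the axis by taking the log: the very same series is now a smooth line.
log_capability = np.log2(capability)  # 0, 1, 2, ... per year

print(capability)      # looks like a smooth exponential
print(log_capability)  # looks like a smooth linear increase
```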
That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay “so small” as you say.
It seems like the load-bearing thing for you is that the gap between models gets larger, so let’s try to operationalize what a “gap” might be.
We could consider the expected probability that AI_{N+1} would beat AI_N on a prompt (in expectation over a wide variety of prompts). I think this is close-to-equivalent to a constant gap in ELO score on LMArena.[1] Then “gap increases” would roughly mean that the gap in ELO scores on LMArena between subsequent model releases would be increasing. I don’t follow LMArena much but my sense is that LMArena top scores have been increasing relatively linearly w.r.t time and sublinearly w.r.t model releases (just because model releases have become more frequent). In either case I don’t think this supports an “increasing gap” argument.
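For concreteness, the standard Elo win-probability formula is what makes “constant head-to-head win rate between successive models” roughly equivalent to “constant rating gap”; here is a sketch with made-up gaps, not LMArena’s actual numbers:

```python
def elo_win_prob(rating_gap: float) -> float:
    """Probability that the higher-rated model wins, given an Elo rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

# A fixed gap between AI_N and AI_{N+1} implies a fixed head-to-head win rate:
print(elo_win_prob(100))  # ~0.64
print(elo_win_prob(200))  # ~0.76
```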
Personally I prefer to look at benchmark scores. The Epoch Capabilities Index (which I worked on) can be handwavily thought of as ELO scores based on benchmark performance. Importantly, the data that feeds into it does not mention release date at all—we put in only benchmark performance numbers to estimate capabilities, and then plot it against release date after the fact. It also suggests AI capabilities as operationalized by handwavy-ELO are increasing linearly over time.
I guess the most likely way in which you might think capabilities are exponential is by looking at the METR time horizons result? Of course you could instead say that capabilities are linearly increasing by looking at log time horizons instead. It’s not really clear which of these units you should use.
Mostly I think you should not try to go from the METR results to “are gaps in intelligence increasing or staying constant”, but if I had to opine on this: the result says that the time horizon has a constant doubling time. One way to think about this is that the later AI can do, at a 50% success rate, work that the AI from one doubling earlier could do with 25% probability if you provide a decomposition into two half-length pieces (each of which it completes with probability 50%). I kinda feel like this suggests more like “constant gap” rather than “increasing gap”.
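A toy version of that calculation, as a minimal sketch under the assumption (mine, not METR’s) that success on the two pieces is independent:

```python
# Earlier model: completes each half-length piece with probability 0.5, so
# (assuming independence) it completes the full task with probability 0.25.
# Later model (one doubling later): completes the full task at 0.5 directly.
p_piece = 0.5
p_earlier_on_full_task = p_piece * p_piece  # 0.25
p_later_on_full_task = 0.5

print(p_earlier_on_full_task, p_later_on_full_task)  # 0.25 0.5
```

The point is that this 0.5-vs-0.25 relationship is the same between any two successive doublings, which is the sense in which the gap looks constant.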
Note that I do expect the first two trends to become superlinear eventually via an intelligence explosion, and the METR trend to become superexponential / superlinear (depending on units) probably some time before that (though probably we will just become unable to measure it well). But your claim seems to be about current progress, and for current progress I think it’s basically not true that the gap between successive models is increasing rather than staying constant.
Even in the intelligence explosion, capabilities progress is only superlinear w.r.t time, I expect it would still behave in the same way w.r.t inputs like compute and labor (where automated researchers should also count as labor). I’m not sure how I expect it to behave w.r.t successive model generations, partly because I’m not sure “successive model generations” will even be a sensible abstraction at that point. In any case, I don’t expect that to be particularly important in assessing the chances of success of a bootstrapping-type plan.
I am overall skeptical that the CoT will be that useful (since it’s not clear to me that a model doing AI safety research sabotage has to put that in the CoT).
It seems pretty wild to go from “it is possible for an AI to subvert a technique” to “the technique will not be that useful”. Is that really what you mean? Are you bearish on all control work?
(At the object-level, I’d say that you’re drastically limiting the power of the research sabotage that can be done if the model is forced not to put any of it in the CoT, and you should be very happy about this even if the model can still do some research sabotage.)
Tbc, I also don’t expect CoT to be that useful for longer-term concerns, but that’s mostly because I expect CoTs to become extremely illegible or to stop existing altogether (partly due to my having “long” timelines; on Anthropic-level short timelines I’d be quite bullish on CoT).
[1] Though I don’t know that much about LMArena and I expect in practice there are confounders, e.g. as they change the distribution of models that are being evaluated the meaning of the scores will change.
Just switch to log_10 space and add? Eg 10k times 300k = 1e4 * 3e5 = 3e9 = 3B. A bit slower but doesn’t require any drills.
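A quick sketch of the same trick, using the numbers from the example above:

```python
import math

a, b = 10_000, 300_000

# Multiply by adding in log10 space, then exponentiating back.
log_product = math.log10(a) + math.log10(b)  # 4.0 + ~5.48 ≈ 9.48
product = 10 ** log_product                  # ≈ 3e9, i.e. 3 billion

print(f"{product:.3g}")  # 3e+09
```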
Seb is explicitly talking about AGI and not ASI. It’s right there in the tweet.
Most people in policy and governance are not talking about what happens after an intelligence explosion. There are many voices in AI policy and governance and lots of them say dumb stuff, e.g. I expect someone has said the next generation of AIs will cause huge unemployment. Comparative advantage is indeed one reasonable thing to discuss in response to that conversation.
Stop assuming that everything anyone says about AI must clearly be a response to Yudkowsky.
GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash
it’d be fine if you held alignment constant but dialed up capabilities.
I don’t know what this means so I can’t give you a prediction about it.
I don’t really see why it’s relevant how aligned Claude is if we’re not thinking about that as part of it
I just named three reasons:
Current models do not provide much evidence one way or another for existential risk from misalignment (in contrast to frequent claims that “the doomers were right”)
Given tremendous uncertainty, our best guess should be that future models are like current models, and so future models will not try to take over, and so existential risk from misalignment is low
Some particular threat model predicted that even at current capabilities we should see significant misalignment, but we don’t see this, which is evidence against that particular threat model.
Is it relevant to the object-level question of “how hard is aligning a superintelligence”? No, not really. But people are often talking about many things other than that question.
For example, is it relevant to “how much should I defer to doomers”? Yes absolutely (see e.g. #1).
I bucket this under “given this ratio of right/wrong responses, you think a smart alignment researcher who’s paying attention can keep it in a corrigibility basin even as capability levels rise?”. Does that feel inaccurate, or, just, not how you’d exactly put it?
I’m pretty sure it is not that. When people say this, they are usually just asking the question “Will current models try to take over or otherwise subvert our control (including incompetently)?” and noticing that the answer is basically “no”.[1] What they use this to argue for can then vary:
Current models do not provide much evidence one way or another for existential risk from misalignment (in contrast to frequent claims that “the doomers were right”)
Given tremendous uncertainty, our best guess should be that future models are like current models, and so future models will not try to take over, and so existential risk from misalignment is low
Some particular threat model predicted that even at current capabilities we should see significant misalignment, but we don’t see this, which is evidence against that particular threat model.[2]
I agree with (1), disagree with (2) when (2) is applied to superintelligence, and for (3) it depends on details.
In Leo’s case in particular, I don’t think he’s using the observation for much; it’s mostly a throwaway claim that’s part of the flow of the comment. Inasmuch as it is being used, it is to say something like “current AIs aren’t trying to subvert our control, so it’s not completely implausible on the face of it that the first automated alignment researcher to which we delegate won’t try to subvert our control”. That’s a pretty weak claim, seems fine, and doesn’t imply any kind of extrapolation to superintelligence. I’d be surprised if this were an important disagreement with the “alignment is hard” crowd.
[1] There are demos of models doing stuff like this (e.g. blackmail) but only under conditions selected highly adversarially. These look fragile enough that overall I’d still say current models are more aligned than e.g. rationalists (who under adversarially selected conditions have been known to intentionally murder people).
[2] E.g. one naive threat model says “Orthogonality says that an AI system’s goals are completely independent of its capabilities, so we should expect that current AI systems have random goals, which by fragility of value will then be misaligned”. Setting aside whether anyone ever believed in such a naive threat model, I think we can agree that current models are evidence against such a threat model.
Which, importantly, includes every fruit of our science and technology.
I don’t think this is the right comparison, since modern science / technology is a collective effort and so can only cumulate thinking through mostly-interpretable steps. (This may also be true for AI, but if so then you get interpretability by default, at least interpretability-to-the-AIs, at which point you are very likely better off trying to build AIs that can explain that to humans.)
In contrast, I’d expect individual steps of scientific progress that happen within a single mind often are very uninterpretable (see e.g. “research taste”).
If we understood an external superhuman world-model as well as a human understands their own world-model, I think that’d obviously get us access to tons of novel knowledge.
Sure, I agree with that, but “getting access to tons of novel knowledge” is nowhere close to “can compete with the current paradigm of building AI”, which seems like the appropriate bar given you are trying to “produce a different tool powerful enough to get us out of the current mess”.
Perhaps concretely I’d wildly guess with huge uncertainty that this would involve an alignment tax of ~4 GPTs, in the sense that if you had an interpretable world model from GPT-10 similar in quality to a human’s understanding of their own world model, that would be similarly useful as GPT-6.
a human’s world-model is symbolically interpretable by the human mind containing it.
Say what now? This seems very false:
See almost anything physical (riding a bike, picking things up, touch typing a keyboard, etc). If you have a dominant hand / leg, try doing some standard tasks with the non-dominant hand / leg. Seems like if the human mind could symbolically interpret its own world model this should be much easier to do.
Basically anything to do with vision / senses. Presumably if vision was symbolically interpretable to the mind then there wouldn’t be much of a skill ladder to climb for things like painting.
Symbolic grammar usually has to be explicitly taught to people, even though ~everyone has a world model that clearly includes grammar (in the sense that they can generate grammatical sentences and identify errors in grammar)
Tbc I can believe it’s true in some cases, e.g. I could believe that some humans’ far-mode abstract world models are approximately symbolically interpretable to their mind (though I don’t think mine is). But it seems false in the vast majority of domains (if you are measuring relative to competent, experienced people in those domains, as seems necessary if you are aiming for your system to outperform what humans can do).
What exactly do you propose that a Bayesian should do, upon receiving the observation that a bounded search for examples within a space did not find any such example?
(I agree that it is better if you can instead construct a tight logical argument, but usually that is not an option.)
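To make the Bayesian point concrete, here is a toy update with made-up numbers (not a claim about any particular search):

```python
# H = "strong examples are common in the space"; not-H = "they are rare".
prior_h = 0.5

# Assumed likelihoods of a bounded search coming up empty under each hypothesis.
p_empty_given_h = 0.2      # if examples were common, the search would likely find one
p_empty_given_not_h = 0.9  # if they are rare, an empty search is unsurprising

# Bayes' rule: the empty search shifts us toward "examples are rare",
# without taking us anywhere near certainty.
posterior_h = (p_empty_given_h * prior_h) / (
    p_empty_given_h * prior_h + p_empty_given_not_h * (1 - prior_h)
)
print(round(posterior_h, 2))  # 0.18
```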
I also don’t find the examples very compelling:
Security mindset—Afaict the examples here are fictional
Superforecasters—In my experience, superforecasters have all kinds of diverse reasons for low p(doom), some good, many bad. The one you describe doesn’t seem particularly common.
Rethink—Idk the details here, will pass
Fatima Sun Miracle: I’ll just quote Scott Alexander’s own words in the post you link:
I will admit my bias: I hope the visions of Fatima were untrue, and therefore I must also hope the Miracle of the Sun was a fake. But I’ll also admit this: at times when doing this research, I was genuinely scared and confused. If at this point you’re also scared and confused, then I’ve done my job as a writer and successfully presented the key insight of Rationalism: “It ain’t a true crisis of faith unless it could go either way”.
[...]
I don’t think we have devastated the miracle believers. We have, at best, mildly irritated them. If we are lucky, we have posited a very tenuous, skeletal draft of a materialist explanation of Fatima that does not immediately collapse upon the slightest exposure to the data. It will be for the next century’s worth of scholars to flesh it out more fully.
Overall, I’m pleasantly surprised by how bad these examples are. I would have expected much stronger examples, since on priors I expected that many people would in fact follow EFAs off a cliff, rather than treating them as evidence of moderate but not overwhelming strength. To put it another way, I expected that your FA on examples of bad EFAs would find more and/or stronger hits than it actually did, and in my attempt to better approximate Bayesianism I am noticing this observation and updating on it.
It’s instead arguing with the people who are imagining something like “business continues sort of as usual in a decentralized fashion, just faster, things are complicated and messy, but we muddle through somehow, and the result is okay.”
The argument for this position is more like: “we never have a ‘solution’ that gives us justified confidence that the AI will be aligned, but when we build the AIs, the AIs turn out to be aligned anyway”.
You seem to instead be assuming “we don’t get a ‘solution’, and so we build ASI and all instances of ASI are mostly misaligned but a bit nice, and so most people die”. I probably disagree with that position too, but imo it’s not an especially interesting position to debate, as I do agree that building ASI that is mostly misaligned but a bit nice is a bad outcome that we should try hard to prevent.
Yeah, that’s fair for agendas that want to directly study the circumstances that lead to scheming. Though when thinking about those agendas, I do find myself more optimistic because they likely do not have to deal with long time horizons, whereas capabilities work likely will have to engage with that.
Note many alignment agendas don’t need to actually study potential schemers. Amplified oversight can make substantial progress without studying actual schemers (but probably will face the long horizon problem). Interpretability can make lots of foundational progress without schemers, that I would expect to mostly generalize to schemers. Control can make progress with models prompted or trained to be malicious.
(I have the same critique of the first two paragraphs, but thanks for the edit, it helps)
Fwiw, I am actively surprised that you have a p(doom) < 50%; I can name several lines of evidence in the opposite direction:
You’ve previously tried to define alignment based on worst-case focus and scientific approach. This suggests you believe that “marginalist” / “engineering” approaches are ~useless, from which I inferred (incorrectly) that you would have a high p(doom).
I still find the conjunction of the two positions you hold pretty weird.
I’m a strong believer in logistic success curves for complex situations. If you’re in the middle part of a logistic success curve in a complex situation, then there should be many things that can be done to improve the situation, and it seems like “engineering” approaches should work.
It’s certainly possible to have situations that prevent this. Maybe you have a bimodal distribution, e.g. 70% on “near-guaranteed fine by default” and 30% on “near-guaranteed doom by default”. Maybe you think that people have approximately zero ability to tell which things are improvements. Maybe you think we are at the far end of the logistic success curve today, but timelines are long and we’ll do the necessary science in time. But these views seem kinda exotic and unlikely to be someone’s actual views. (Idk maybe you do believe the second one.)
Obviously I had not thought through this in detail when I originally wrote my comment, and my wordless inference was overconfident in hindsight. But I stand by my overall sense that a person who thinks “engineering” approaches are near-useless will likely also have high p(doom) -- not just as a sociological observation, but also as a claim about which positions are consistent with each other.
In your writing you sometimes seem to take as a background assumption that alignment will be very hard. For example, I recall you critiquing assistance games because (my paraphrase) “that’s not what progress on a hard problem looks like”. (I failed to dig up the citation though.)
You’re generally taking a strategy that appears to me to be high variance, which people usually justify via high p(doom) / playing to your outs.
A lot of your writing is similarly flavored to other people who have high p(doom).
In terms of evidence that you have a p(doom) < 50%, I think the main thing that comes to mind is that you argued against Eliezer about this in late 2021, but that was quite a while ago (relative to the evidence above) and I thought you had changed your mind. (Also iirc the stuff you said then was consistent with p(doom) ~ 50%, but it’s long enough ago that I could easily be forgetting things.)
You could point to ~any reasonable subcommunity within AI safety (or the entire community) and I’d still be on board with the claim that there’s at least a 10% chance that it will make things worse, which I might summarize as “they won’t reliably improve things”, so I still feel like this isn’t quite capturing the distinction. (I’d include communities focused on “science” in that, but I do agree that they are more likely not to have a negative sign.) So I still feel confused about what exactly your position is.