Rohin Shah
Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
I was under the impression you expected slower catch-up progress.
Note that I think the target we’re making quantitative forecasts about will tend to overestimate that-which-I-consider-to-be “catch-up algorithmic progress” so I do expect slower catch-up progress than the naive inference from my forecast (ofc maybe you already factored that in).
Thanks, I hadn’t looked at the leave-one-out results carefully enough. I agree this (and your follow-up analysis rerun) means my claim is incorrect. Looking more closely at the graphs, in the case of Llama 3.1, I should have noticed that EXAONE 4.0 (1.2B) was also a pretty key data point for that line. No idea what’s going on with that model.
(That said, I do think going from 1.76 to 1.64 after dropping just two data points is a pretty significant change; also, I assume this is really just attributable to Grok 3, so it’s really more like one data point. Of course the median won’t change, and I do prefer the median estimate because it is more robust to these outliers.)
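(As a toy illustration of why I prefer the median, with made-up numbers rather than the actual leave-one-out values: dropping one extreme data point moves a mean a lot while barely moving the median.)

```python
import statistics

# Hypothetical per-model estimates with one extreme outlier; numbers are illustrative only.
estimates = [1.3, 1.5, 1.6, 1.7, 1.9, 2.0, 8.0]
trimmed = [x for x in estimates if x != 8.0]  # "leave one out": drop the outlier

print(statistics.mean(estimates), statistics.mean(trimmed))      # ~2.57 vs ~1.67: the mean shifts a lot
print(statistics.median(estimates), statistics.median(trimmed))  # 1.7 vs 1.65: the median barely moves
```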
There’s a related point, which is maybe what you’re getting at, which is that these results suffer from the exclusion of proprietary models for which we don’t have good compute estimates.
I agree this is a weakness but I don’t care about it too much (except inasmuch as it causes us to estimate algorithmic progress by starting with models like Grok). I’d usually expect it to cause estimates to be biased downwards (that is, the true number is higher than estimated).
Another reason to think these models are not the main driver of the results is that there are high slopes in capability buckets that don’t include these models, such as 30, 35, 40 (log10 slopes of 1.22, 1.41, 1.22).
This corresponds to a 16-26x drop in cost per year? Those estimates seem reasonable (maybe slightly high) given you’re measuring the drop in cost to achieve benchmark scores.
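(For concreteness, the conversion I’m doing here is just exponentiating the log10 slopes quoted above; a quick back-of-envelope check, nothing more.)

```python
# Converting the reported log10 slopes into per-year cost-reduction factors.
# Purely illustrative arithmetic; the slopes are the ones quoted above.
slopes = [1.22, 1.41, 1.22]  # log10(cost reduction per year) for the 30/35/40 capability buckets
for s in slopes:
    print(f"log10 slope {s} -> {10 ** s:.1f}x cheaper per year")
# 1.22 -> ~16.6x and 1.41 -> ~25.7x, hence the 16-26x range.
```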
I do think that this is an overestimate of catch-up algorithmic progress for a variety of reasons:
Later models are more likely to be benchmaxxed
(Probably not a big factor, but who knows) Benchmarks get more contaminated over time
Later models are more likely to have reasoning training
None of these apply to the pretraining based analysis, though of course it is biased in the other direction (if you care about catch-up algorithmic progress) by not taking into account distillation or post-training.
I do think 3x is too low as an estimate for catch-up algorithmic progress; inasmuch as your main claim is “it’s a lot bigger than 3x”, I’m on board with that.
Speaking colloquially, I might say “these results indicate to me that catch-up algorithmic progress is on the order of one or 1.5 orders of magnitude per year rather than half an order of magnitude per year as I used to think”. And again, my previous belief of 3× per year was a belief that I should have known was incorrect because it’s based only on pre-training.
Okay fair enough, I agree with that.
As I said in footnote 9, I am willing to make bets about my all-things-considered beliefs. You think I’m updating too much based on unreliable methods? Okay come take my money.
I find this attitude weird. It takes a lot of time to actually make and settle a bet. (E.g. I don’t pay attention to Artificial Analysis and would want to know something about how they compute their numbers.) I value my time quite highly; I think one of us would have to be betting seven figures, maybe six figures if the disagreement was big enough, before it looked good even in expectation (ie no risk aversion) as a way for me to turn time into money.
I think it’s more reasonable as a matter of group rationality to ask that an interlocutor say what they believe, so in that spirit here’s my version of your prediction, where I’ll take your data at face value without checking:
[DeepSeek-V3.2-Exp is estimated by Epoch to be trained with 3.8e24 FLOP. It reached an AAII index score of 65.9 and was released on September 29, 2025. It is on the compute-efficiency frontier.] I predict that by September 29, 2026, the least-compute-used-to-train model that reaches a score of 65 will be trained with around 3e23 FLOP, with the 80% CI covering 6e22–1e24 FLOP.
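(Taking those numbers at face value, here is the rough per-year multiplier the prediction implies; this back-of-envelope conversion is mine and isn’t part of the prediction itself.)

```python
# What the prediction above implies in "x less compute per year" terms,
# using the FLOP figures stated above. Illustrative only.
reference = 3.8e24                 # DeepSeek-V3.2-Exp training compute (Epoch estimate, quoted above)
point, lo, hi = 3e23, 6e22, 1e24   # predicted least-compute model reaching a score of 65, one year later
print(f"point estimate: {reference / point:.1f}x")                # ~12.7x per year
print(f"80% CI: {reference / hi:.1f}x to {reference / lo:.1f}x")  # ~3.8x to ~63x per year
```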
Note that I’m implicitly doing a bunch of deference to you here (e.g. that this is a reasonable model to choose, that AAII will behave reasonably regularly and predictably over the next year), though tbc I’m also using other not-in-post heuristics (e.g. expecting that DeepSeek models will be more compute efficient than most). So, I wouldn’t exactly consider this equivalent to a bet, but I do think it’s something where people can and should use it to judge track records.
Your results are primarily driven by the inclusion of Llama 3.1-405B and Grok 3 (and Grok 4 when you include it). Those models are widely believed to be cases where a ton of compute was poured in to make up for poor algorithmic efficiency. If you remove those, I expect your methodology would produce similar results to prior work (which is usually trying to estimate progress at the frontier of algorithmic efficiency, rather than efficiency progress at the frontier of capabilities).

I could imagine a reply that says “well, it’s a real fact that when you start with a model like Grok 3, the next models to reach a similar capability level will be much more efficient”. And this is true! But if you care about that fact, I think you should instead have two stylized facts, one about what happens when you are catching up to Grok or Llama, and one about what happens when you are catching up to GPT, Claude, or Gemini, rather than trying to combine these into a single estimate that doesn’t describe either case.
Your detailed results are also screaming at you that your method is not reliable. It is really not a good sign when your analysis, which by construction has to give numbers in a particular range, produces results that on the low end include 1.154, 2.112, and 3.201, and on the high end include 19,399.837 and even (if you include Grok 4) 2.13E+09 and 2.65E+16 (!!).
I find this reasoning unconvincing because their appendix analysis (like that in this blog post) is based on more AI models than their primary analysis!
The primary evidence that the method is unreliable is not that the dataset is too small, it’s that the results span such a wide interval, and it seems very sensitive to choices that shouldn’t matter much.
Cool result!
these results demonstrate a case where LLMs can do (very basic) meta-cognition without CoT
Why do you believe this is meta-cognition? (Or maybe the question is, what do you mean by meta-cognition?)
It seems like it could easily be something else. For example, probably when solving problems the model looks at the past strategies it has used and tries some other strategy to increase the likelihood of solving the problem. It does this primarily in the token space (looking at past reasoning and trying new stuff) but this also generalizes somewhat to the activation space (looking at what past forward passes did and trying something else). So when you have filler tokens the latter effect still happens, giving a slight best-of-N type boost, producing your observed results.
Filler tokens don’t allow for serially deeper cognition than what architectural limits allow
This depends on your definition of serial cognition; under the definitions I like most, the serial depth scales logarithmically with the number of tokens. This is because as you increase parallelism (in the sense you use above), that also increases serial depth logarithmically.
The basic intuitions for this are:
If you imagine mechanistically how you would add N numbers together, it seems like you would need logarithmic depth (where you recursively split the list in half, compute the sums of each half, and then add the results together; see the sketch after this list). Note that attention heads compute sums over a number of terms that scales with the number of tokens.
There are physical limits to the total amount of local computation that can be done in a given amount of time, due to speed-of-light constraints. So inasmuch as “serial depth” is supposed to capture intuitively what computation you can do with “time”, it seems like serial depth should increase as total computation goes up (reflecting the fact that you have to break up the computation into local pieces, as in the addition example above).
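Here is a minimal sketch of the first intuition (pairwise reduction); the function name and numbers are just for illustration:

```python
# Summing N numbers by recursively pairing them up takes ~log2(N) serial steps,
# even though the total amount of work is still O(N).
def pairwise_sum(xs):
    depth = 0
    while len(xs) > 1:
        # one "serial step": all adjacent pairs are summed (in parallel, conceptually)
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)] + ([xs[-1]] if len(xs) % 2 else [])
        depth += 1
    return xs[0], depth

total, depth = pairwise_sum(list(range(1, 1025)))
print(total, depth)  # 524800, 10 -- i.e. log2(1024) = 10 serial steps for 1024 numbers
```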
Ah, I realized there was something else I should have highlighted. You mention you care about pre-ChatGPT takes towards shorter timelines—while compute-centric takeoff was published two months after ChatGPT, I expect that the basic argument structure and conclusions were present well before the release of ChatGPT.
While I didn’t observe that report in particular, in general Open Phil worldview investigations took > 1 year of serial time and involved a pretty significant and time-consuming “last mile” step where they get a bunch of expert review before publication. (You probably observed this “last mile” step with Joe Carlsmith’s report, iirc Nate was one of the expert reviewers for that report.) Also, Tom Davidson’s previous publications were in March 2021 and June 2021, so I expect he was working on the topic for some of 2021 and ~all of 2022.
I suppose a sufficiently cynical observer might say “ah, clearly Open Phil was averse to publishing this report that suggests short timelines and intelligence explosions until after the ChatGPT moment”. I don’t buy it, based on my observations of the worldview investigations team (I realize that it might not have been up to the worldview investigations team, but I still don’t buy it).
I guess one legible argument I could make to the cynic would be that on the cynical viewpoint, it should have taken Open Phil a lot longer to realize they should publish the compute-centric takeoff post. Does the cynic really think that, in just two months, a big broken org would be able to:
Observe that people no longer make faces at shorter timelines
Have the high-level strategy-setting execs realize that they should change their strategy to publish more shorter timelines stuff
Communicate this to the broader org
Have a lower-level person realize that the internal compute-centric takeoff report can now be published when previously it was squashed
Update the report to give it the level of polish that it observably has
Get it through the comms bureaucracy, which is probably still operating on past heuristics and hasn’t figured out what to do in this new world
That’s just so incredibly fast for big broken orgs to move.
I think I agree with all of that under the definitions you’re using (and I too prefer the bounded rationality version). I think in practice I was using words somewhat differently than you.
(The rest of this comment is at the object level and is mostly for other readers, not for you)
Saying it’s “crazy” means it’s low probability of being (part of) the right world-description.
The “right” world-description is a very high bar (all models are wrong but some are useful), but if I go with the spirit of what you’re saying, I think I might not endorse calling bio anchors “crazy” by this definition; I’d say more like “medium” probability of being a generally good framework for thinking about the domain, plus an expectation that lots of the specific details would change with more investigation.
Honestly I didn’t have any really precise meaning by “crazy” in my original comment, I was mainly using it as a shorthand to gesture at the fact that the claim is in tension with reductionist intuitions, and also that the legibly written support for the claim is weak in an absolute sense.
Saying it’s “the best we have” means it’s the clearest model we have—the most fleshed-out hypothesis.
I meant a higher bar than this; more like “the most informative and relevant thing for informing your views on the topic” (beyond extremely basic stuff like observing that humanity can do science at all, or things like reference class priors). Like, I also claim it is better than “query your intuitions about how close we are to AGI, and how fast we are going, to come up with a time until we get to AGI”. So it’s not just the clearest / most fleshed-out, it’s also the one that should move you the most, even including various illegible or intuition-driven arguments. (Obviously scoped only to the arguments I know about; for all I know other people have better arguments that I haven’t seen.)
If it were merely the clearest model or most fleshed-out hypothesis, I agree it would usually be a mistake to make a large belief update or take big consequential actions on that basis.
I also want to qualify / explain my statement about it being a crazy argument. The specific part that worries me (and Eliezer, iiuc) is the claim that, at a given point in time, the delta between natural and artificial artifacts will tend to be approximately constant across different domains. This is quite intuition-bending from a mechanistic / reductionist viewpoint, and the current support for it seems very small and fragile (this 8 page doc). However, I can see a path where I would believe in it much more, which would involve things like:
Pinning down the exact methodology: Which metrics are we making this “approximately constant delta” claim for? It’s definitely not arbitrary metrics.
The doc has an explanation, but I don’t currently feel like I could go and replicate it myself while staying faithful to the original intuitions, so I currently feel like I am deferring to the authors of the doc on how they are choosing their metrics.
Once we do this, can we explain how we’re applying it in the AI case? Why should we anchor brain size to neural net size, instead of e.g. brain training flops to neural net training flops?
More precise estimates: Iirc there is a lot of “we used this incredibly heuristic argument to pull out a number, it’s probably not off by OOMs so that’s fine for our purposes”, which I think is reasonable but makes me uneasy.
Get more data points: Surely there are more artifacts to compare.
Check for time invariance: Repeat the analysis at different points in time—do we see a similar “approximately constant gap” if we look at manmade artifacts from (say) 2000 or 1950? Equivalently, this theory predicts that the rate of progress on the chosen metrics should be similar across domains; is that empirically supported?
Flesh out a semi-mechanistic theory (that makes this prediction).
The argument would be something like: even if you have two very very different optimization procedures (evolution vs human intelligence), as long as they are far from optimality, it is reasonable to model them via a single quantity (“optimization power” / “effective fraction of the search space covered”) which is the primary determinant of the performance you get, irrespective of what domain you are in (as long as the search space in the domain is sufficiently large / detailed / complicated). As a result, as long as you focus on metrics that both evolution and humans were optimizing, you should expect the difference in performance on the metric to be primarily a function of the difference in optimization power between evolution and humans-at-that-point-in-time, and to be approximately independent of the domain.
Once you have such a theory, check whether the bio anchors application is sensible according to the theory.
I anticipate someone asking the followup “why didn’t Open Phil do that, then?” I don’t know what Open Phil was thinking, but I don’t think I’d have made a very different decision. It’s a lot of work, not many people can do it, and many of those people had better things to do, e.g. imo the compute-centric takeoff work was indeed more important and caused bigger updates than I think the work above would have done (and was probably easier to do).
Would you agree that updating / investing “a lot” in an argument that’s kind of crazy in some absolute sense, would be an epistemic / strategic mistake, even if that argument is the best available specific argument in a relative sense?
Hmm, maybe? What exactly is the alternative?
Some things that I think would usually be epistemic / strategic mistakes in this situation:
Directly adopting the resulting distribution
Not looking into other arguments
Taking actions that would be significantly negative in “nearby” worlds that the argument suggests are unlikely. (The “nearby” qualifier is to avoid problem-of-induction style issues.)
Some things that I don’t think would immediately qualify as epistemic / strategic mistakes (of course they could still be mistakes depending on further details):
Making a large belief update. Generally it seems plausible that you start with some very inchoate opinions (or in Bayesian terms an uninformed prior) and so any argument that seems to have some binding to reality, even a pretty wild and error-prone one, can still cause a big update. (Related: The First Sample Gives the Most Information.)
This is not quite how I felt about bio anchors in 2020 -- I do think there was some binding-to-reality in arguments like “make a wild guess about how far we are, and then extrapolate AI progress out and see when it reaches that point”—but it is close.
Taking consequential actions as a result of the argument. Ultimately we are often faced with consequential decisions, and inaction is also a choice. Obviously you should try to find actions that don’t depend on the particular axis you’re uncertain about (in this case timelines), but sometimes that would still leave significant potential value on the table. In that case you should use the best info available to you.
I actually think this is one of the biggest strengths of Open Phil around 2016-2020. At that time, (1) the vast, vast majority of people believed that AGI was a long time away (especially as revealed by their actions, but also via stated beliefs), and (2) AI timelines was an incredibly cursed topic where there were few arguments that had any semblance of binding to reality. Nevertheless, they saw some arguments that had some tiny bit of binding to reality, concluded that there was a non-trivial chance of AGI within 2 decades, and threw a bunch of effort behind acting on that scenario because it would be so important if it happened.
So imo they did invest “a lot” in the belief that AGI could be soon (though of course that wasn’t based just on bio anchors), and I think this looks like a great call in hindsight.
Some takes:
To answer the main question, I agree circa 2020 if you had to ascribe beliefs to Open Phil the org it would be something like “~30 year median timelines” and “~10% utter AI ruin”. (And I had some visibility on this beyond public info.) I think this throws away tons of detail about actual Open Phil at the time but imo it’s a reasonable move to make.
I don’t think Open Phil had a goal of “promoting” those beliefs, and don’t recall them doing things that would be reasonably considered to be primarily about “promoting” those beliefs.
Obviously they published reports about those topics, but the reason for that is to advance discourse / collective epistemics.
Iirc, bio anchors was the only one that said ~30 year median timelines, I believe semi-informative priors was significantly longer and compute-centric takeoff was significantly shorter.
I don’t think these beliefs made much of a difference to their grant making on object level technical research (relative to beliefs of shorter timelines / higher probability of ruin), just because people almost always want to act on shorter timelines / higher ruin probabilities (than 10%) because those worlds are easier to influence. This is similar to Buck’s take so see his comment for elaboration.
I kinda suspect that upon reading the previous bullet you will say “sure, but the researchers will have contorted their beliefs to make themselves palatable to Open Phil”.
They cared a lot about avoiding this. At one point while I was at CHAI (funded by Open Phil), we were talking with someone from Open Phil and I asked about their belief on something, and they declined to answer because they didn’t want their grantees overfitting to their beliefs. I actually found it quite frustrating and tried to argue with them that we were in fact very opinionated and weren’t about to simply adopt their beliefs without questioning them.
Of course it is possible that researchers conformed to Open Phil’s beliefs anyway.
Certainly there was a lot of deference from some communities to Open Phil’s views here, though I would guess that was driven more by their reputation as good thinkers rather than their ability to direct funding. I think there’s a ton of deference to you (Eliezer) in the rationalist community, for similar reasons. Both groups tend to underestimate how much deference they are doing.
Maybe you are also making some claim about Open Phil’s (un)willingness to fund work more focused on comms or influencing social reality towards shorter timelines / higher probabilities of ruin? I don’t have any knowledge about that, but mostly I don’t recall many shovel-ready opportunities being available in 2020.
On the actual object level beliefs:
On timelines, my rough take is “nobody knows, it’s not particularly important for technical work beyond ‘could be soon’, it’s often actively negative to discuss because it polarizes people and wastes time, so ignore it where possible”. But if I’m forced to give an opinion then I’m most influenced by the bio anchors lineage of work and the METR time horizon work.
Generally my stance is roughly: Bio anchors / compute-centric timelines is the worst form of timelines prediction, except all those other approaches that have been tried from time to time.
I agree that the timelines in the original bio anchors report were too long, but mainly because it failed to capture acceleration caused by AI prior to TAI. This was pointed out by Open Phil, specifically in Tom Davidson’s compute-centric takeoff speeds report.
Your views are presumably captured by your critique of bio anchors. I don’t find it convincing; my feelings about that critique are similar to Scott Alexander’s. In the section “Response 4: me”, he says: “Given these two assumptions—that natural artifacts usually have efficiencies within a few OOM of artificial ones, and that compute drives progress pretty reliably—I am proud to be able to give Ajeya’s report the coveted honor of “I do not make an update of literally zero upon reading it”.”
In particular, the primary rejoinder to “why should bio anchors bind to reality; people do things via completely different mechanisms than nature” is “we tried applying this style of reasoning in a few cases where we know the answer, and it seems like it kinda sorta works”.
(Remember that I only want to defend “worst form of timelines prediction except all the other approaches”. I agree this is kind of a crazy argument in some absolute sense.)
So in my view Open Phil looks great in retrospect on this question.
I still have a pretty low probability on AI ruin (by the standards of LessWrong), as I have for a long time, since we haven’t gotten much empirical evidence on the matter. So the 10% AI ruin seems fine in retrospect to me.
Best source on the reasons for disagreement is Where I agree and disagree with Eliezer (written by Paul but I endorsed almost all of it in a comment).
Best source that comes from me is this conversation with AI Impacts, though note it was in 2019 and I wouldn’t endorse all of it any more.
The first two paragraphs of my original comment were trying to do this.
(I have the same critique of the first two paragraphs, but thanks for the edit, it helps)
The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).
Fwiw, I am actively surprised that you have a p(doom) < 50%, I can name several lines of evidence in the opposite direction:
You’ve previously tried to define alignment based on worst-case focus and scientific approach. This suggests you believe that “marginalist” / “engineering” approaches are ~useless, from which I inferred (incorrectly) that you would have a high p(doom).
I still find the conjunction of the two positions you hold pretty weird.
I’m a strong believer in logistic success curves for complex situations. If you’re in the middle part of a logistic success curve in a complex situation, then there should be many things that can be done to improve the situation, and it seems like “engineering” approaches should work.
It’s certainly possible to have situations that prevent this. Maybe you have a bimodal distribution, e.g. 70% on “near-guaranteed fine by default” and 30% on “near-guaranteed doom by default”. Maybe you think that people have approximately zero ability to tell which things are improvements. Maybe you think we are at the far end of the logistic success curve today, but timelines are long and we’ll do the necessary science in time. But these views seem kinda exotic and unlikely to be someone’s actual views. (Idk maybe you do believe the second one.)
Obviously I had not thought through this in detail when I originally wrote my comment, and my wordless inference was overconfident in hindsight. But I stand by my overall sense that a person who thinks “engineering” approaches are near-useless will likely also have high p(doom) -- not just as a sociological observation, but also as a claim about which positions are consistent with each other.
In your writing you sometimes seem to take as a background assumption that alignment will be very hard. For example, I recall you critiquing assistance games because (my paraphrase) “that’s not what progress on a hard problem looks like”. (I failed to dig up the citation though.)
You’re generally taking a strategy that appears to me to be high variance, which people usually justify via high p(doom) / playing to your outs.
A lot of your writing is similarly flavored to other people who have high p(doom).
In terms of evidence that you have a p(doom) < 50%, I think the main thing that comes to mind is that you argued against Eliezer about this in late 2021, but that was quite a while ago (relative to the evidence above) and I thought you had changed your mind. (Also iirc the stuff you said then was consistent with p(doom) ~ 50%, but it’s long enough ago that I could easily be forgetting things.)
However, I’d guess that I’m more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won’t reliably improve things.
You could point to ~any reasonable subcommunity within AI safety (or the entire community) and I’d still be on board with the claim that there’s at least a 10% chance that it will make things worse, which I might summarize as “they won’t reliably improve things”, so I still feel like this isn’t quite capturing the distinction. (I’d include communities focused on “science” in that, but I do agree that they are more likely not to have a negative sign.) So I still feel confused about what exactly your position is.
On reflection, it’s not actually about which position is more common. My real objection is that imo it was pretty obvious that something along these lines would be the crux between you and Neel (and the fact that it is a common position is part of why I think it was obvious).
Inasmuch as you are actually trying to have a conversation with Neel or address Neel’s argument on its merits, it would be good to be clear that this is the crux. I guess perhaps you might just not care about that and are instead trying to influence readers without engaging with the OP’s point of view, in which case fair enough. Personally I would find that distasteful / not in keeping with my norms around collective-epistemics but I do admit it’s within LW norms.
(Incidentally, I feel like you still aren’t quite pinning down your position—depending on what you mean by “reliably” I would probably agree with “marginalist approaches don’t reliably improve things”. I’d also agree with “X doesn’t reliably improve things” for almost any interesting value of X.)
What exactly do you mean by ambitious mech interp, and what does it enable? You focus on debugging here, but you didn’t title the post “an ambitious vision for debugging”, and indeed I think a vision for debugging would look quite different.
For example, you might say that the goal is to have “full human understanding” of the AI system, such that some specific human can answer arbitrary questions about the AI system (without just delegating to some other system). To this I’d reply that this seems like an unattainable goal; reality is very detailed, AIs inherit a lot of that detail, a human can’t contain all of it.
Maybe you’d say “actually, the human just has to be able to answer any specific question given a lot of time to do so”, so that the human doesn’t have to contain all the detail of the AI, and can just load in the relevant detail for a given question. To do this perfectly, you still need to contain the detail of the AI, because you need to argue that there’s no hidden structure anywhere in the AI that invalidates your answer. So I still think this is an unattainable goal.
Maybe you’d then say “okay fine, but come on, surely via decent heuristic arguments, the human’s answer can get way more robust than via any of the pragmatic approaches, even if you don’t get something like a proof”. I used to be more optimistic about this but things like self-repair and negative heads make it hard in practice, not just in theory. Perhaps more fundamentally, if you’ve retreated this far back, it’s unclear to me why we’re calling this “ambitious mech interp” rather than “pragmatic interp”.
To be clear, I like most of the agendas in AMI and definitely want them to be a part of the overall portfolio, since they seem especially likely to provide new affordances. I also think many of the directions are more future-proof (ie more likely to generalize to future very different AI systems). So it’s quite plausible that we don’t disagree much on what actions to take. I mostly just dislike gesturing at “it would be so good if we had <probably impossible thing> let’s try to make it happen”.
Fair, I’ve edited the comment with a pointer. It still seems to me to be a pretty direct disagreement with “we can substantially reduce risk via [engineering-type / category 2] approaches”.
My claim is “while it certainly could be net negative (as is also the case for ~any action including e.g. donating to AMF), in aggregate it is substantially positive expected risk reduction”.
Your claim in opposition seems to be “who knows what the sign is, we should treat it as an expected zero risk reduction”.
Though possibly you are saying “it’s bad to take actions that have a chance of backfiring, we should focus much more on robustly positive things” (because something something virtue ethics?), in which case I think we have a disagreement on decision theory instead.
I still want to claim that in either case, my position is much more common (among the readership here), except inasmuch as they disagree because they think alignment is very hard and that’s why there’s expected zero (or negative) risk reduction. And so I wish you’d flag when your claims depend on these takes (though I realize it is often hard to notice when that is the case).
Makes sense, I still endorse my original comment in light of this answer (as I already expected something like this was your view). Like, I would now say
Imo the vast, vast majority of progress in the world happens via “engineering-type / category 2” approaches, so if you do think you can win via “engineering-type / category 2” approaches you should generally bias towards them
while also noting that the way we are using the phrase “engineering-type” here includes a really large amount of what most people would call “science” (e.g. it includes tons of academic work), so it is important when evaluating this claim to interpret the words “engineering” and “science” in context rather than via their usual connotations.
2) the AI safety community doesn’t try to solve alignment for smarter than human AI systems
I assume you’re referring to “whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment”.
Imo, chain of thought monitoring, AI control, amplified oversight, MONA, reasoning model interpretability, etc, are all things that could make the difference between “x-catastrophe via misalignment” and “no x-catastrophe via misalignment”, so I’d say that lots of our work could “solve misalignment”, though not necessarily in a way where we can know that we’ve solved misalignment in advance.
Based on Richard’s previous writing (e.g. 1, 2) I expect he sees this sort of stuff as not particularly interesting alignment research / doesn’t really help, so I jumped ahead in the conversation to that disagreement.
even if no one breakthrough or discovery “solves alignment”, a general frame of “let’s find principled approaches” is often more generative than “let’s find the cheapest 80/20 approach”
Sure, I broadly agree with this, and I think Neel would too. I don’t see Neel’s post as disagreeing with it, and I don’t think the list of examples that Richard gave is well described as “let’s find the cheapest 80/20 approach”.
I wish when you wrote these comments you acknowledged that some people just actually think that we can substantially reduce risk via what you call “marginalist” approaches. Not everyone agrees that you have to deeply understand intelligence from first principles else everyone dies. (EDIT: See Richard’s clarification downthread.) Depending on how you choose your reference class, I’d guess most people disagree with that.
Imo the vast, vast majority of progress in the world happens via “marginalist” approaches, so if you do think you can win via “marginalist” approaches you should generally bias towards them.
Ah yeah, I think with that one the audiences were “researchers heavily involved in AGI Safety” (LessWrong) and “ML researchers with some interest in reward hacking / safety” (Medium blog)
Yeah it’s just the reason you give, though I’d frame it slightly differently. I’d say that the point of “catch-up algorithmic progress” was to look at costs paid to get a certain level of benefit, and while historically “training compute” was a good proxy for cost, reasoning models change that since inference compute becomes decoupled from training compute.
I reread the section you linked. I agree that the tasks that models do today have a very small absolute cost such that, if they were catastrophically risky, it wouldn’t really matter how much inference compute they used. However, models are far enough from that point that I think you are better off focusing on the frontier of currently-economically-useful-tasks. In those cases, assuming you are using a good scaffold, my sense is that the absolute costs do in fact matter.