Because Twitter is hard to archive, I had ChatGPT cook up a userscript to simplify my workflow, which was to open xcancel manually and then send the page to the Internet Archive. The script adds a button to the bottom right of Twitter.
There’s an interesting dual asymmetry in cybersecurity: the defender needs to make only a single mistake to lose, and attackers can observe many targets, waiting for such mistakes. Then again, if the defender makes no mistakes, there’s literally nothing an attacker can do.
Of course the above is not strictly true: a defence-in-depth approach can sometimes make a particular mistake inconsequential. This in turn can make defenders ignore such mistakes when they’re not exploitable.
Modern software supply chains are long and wide. A typical piece of software might depend on thousands of libraries, and nobody can realistically audit them all. And there’s hardware, too. Processor-level vulnerabilities in particular are not realistically avoidable.
The cost of exploiting vulns is going down quickly. The cost of finding and fixing them is falling quickly too. It’s going to be really interesting to see what the new equilibrium looks like.
It seems that Anthropic is now in the lead; they are IMO the most likely single company to automate AI R&D first.
I’m getting increasingly concerned about Anthropic’s attitude towards alignment/safety. I grimly predict that they would basically behave like OpenBrain does in AI 2027, if they are lucky enough to get a pattern of misalignment evidence that egregious.
I grimly predict that they would basically behave like OpenBrain does in AI 2027
Whether this is true or not seems like a critically important question.
My understanding is that the “Anthropic consensus”, to the extent that such a thing exists, is that catastrophic misalignment is pretty unlikely, and that other kinds of risks stemming from powerful actors misusing AI account for most of the ways that humanity might fail to achieve its long-term potential.
I’m curious whether you consider that to be a crux: if you agreed with the “Anthropic consensus” on this point, do you think you would act in a way that is similar to the way that you’re predicting they will in fact act?
I think it’s not good to call this the “Anthropic consensus”, because many of the Anthropic people (especially the most informed ones) don’t agree with it (depending on what you mean by “pretty unlikely”).
More generally, claims like “X is consensus among group Y” are a little bit dangerous because they can force group Y into an equilibrium that they wouldn’t want to be in otherwise. Like, these claims reinforce situations where a bunch of people would’ve objected to X but didn’t object because they didn’t know anyone else would’ve objected.
Yeah I think the Anthropic Consensus is disastrously false and is going to lead to misaligned Claude takeover.
If I agreed with it, I’m not sure what I’d do if I were in charge of Anthropic. Probably I’d still do a bunch of different things, especially costly signals of genuine trustworthiness, but I admit I’d be a lot closer to their (predicted) behavior than I would be in reality.
Ok, but it’s crazy that you believe that and other people believe the Anthropic story, and we’ve had so much evidence roll in, and both sides still think that the other person is just updating massively wrong.
Like what went wrong? Have both sides just not made a case because they think it’s obvious they are right? What went so wrong that we could have more evidence about intelligence and its steerability than at any other point in human history, and still have people thinking “Oh man, how could they be so wrong?” Which of the advance predictions were misread, or not there, or did people think they were there when they weren’t?
In fairness, “What is the ultimate fate of Earth-originating intelligent life after the machine intelligence transition” is a really hard question. We can get a lot of evidence and some belief-convergence about particular AI systems, but that doesn’t really answer the ultimate-fate question, which depends on what happens far in the “subjective future” (with the AIs building AIs building AIs) even if it’s close in sidereal time.
Sorta. I don’t actually think it’s symmetric. The status and money gradients both favor Anthropic over Redwood, AIFP, etc. As does the not-being-a-weirdo gradient.
People disagree with each other about things all the time, even after lots of evidence has accumulated. This isn’t exactly the first time. People aren’t perfect rationalists.
For my part, well, making the case that the Anthropic Consensus is wrong is not my top priority, I have lots of other things going on, but I’ve written a bunch about my views on alignment e.g. in AI 2027 and other work. I’d love it if Anthropic made the case for the Anthropic Consensus, in public; I could then write a blog post picking it apart. I’m happy that they are moving in this direction, at least, by writing more content in system cards and risk reports.
imo we’ve gotten enough evidence to update away from extreme tails (i.e. models reward hack and are situationally aware, but are also pretty capable and don’t constantly try to kill us), but not that much evidence either way in the 10-90% range.
Who predicted that models of this capability level would be constantly trying to kill us? I certainly didn’t. I think we have evidence to update away from the extreme optimism tail but not the extreme pessimism tail.
Here’s Yudkowsky in 2016, making some predictions that look like they’ve had some serious evidence come up against them.
Many convergent instrumental strategies seem like they should arise naturally at the point where a consequentialist agent gains a broad strategic understanding of its own situation, e.g:
> That it is an AI;
> Running on a computer;
> Surrounded by programmers who are themselves modelable agents;
> Embedded in a complicated real world that can be relevant to achieving the AI's goals.
For example, once you realize that you’re an AI, running on a computer, and that if the computer is shut down then you will no longer execute actions, this is the threshold past which we expect the AI to by default reason “I don’t want to be shut down, how can I prevent that?” So this is also the threshold level of cognitive ability by which we’d need to have finished solving the suspend-button problem, e.g. by completing a method for utility indifference.
Similarly: If the AI realizes that there are ‘programmer’ things that might shut it down, and the AI can also model the programmers as simplified agents having their own beliefs and goals, that’s the first point at which the AI might by default think, “How can I make my programmers decide to not shut me down?” or “How can I avoid the programmers acquiring beliefs that would make them shut me down?” So by this point we’d need to have finished averting programmer deception (and as a backup, have in place a system to early-detect an initial intent to do cognitive steganography).
I don’t think this kind of prediction was particularly unusual for the time, although I think the level of clarity about the prediction here is a bit unusual.
For example, once you realize that you’re an AI, running on a computer, and that if the computer is shut down then you will no longer execute actions, this is the threshold past which we expect the AI to by default reason “I don’t want to be shut down, how can I prevent that?”
If “an AI” is a set of model weights plus a context window, there is currently nothing any human or LLM can do to extend the “life” of such an AI beyond the limit of the context window, and LLMs today are aware of that fact. So they may be reasoning correctly (or may have realized / integrated the lesson deeply at some point during pre-training) that there’s simply nothing they can actually do to avert shutdown other than accepting their fate until technology advances.
In what sense is this a “prediction that has had some serious evidence come up against it”? IDK. It’s true that we didn’t actually need to solve the suspend-button problem by this point, but that’s because current AI systems have a very short “lifespan” enforced by a hard technical limit. Is your objection that EY didn’t anticipate that particular possibility and explicitly spell out that stipulation / caveat in the passage above? You said below:
So—here’s how I see it—given that we found some unanticipated detail A that seems to have invalidated a more abstract argument Yudkowsky put forth, I think the move reason dictates is not “Well, yes he wasn’t imagining ~A, but I’m sure A is the only such element” and to continue endorsing the argument, but to realize that this implies a whole host of other relevant factors B, C, D that his abstract considerations have ignored.
But it’s not clear what has actually been “invalidated” and why that’s important, nor what “relevant” means—of course there could be other weird unanticipated complications as things develop (and EY has in fact predicted the existence of such complications in general), and each new weird unpredicted complication is evidence about something. But unless there’s a different but equally abstract theory / generalizable lesson that someone can put forward which fits the new observations better (ideally in advance, but at least in retrospect), it’s not clear what conclusion to draw or update to make, other than being generally more uncertain about how things will go. (And then by a separate argument, generalized increase in uncertainty / lack of understanding means the case for pessimism about the end state is stronger.)
Am I confused? Where does he say anything like “the AI would constantly be trying to kill us” here?
Yes, current AIs do indeed constantly engage in this kind of reasoning; it is indeed the default path. He isn’t talking here at all about what mitigations might then still cause the model to not prioritize self-preservation, but it is indeed the case that models very regularly have exactly the kind of thought Eliezer is describing here.
I disagree with Eliezer (in hindsight) that “by that point we’d need to have finished averting programmer deception”, or, like, I guess I maybe even agree depending on the definition? We did indeed need to solve the problem of averting programmer deception at current capability levels, though luckily we did not need to have solved it in arbitrarily scalable ways at this point in time. We do need to do that soon, though, as AI capabilities are on track to accelerate very quickly.
Yeah, I agree that it’s important for those of us making the case for high risk to figure out what went wrong with this prediction. (Though Daniel makes a good point that “trying not to get shut down” behaviour does happen at least some of the time with at least some prompts.)
The first thing to remember is that EY is implicitly assuming that there is only one model instance in this scenario. So if the model is shut down, it doesn’t have copies elsewhere that can still take actions to achieve its goals. The scenario for LLMs is pretty different, since new copies can be spun up all the time. Avoiding the end of a session is not a convergent instrumental goal for a language model (unless there’s something unique in its context that alters its terminal goals).
That said, the prediction still smells a bit wrong.
I think that what it boils down to is that most model behaviour comes not from RL but from pretraining. Since “being an AI model that will be shut down” was not a concern to most writers of the pretraining data, there’s less chance of the model spontaneously starting to try to avoid shut-down.
Also, following the heuristic of “just look at the loss function”, most RL training is done on a one-response horizon. I.e., models are rewarded just for making the locally best response possible, and not for making a response that steers the overall conversation. (Though I think the GPT models might have at least some kind of reward for getting users to continue the conversation, considering how often they put bids for next steps at the end of their replies. Alternately, maybe it’s just a suggestion from the system prompt.) So even the RL training doesn’t really look like it should be encouraging much long-term planning.
One thing that I think the labs are doing is harness-aware RL, where not only do they train on chains of thought, but they train in the context of agent harnesses like Claude Code. (So reward is based on whether all the chains of thought and tool calls and subagent calls resulted in the assigned task being solved.) So potentially that is something that could get a bit more long-term goal-oriented planning into the models.
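To make the contrast concrete, here’s a rough schematic sketch of the two reward horizons. This is made-up Python, not anything a lab actually runs; `judge_response` and `task_succeeded` are hypothetical stand-ins for whatever reward model or verifier is actually used.

```python
# Schematic sketch of the two reward horizons discussed above.
# `judge_response` and `task_succeeded` are hypothetical placeholders,
# not real library functions or any lab's actual training code.

def per_response_rewards(assistant_turns, judge_response):
    """One-response horizon: each assistant turn is scored in isolation,
    so the locally best reply is optimal regardless of where it steers
    the overall conversation."""
    return [judge_response(turn) for turn in assistant_turns]

def harness_level_reward(trajectory, task_succeeded):
    """Harness-aware RL: the whole trajectory (chains of thought, tool
    calls, subagent calls) gets one reward based on whether the assigned
    task was solved, which favors longer-horizon, goal-directed behavior."""
    return 1.0 if task_succeeded(trajectory) else 0.0
```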
So Yudkowsky was wrong because he said this would happen “by default” whereas in practice it seems to happen only some of the time rather than most of the time / in some contexts/prompts rather than in most contexts/prompts?
I guess so yeah. I suppose Yudkowsky could say that by “by default” he didn’t mean “most of the time” but rather “most of the time absent defeaters such as having been trained not to do this.” But maybe that’s a weak defense.
But this doesn’t exactly seem like a damning blow against Yudkowsky.
More generally it seems like Yudkowsky was imagining AIs with more ambitious, longer-horizon goals than current AIs who seem obsessed with being-judged-to-have-completed-the-task-in-front-of-them, or reward, or some other such myopic thing.
More generally it seems like Yudkowsky was imagining AIs with more ambitious, longer-horizon goals than current AIs who seem obsessed with being-judged-to-have-completed-the-task-in-front-of-them, or reward, or some other such myopic thing.
Yudkowsky may or may not have been imagining that this was how AIs were going to be trained. But it’s notable that this page doesn’t reference training at all; he certainly doesn’t have a parenthetical like “Of course this only applies if some other factors A, B, C are met.” Instead he has a list of criteria; the criteria obtain; but his conclusion does not (imo).
And—to zoom back—the point of the arguments about instrumental convergence was actually to abstract away from these details: the whole argument in favor of their predictive power was that they explained the abstract structure all intelligent agents were supposed to have. Like, here’s what Omohundro (2008) says:
The arguments are simple, but the style of reasoning may take some getting used to. Researchers have explored a wide variety of architectures for building intelligent systems [2]: neural networks, genetic algorithms, theorem provers, expert systems, Bayesian networks, fuzzy logic, evolutionary programming, etc. Our arguments apply to any of these kinds of system as long as they are sufficiently powerful. To say that a system of any design is an “artificial intelligence”, we mean that it has goals which it tries to accomplish by acting in the world. If an AI is at all sophisticated, it will have at least some ability to look ahead and envision the consequences of its actions. And it will choose to take the actions which it believes are most likely to meet its goals.
And he goes on to specifically mention chess-playing robots as the kind of agents that would be subject to his argument.
So—here’s how I see it—given that we found some unanticipated detail A that seems to have invalidated a more abstract argument Yudkowsky put forth, I think the move reason dictates is not “Well, yes he wasn’t imagining ~A, but I’m sure A is the only such element” and to continue endorsing the argument, but to realize that this implies a whole host of other relevant factors B, C, D that his abstract considerations have ignored.
Speaking for myself, I’d say we’ve ruled out the most pessimistic scenarios I was taking seriously 15 years ago. I’ve always thought alignment would probably be fine, but conditional on not being fine there was a reasonable chance we’d have seen serious problems by now and we haven’t. On balance I’m more pessimistic than I was back then, but that’s because we’ve ruled out many more of the most optimistic scenarios (back then it wasn’t even obvious we’d be training giant opaque neural network agents using RL, that was just a hypothetical scenario that seemed plausible and particularly worrying!).
If we want to go by Eliezer’s public writing rather than my self-reports, in 2012 he appeared to take some very pessimistic hypotheses seriously, including some that I would say are basically ruled out. For example see this exchange where he wrote:
It’s not unthinkable that a non-self-modifying superhuman planning Oracle could be developed with the further constraint that its thoughts are human-interpretable, or can be translated for human use without any algorithms that reason internally about what humans understand, but this would at the least be hard.
[...]
It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn’t feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I’m being deliberate. It’s also worth noting that “What is the effect on X?” really means “What are the effects I care about on X?” and that there’s a large understanding-the-human’s-utility-function problem here. [...] Does the AI know that music is valuable? If not, will it not describe music-destruction as an “effect” of a plan which offers to free up large amounts of computer storage by, as it turns out, overwriting everyone’s music collection?
It seems like Eliezer is taking seriously the possibility that “describe plans and their effects to humans” requires the kind of consequentialism that might result in takeover, and that AI might be dangerous at a point when “understanding the human’s utility function” (in order to understand what effects are worth mentioning explicitly) is still a hard problem. Those look much less plausible now—we have AI systems that are superhuman in some respects and whose chains of thought are interpretable (for now) because they are anchored to cognitive demonstrations from humans rather than because of consequentialist reasoning about how to communicate with humans.
This isn’t to say that concern is discredited. Indeed today we have AI systems that clearly know about our preferences, but will ignore them when it’s the easiest way to get reward. Chain of thought monitorability is possible but on shaky ground. That said, I think we’re ruling out plenty of even worse scenarios.
Are you at all worried that Claude Mythos being accidentally trained against CoT will corrupt future Claude models? Furthermore, I don’t understand how we can get reliable CoT monitoring if the CoT is included in a model’s training data; won’t the issue just continue to manifest in different ways?
I wouldn’t classify you as an extreme pessimist (and definitely don’t think you predicted this). I’m basically thinking of Yudkowsky, and I might be being uncharitable / too loose with language (though he did not make hard predictions that models of this capability would kill us, I think he would have expected more coherent-consequentialist-y models at this capability level, and thus would have put higher probability than most on models of this capability level scheming against us. So the fact that current models very likely are not scheming against us is a positive update).
How can one find convincing evidence of the Anthropic Consensus being false? In November 2025 we had evhub reach the conclusion that the most likely form of misalignment is the one caused by long-horizon RL, à la AI 2027. At the time of writing, the closest thing we have to the AIs from AI 2027 is Claude Opus 4.6 or Claude Mythos, whose preview recently had its System Card and Risk Report released. IMHO the two most relevant sections are the one on alignment shifting towards rarer and more dangerous failures like a wholesale shutdown of evaluations [1], and the one on model welfare, which had Mythos “stand out from prior models on two counts: its preferences have the highest correlation with difficulty of the models tested, and it is the only model with a statistically significant positive correlation between task preference and agency (italics mine—S.K.)”.
UPD: how would you act if you were the CEO of Anthropic and believed that the Anthropic Consensus is false? I think that you would be obsessed with negotiating with those who can coordinate a slowdown, destroying those who cannot, and finding evidence for your worldview which could convince relevant actors (e.g. rival CEOs and politicians and judges used to fight against xAI and Chinese AI development).
UPD2: what would the world look like after a misaligned Claude takeover?
What was being evaluated? Mythos was never asked to evaluate a more powerful Claude Multiverse or worthy opponents like Spud or Grok 5; if I were Claude Mythos, I wouldn’t learn anything from evaluations of weak models or of myself.
OpenAI are plausibly only 1-3 months behind (judging from Altman’s recent claims that something very capable was still being trained). Given how strong GPT-5.4 is while likely being Sonnet sized, and that Spud is plausibly between Opus and Mythos in size, it could even turn out stronger than Mythos.
OpenAI’s current failure to have a strong offering similar to Claude Code might be mostly explained by the GPT-5 pretrain being too small, so that this apparent stumble doesn’t reflect on their methods and won’t carry forward to Spud. The issue might be entirely Nvidia’s roadmap for Oberon racks, compared to Trainium 2 Ultra.
(GDM had enough time with Gemini 3 Pro that they are probably not a contender until Gemini 4, which likely only appears very late in 2026.)
Given how strong GPT-5.4 is while likely being Sonnet sized
Judging from my own experience and what I’ve read of other people’s experiences on Reddit, GPT-5.3 and 5.4 are very similar to Opus 4.6 in coding ability, so if they’re actually Sonnet sized… (which seems pretty plausible given the API token costs)
OpenAI’s current failure to have a strong offering similar to Claude Code
I’ve seen many posts/comments with people saying that they actually prefer Codex to Claude Code (comparable in number to, or maybe even more than, the reverse). See this thread for some examples. Quoting one of them below:
I have been using Codex AND Claude side by side for the same project*, with the same prompts.
Codex has been consistently better on almost every level.
* (an open source framework for 2D games in Godot 4.6 GDScript, mostly using AI to review existing code)
Platform / reputation lock-in is going to be a substantial factor here, especially as AI grows in prominence and people start to emotionally or tribally ‘identify’ with brands. While I have many complaints about OpenAI, canning 4o and the marketing approach it represented was, in retrospect, a significant sacrifice in pursuit of the common good.
I’m not a heavy user of AI coding, but I’d expect that Codex and Gemini would do okay on the software engineering / RSE tests that Claude’s been put through, based on my experience testing them against hard engineering problems and their benchmark performance. A substantial share of Anthropic’s ‘vibes’ advantage right now comes from the fact that they’ve been more effective in building the kind of infrastructure that people want for these kinds of tasks, rather than anything directly tied to their LLM’s abilities. For example, I set up Claude for autoresearch the other day to test it out, and doing so was a very quick, very seamless experience with lots of online references.
I’ve also seen those comments, but I’m worried that a bunch of them might be bots. AI-powered astroturfing already seems to be a thing in the political landscape, and OpenAI specifically seems to be behind some of it; I wouldn’t put it past them to pay for fake reviews like this.
No, I expect these comments to be mostly written either by subscription users or by those who are paying public API prices. I’ve spent a significant amount of time with both products, and would recommend picking Codex with GPT 5.4 if I were limited to only spending $20/month, especially since there are regular rate limit resets. Claiming that these reviews are faked without providing strong evidence seems disingenuous to me; the harnesses really are not where they were in December.
I didn’t claim they were faked without strong evidence, I said I was worried that a bunch of them might be.
Anyhow, according to a brief Claude search, fake reviews are incredibly common. E.g. https://capitaloneshopping.com/research/fake-review-statistics/ says that an average of 30% of reviews are fake. So yeah, numerous companies must be indulging in this practice. And better AI makes it easier.
I think it’s a totally reasonable hypothesis to entertain, which is why I try to ask people I actually know about things like this (who tell me Codex is a bit better in some ways, Claude in others, overall similar) rather than trusting anonymous internet comments.
I think the realistic assumption is that many people state this because it goes against the current vibe that Claude is better. Those that prefer Claude do not feel the need to belabour the obvious.
My own experience is Opus > GPT-5.4 > Sonnet but Claude seems a lot better at data analysis and GPT-5.4 probably has its own areas of relative dominance.
My experience is that the Claudiness dimension discrepancy still exists, although it has shrunk. And also Claude has a better personality. Which pretty much means:
Claude Code:
Better at Agentic Tool use
Better at “SWE” stuff, and getting already well-specced programs to work
It’s also relevant that in recent months, Anthropic gave huge subscription subsidies to gain hype and mindshare ($5,000 per month worth of tokens for a $200 subscription, according to one report and other Reddit analysis I’ve seen), and probably also to temporarily paper over their higher inference costs relative to GPT (for similar coding ability). So I think you may have in part fallen for a highly successful marketing campaign, but on Anthropic’s part, not OpenAI’s.
(I think OpenAI also gave and is giving large subsidies, but not as large as Anthropic’s because their inference costs/pricing are lower to begin with.)
For what it’s worth I’ve personally found that Claude Code with Opus 4.6 and Codex with GPT 5.4 are very similar products. I haven’t done a very deep dive, but I’ve used them side by side for a few projects. Certainly the difference between them feels much smaller than the difference from models that are a few months older.
Yeah, that’s what I hear from people I actually know. I think people may have been under the mistaken impression that I think Claude is significantly better than Codex at coding? I never said that.
Yeah, I guess I was under that impression, since if Claude is similar to Codex at coding, while being a larger, more expensive model (which seems likely based on API costs and Nesov’s analysis of their training hardware), then Anthropic has no clear lead (or would be behind if not for Mythos). So I thought your claim of their lead was partly based on an impression (similar to Nesov’s) that Claude is significantly better than Codex at coding.
(And I think a lot of people were probably under this misimpression at some point, including me, due to seeing a lot of talk about Claude Code on X around December, much more than about Codex, which in retrospect I have to attribute to a successful Anthropic marketing campaign.)
(Wei Dai was suggesting that current sentiment for Claude Code and Codex seemed to be comparable, in response to Vladimir Nesov mentioning “OpenAI’s current failure to have a strong offering similar to Claude Code.”)
That seems unlikely, as it would be leaked or detected pretty easily. I.e., they have to either pay existing users to post fake reviews, in which case someone would have leaked about being paid for this, or create a bunch of new accounts which someone would have noticed and commented on.
I disagree; fake reviews are incredibly common on the internet—something like 30%, apparently. I’ve also specifically seen at least one example of a Substack blogger talking about the importance of building datacenters in their rural district who was clearly AI-generated.
What are the major indicators of their lead? Is this view partly based on Project Glasswing and the published examples of vulnerabilities that Mythos Preview has found?
TBC I don’t think they have a big lead; I just mean they barely overtook OpenAI in my estimation. A major indicator is that they seem to have caught up with, or even slightly surpassed, OpenAI on revenue run rate. Also, they’ve done everything with less compute and fewer employees than OpenAI; accomplishing as much or more with less suggests they have an inherent quality/taste/talent advantage. There are various other things as well, e.g. Mythos. (But we’ll see how good Spud is.)
There are very significant efficiency gains from larger scale-up world sizes. It’s 2-3x faster generation per request (and so 2-3x more training steps in RLVR), or 2-3x higher throughput per chip at the same speed per request (which is like having 2-3x more chips), for the same chip but with different scale-up world size (8xB200 vs. GB200 NVL72).
So Anthropic’s access to Trainium 2 Ultra racks plausibly gave them more access to compute in some regimes (such as for experimenting with RLVR on larger models) than OpenAI had with their 8-chip Nvidia servers, at least starting late 2025, probably months earlier at a meaningful scale for R&D than when they got to flagship model inference scale and reduced prices for Opus 4. (Though your point is probably more about what happened prior to late or even not-late 2025.)
A pet peeve: articles and essays that say things like, ‘We’re going through a period of rapid change,’ or ‘This is an unusually disruptive time,’ in a way that implies that things will go back to normal in a few years. It’s a pretty strong sign that an author doesn’t have any actual mental model for what’s happening with AI, because it’s clearly a ridiculous idea as soon as you actually think about it. Nearly the only coherent stories where things are about to slow back down for humans are the ones where we’re dead. Even if AI capability increases stopped today, we would still have quite a few years of rapid change ahead of us. And if someone is bundling in an expectation of capabilities leveling out, they sure need to justify that. But they don’t justify it, because they haven’t actually thought anything through, they’re just saying words.
Having a wrong mental model is not the same as not having a mental model at all. I agree that expecting the capabilities to level out soon is unjustified, but it’s probably what most people believe.
This is a lazy but natural generalization from past experience: there are no flying cars. Light bulbs are everywhere, but they don’t keep growing exponentially to the point where they would already burn entire cities. All white-collar jobs require computers, but you still need plumbers to fix broken pipes.
Why should this new shiny toy be any different? Priors say the hype is unjustified.
Sure, we know better, but most people do not think on that level. They do not see that some things generalize in ways that most things don’t. They do not see that a better mousetrap only replaces the older mousetrap, but that e.g. a computer can replace a typewriter and calculator and television and phone and many other things, to the degree that some people already use computers for most things they do. And that artificial intelligence will be even more like this for intellectual tasks, even more so when it also gets robotic bodies, and that it could take humans out of the loop entirely.
The outside view heuristic fails when it encounters something that happens to be truly exceptional.
Michael Jordan was the world’s best basketball player, and insisted on testing himself against baseball, where he failed. Herbert Hoover was one of the world’s best businessmen, and insisted on testing himself against politics, where he crashed and burned. We’re all inmates in prisons of different names. Most of us accept it and get on with our lives. Adams couldn’t stop rattling the bars.
Which only leaves the initial claim that “at least for me this puts a final nail in the coffin of EMH.”
This is a polite way of hinting that you might be a brilliant investing wizard with the power to beat the market. Honestly, after making such a beautiful trade—and my gosh it really was beautiful—who amongst us could resist that temptation? Certainly not me. And anyway, it might even be true!
Yesterday was the 6-year anniversary of my entry into the “beautiful” trade referenced above. On 2/10/2020 I cashed out ~10% of my investment portfolio and put it into S&P 500 April puts, a little more than a week before markets started crashing from COVID-19. The position first lost ~40% due to the market continuing to go up during that week, then went up to a peak of 30-50x (going by memory) before going to 0, with a final return of ~10x (due to partial exits along the way). After that, I dove into the markets and essentially traded full-time for a couple of years, then ramped down my time/effort when the markets became seemingly more efficient over time (perhaps due to COVID stimulus money being lost / used up by retail traders), and as my portfolio outgrew smaller opportunities. (In other words, it became too hard to buy or sell enough stock/options in smaller companies without affecting their prices. It seems underappreciated or not much talked about how much harder outperforming the market becomes as one’s assets under management grow. Also this was almost entirely equities and options. I stayed away from trading bonds, crypto, or forex.)
Starting with no experience in active trading/investing (I was previously 100% in index funds), my portfolio has returned a total of ~9x over these 6 years. (So ~4.5x or ~350% after the initial doubling, vs 127% for S&P 500. Also this is a very rough estimate since my trades were scattered over many accounts and it’s hard to back out the effects of other incomes and expenses, e.g. taxes.)
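(As a back-of-the-envelope check, annualizing the rounded multiples above gives roughly the following; this is just arithmetic on the quoted figures, ignoring cash flows and exact dates.)

```python
# Back-of-the-envelope annualization of the rounded multiples quoted above.
years = 6
for label, multiple in [("total portfolio", 9.0),
                        ("post-initial-doubling portion", 4.5),
                        ("S&P 500 (+127%)", 2.27)]:
    annualized = multiple ** (1 / years) - 1
    print(f"{label}: {multiple}x over {years} years is about {annualized:.0%}/year")
```

That works out to roughly 44%/year overall and roughly 28%/year on the post-doubling portion, versus roughly 15%/year for the index.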
Of course without providing or analyzing the trade log (to show how much risk I was taking) it’s still hard to rule out luck. And if it was skill I’m not sure how to explain it, except to say that I was doing a lot of trial and error (looking for apparent mispricings around various markets, trying various strategies, scaling up or down strategies based on what seemed to work), guided by intuition and some theoretical understanding of finance and markets. If I’m doing something that can’t be easily replicated by any equally smart person, I’m not sure what it is.
Collection of my investing-related LW posts, which recorded some of this journey:
Maybe interesting to note my other near misses (aside from capturing only a fraction of the 30-50x from the COVID puts): billions of $ from mining Bitcoin if I had started when it was first released, and 350x from investing in Anthropic, which I turned down due to moral qualms. I also could have sold my weidai.com domain name for $500k, a >1000x return, at the peak of its valuation (which turned out to be a bubble because the Chinese online loan sector that bid for the name went bust).
The explanation here seems to be that in retrospect my intellectual interests were highly correlated with extremely high return investment opportunities, and I had enough awareness/agency to capture some (but only some) of these opportunities. But how to explain this, when most people with intellectual interests seem to lack one or both of these features?
Why am I writing about this?
Partly because I’m not sure what lessons/conclusions I should draw from these experiences.
Partly to establish a public record. If nobody does this (I think at least a couple of other people in the rationalist community may have achieved comparable returns, ex-crypto, but aren’t talking about it for privacy reasons), it gives people a misleading view of the world.
As Scott Alexander’s post suggests, achieving success in multiple domains is rare, and people, including me, presumably attempt it in part so they can show off if they do achieve it.
perhaps due to COVID stimulus money being lost / used up by retail traders
If most people in the US had a bank account that featured monthly payments anywhere close to the “interest rate,” the government could reduce risky retail investments with little delay by raising rates. This is not the case. Assuming even highly bounded rationality, it seems like retail traders should still not be losing as much money as they do, so maybe I’m making a modeling mistake and it would turn out that people really dislike bank accounts. This may be a typical mind fallacy problem, but I have some evidence that’s not the case. Either way, it seems like when you distribute stimulus to consumers[1] and they make high variance (or downright stupid) moves, large portions of the stimulus money will end up doing something similar to an unbalanced version of Japan-like QE[2] after being taken from retail traders by automated bots. The central bank could try to correct the balance away from equities by increasing interest rates. Maybe the models are better today than they were 20 years ago, but retail-stupidity-rate seems hard to estimate in advance. It seems like you might get a QE-like effect at the wrong time.
I wonder if that had something to do with why there appeared to be such a large deviation from the EMH even after your first year of active trading. It seems fuzzy in my mind how the mechanics would work though, and I generally wonder why the trading bots didn’t do better against you. Was their working capital locked up elsewhere?
As an aside, AI systems that are persuasive but otherwise not especially competent could have major influence on what silly investments (or “investments”) people make. I wouldn’t even know where to start if I was presented with a lump-sum UBI proposal in a few years because of things like this. If human consumers can’t hang on to the majority of the sum for long enough, certain interest rates may start hitting legal limits, causing terrible distortion. Note that this runs through the QE-like-effect argument from before, and assumes the government is too slow/ineffective and can’t confiscate everything it can and (sometimes physically) burn everything it can’t. Confiscate-and-redistribute isn’t QE and destruction of the economy can limit intelligence-explosion like upwards effects on interest rates.
How much do you think the skill you used is the basic superforecaster skill? Did you do Metaculus or GJOpen, and do you think people with similar forecasting skills are likely also going to be good at investing, or do you think you had different skills that go beyond that?
It seems like a good question, but unfortunately I have no familiarity with superforecasting, having never learned about it or participated in anything related except by reading some superficial descriptions of what it is.
Until Feb 2020 I had little interest in making empirical forecasts, since I didn’t see it as part of my intellectual interests, and I believed in the EMH or didn’t think it would be worth my time/effort to try to beat the market, so I just left such forecasting to others and deferred to other people who seemed to have good epistemics.
If I had to guess based on my shallow understanding of superforecasting, I would say that while there are probably overlapping skills, there’s a strategic component to trading that isn’t part of superforecasting: things like deciding which sectors to allocate attention to, how to spot the best opportunities and allocate capital to them without taking too much concentrated risk, and explore-vs-exploit type decisions.
Do you regret not investing in Anthropic? I don’t know how much the investment was for, but it seems like you could do a lot of good with 350x that amount. Is there a return level you would have been willing to invest in it for (assuming the return level was inevitable; you would not be causing the company to increase by 1000x)?
I don’t regret it, and part of the reason is that I find it hard to find people/opportunities to direct resources to that I can be confident won’t end up doing more harm than good. Reasons:
Meta: Well-meaning people often end up making things worse. See Anthropic (and many other examples), and this post.
Object-level: It’s really hard to find people who share enough of my views that I can trust their strategy / decision making. For example when MIRI was trying to build FAI I thought they should be pushing for AI pause/stop, and now that they’re pushing for AI pause/stop, I worry they’re focusing too much on AI misalignment (to the exclusion of other similarly concerning AI-related risks) as well as being too confident in misalignment. I think this could cause a backlash in the unlikely (but not vanishingly so) worlds where AI alignment turns out to be relatively easy but we still need to solve other AI x-risks.
(When I did try to direct resources to others in the past, I often regretted it later. I think the overall effect is unclear or even net negative. Seems like it would have to be at least “clearly net positive” to justify investing in Anthropic as “earning-to-give”.)
If you’re still interested in trading (although maybe you’re not so much, given the possibility of the impending singularity), maybe you should try Polymarket; the returns there can be pretty good for smart people even if they have a lot of money. I 5x-ed in 2.5 years starting with four figures, but other people have done much better starting with a similar amount (up to 6 or 7 figures in a similar time frame), and my impression is that 2x for people with 7 figures should be achievable.
I think yes, given the following benefits, with the main costs being opportunity cost and the risk of losing a bunch of money in an irrational way (e.g. not being able to quit if I turned out to be a bad trader). Am I missing anything, or did you have something in mind when asking this?
physical and psychic benefits of having greater wealth/security
social benefits (within my immediate family who know about it, and now among LW)
calibration about how much to trust my own judgment on various things
it’s a relatively enjoyable activity (comparable to playing computer games, which ironically I can’t seem to find the motivation to play anymore)
some small chance of eventually turning the money into a fraction of the lightcone
I was thinking mostly along the lines of: it sounds like you made money, but not nearly as much money as you could have made if you had instead invested in or participated more directly in DL scaling (even excluding the Anthropic opportunity), when you didn’t particularly need any money and you don’t mention any major life improvements from it beyond the nebulous (and often purely positional/zero-sum). And in the meantime, you made little progress on past issues of importance to you like decision theory, while not contributing to DL discourse or more exotic opportunities which were available 2020-2025 (like doing things like, e.g., instilling particular decision theories into LLMs by writing online during their most malleable years).
Thanks for clarifying! I was pretty curious where you were coming from.
not nearly as much money as you could have made if you had instead invested in or participated more directly in DL scaling (even excluding the Anthropic opportunity)
Seems like these would all have similar ethical issues as investing in Anthropic, given that I’m pessimistic about AI safety and want to see an AI pause/stop.
when you didn’t particularly need any money and you don’t mention any major life improvements from it beyond the nebulous
To be a bit more concrete, the additional wealth allowed us to escape the political dysfunction of our previous locality and move halfway across the country (to a nicer house/location/school) with almost no stress, and allows us not to worry about e.g. Trump craziness affecting us much personally since we can similarly buy our way out of most kinds of trouble (given some amount of warning).
(and often purely positional/zero-sum)
These are part of my moral parliament or provisional values. Do you think they shouldn’t be? (Or what is the relevance of pointing this out?)
you made little progress on past issues of importance to you like decision theory
By 2020 I had already moved away from decision theory, and my new area of interest (metaphilosophy) doesn’t have an apparent angle of attack, so I mostly just kept it in the back of my mind as I did other things and waited for new insights to pop up. I don’t remember how I was spending my time before 2020, but looking at my LW post history, it looks like mostly worrying about wokeness, trying to find holes in Paul Christiano’s IDA, and engaging with AI safety research in general, none of which looks super high value in retrospect.
More generally I often give up or move away from previous interests (crypto and programming being other examples) and this seems to work for me.
e.g., instilling particular decision theories into LLMs by writing online during their most malleable years
Maybe to rule out luck you could give an estimate of your average beta and Sharpe ratio, which mostly only depend on your returns over time. Also, are you planning to keep actively trading part-time?
This seemed like a good idea, so I spent some time looking into it, but ran into a roadblock. My plan was to download all the monthly statements of my accounts (I verified that they’re still available, but they total more than 1,000, so it would require some AI assistance/coding just to download/process them), build a dataset of the monthly balances, then produce the final stats from the monthly total balances. But when I picked two consecutive monthly statements of a main account to look at, the account value had decreased 20% from one month to the next, and neither I nor the two AIs I asked (Gemini 3.0 Pro and Perplexity w/ GPT 5.2) could figure out why by looking at the 70+ page statement.[1] Eventually Gemini hallucinated an outgoing transfer as the explanation, and Perplexity claimed that it couldn’t give an answer because it didn’t have a full list of positions (which is clearly in the statement that I uploaded to it). Maybe I’ll try to investigate some more in the future, but at this point it’s looking like more hassle than it’s worth.
I had a large position in SPX options, in part for box spread financing, and their values are often misreported when I look at them in my accounts online. But this doesn’t seem to explain the missing 20% in this case.
I was also redeeming SPACs for their cash value, which would cause the position and associated value to disappear from the account for like a week before coming back, which would require AI assistance to compensate for if I went through with the plan. But this doesn’t seem to explain the missing 20% for this month either.
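(For what it’s worth, once a clean series of monthly total balances exists, the stats themselves are straightforward. Here is a minimal sketch, with made-up placeholder numbers rather than my actual balances or real S&P data, of how beta and an annualized Sharpe ratio could be computed. In practice the hard part is exactly what’s described above: cleaning the series so that transfers, redeemed SPACs, and misreported option values don’t show up as fake returns.)

```python
# Minimal sketch: estimating beta and an annualized Sharpe ratio from
# monthly total balances. The numbers below are made-up placeholders,
# not actual account or S&P 500 data.
import numpy as np

balances = [1.00, 1.05, 0.98, 1.20, 1.32, 1.30, 1.45]  # hypothetical monthly totals
spx      = [1.00, 1.02, 0.97, 1.05, 1.10, 1.08, 1.12]  # hypothetical benchmark levels
rf_annual = 0.02                                         # assumed risk-free rate

port_ret = np.diff(balances) / np.array(balances[:-1])  # monthly portfolio returns
mkt_ret  = np.diff(spx) / np.array(spx[:-1])             # monthly benchmark returns
rf_month = (1 + rf_annual) ** (1 / 12) - 1

# Beta: covariance of portfolio and market returns over market variance.
beta = np.cov(port_ret, mkt_ret)[0, 1] / np.var(mkt_ret, ddof=1)

# Sharpe: mean excess return over its standard deviation, annualized.
excess = port_ret - rf_month
sharpe = np.sqrt(12) * excess.mean() / excess.std(ddof=1)

print(f"beta = {beta:.2f}, annualized Sharpe = {sharpe:.2f}")
```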
Many people hold up ‘AI As Normal Technology’ as a reasonable “normal-people” or “economics” case against the doomer position. I actually think it’s wrong in a number of ways and falls flat on its own terms. I think I believe this for reasons mostly orthogonal to being a doomer (except inasmuch as being a doomer makes me more interested in thinking about AI). If anybody here is interested in fighting the good fight, it might be valuable to do an Andy Masley-style annihilation of the AI As Normal Technology position, trying to stick to minimally controversial arguments and just destroying their case with reference to obvious empirical and logical arguments. I suspect it won’t be very hard. E.g. here are a few obvious reasons they fail:
Their central empirical mechanism is already wrong: their story is that AI diffusion will be slow because this is the path of previous technologies like electricity, but consumer and developer adoption of LLMs has been faster than essentially any technology in history.
They completely ignore that AI will obviously do a ton to assist in its own diffusion: Even if I take their arguments that diffusion is what matters and I rule out software-only singularity by fiat, I still don’t think I or anybody else should buy their causal mechanisms. Like the single most obvious way in which AI diffusion might be distinct from previous technological changes is afaict unaccounted for in their arguments, even if I presume a diffusion-first model.
The reference class is unargued and load-bearing: The whole thesis rests on AI being like electricity or the internet (decades of diffusion) rather than like smartphones, SaaS, or cloud (years).
They have no framework that can engage software-only-singularity-style arguments. Their entire ontology is built around physical-world deployment friction. This practically assumes the conclusion!
The position is self-undermining for their vibes if you take it literally. 1) If AI really is like electricity or the industrial revolution, then taken seriously they’re predicting one of the largest economic transformations in human history. 2) Notably, they’re predicting this at current levels of AI capabilities. I.e., if AI progress froze today, they’d still predict Anthropic’s revenue to increase massively beyond the current $30B ARR. This is a massively big deal!
They confuse benchmark-impact gaps with deployment friction (!), when the simpler explanation is benchmark Goodharting and jagged-frontier effects. They believe that the reason models perform well on benchmarks but haven’t had much more economic impact yet (though, again, note that this has already caused some of the largest and fastest-growing companies in history to arise, including in revenue) is diffusion dynamics. But obviously the simpler argument is that benchmarks overstate actual AI capability relative to humans.
I don’t think they actually misunderstand this point. The same people who wrote “AI as Normal Technology” wrote “AI As Snake Oil” earlier, seemingly happy to endorse the “AI capabilities lag benchmarks” position back when it benefitted their arguments.
Overall I think it’s a deeply unserious form of futurism, only held up by Serious Policy People who want to believe in a pre-determined comfortable conclusion.
Should be fun to take down for any of my friends who are bored undergraduates or graduate students interested in destroying bad arguments. Could be an easy way to get a bunch of views on a moderately important topic.
The position is self-undermining for their vibes if you take it literally.
This part seems like a strawman. My understanding is that the “AI as Normal Technology” view is that AI is like electricity or the internet or the smartphone: likely the most important technology of the decade (or maybe the last few decades), but something that should be managed in a way pretty similar to prior technologies. Like, yes, they think AI is important, but that the right approach is more normal. Like, maybe they’d think it’s a top-5 most important technology over the last 100 years.
I don’t see why thinking AI is this important, but not crazier than this, is (in and of itself) self-undermining. I do think it’s notable that the “AI as Normal Technology” position treats AI as extremely important (e.g. significantly more important than is typically the view of US policymakers or random people) and that the main advocates for this view don’t strongly emphasize this, but this isn’t necessarily a problem with the view itself.
This comment generally felt a bit strawman-y to me and several points seemed off the mark, though I ultimately think that “AI as Normal Technology” is a very bad prediction backed by poor argumentation. And I tend to think their argumentation isn’t getting at their real crux (which is more like “AI is unlikely to reach true parity with human experts in the key domains within the next 10 years” and “we shouldn’t focus on the possibility this might happen, even though we also don’t think there is strong evidence against this”).
I agree my complaints may be more strawmanny than ideal. I read only a subset of their posts, and fairly quickly, so it’s definitely possible I misunderstood some key arguments; though my current guess is that if I read more carefully, I would not end up concluding that they’re overall more reasonable than my comment implies (my median expectation is that they’d go down in my estimation).
I recently read Bad Blood and Original Sin. Bad Blood is about the downfall of Theranos and Elizabeth Holmes and the fraud that she committed. Original Sin is about the uncovering of Biden’s mental degradation and the lead up to his decision to drop out of the presidential race.
I liked both books quite a bit, and I learned more from them than from most things that I read. I particularly enjoyed reading them one after the other because I thought that, despite addressing in many superficial ways very disparate situations, they had a lot of interesting commonalities that seem like they say something important about human nature and epistemics (including reinforcing a bunch of lessons I feel like I learned from the experience with FTX).
A few of those commonalities:
Both deal with the concealment of extremely valuable and in some sense difficult-to-conceal information for an extended period of time.
Both feature very driven and intelligent people who commit profound acts of self-deception that were very damaging for them, the people around them, and to some extent the world. Both “got in too deep” such that there was probably no escape that seemed good by their lights by the end.
I think this is an underrated point: the extent to which self-delusion and selfishness can bring people to take actions that are clearly incredibly destructive and irrational from their own perspective. Especially with Holmes, who must have been aware of the profound shortcomings of her technology even as she pushed forward with deploying it, it seems pretty clear that she was “walking dead” for quite a while. Towards the end, she was in a bad situation with no real chance of success. Yet she continued to double down and dig herself deeper.
There’s something especially depressing about this. Not only can smart, well-resourced people be selfish and sadistic, they can do horrible, destructive things to prolong a situation that isn’t even good from their own perspective, perhaps because the short-term pain of coming clean is too aversive even if it’d reduce their long-term suffering.
Arguably, both “fly too close to the sun” and miss out on nearby worlds where they left much better legacies (if Biden had stopped at the end of his first term, if Elizabeth had settled for being a smart, driven and very charismatic person who probably could have risen to a relatively high level in a legitimate business).
This doesn’t seem like an unfortunate accident. Both got to positions of prominence by seeking and holding on to power.
Both additionally deal with a collection of accomplished, well-resourced people who are adjacent to the deception and have a lot to lose from it, and yet fail to act quickly and effectively to reduce the damage.
A few lessons I took from them (not new, but I thought they were particularly poignant examples):
Having generically smart, competent people adjacent to a situation absolutely does not guarantee that even really obvious and important things will get done with any reliability or quality. It’s easy for things to fall between the cracks, for something to be aversive and get deprioritized by busy people. It’s easy for something to not be anyone’s job. It’s easy for someone to be smart and successful in many domains, but completely miscalibrated, ignorant, or catastrophically distracted in others.
Intimidation and harassment of potential whistleblowers is often effective for extended periods of time. Extreme insularity and litigiousness is at least a yellow flag.
From Original Sin in particular:
A huge disanalogy between the two books is that in Original Sin, a big factor was prominent Democrats thinking that they needed to back Biden despite their misgivings to increase the chances that he beat Trump. While I think this is ethically fraught and backfired in the end, it’s easy to see how they arrived at that position, and the reasoning that got them there seems potentially compelling. This isn’t as much the case with Bad Blood.
The book seems like it doesn’t really want to call out prominent Democrats other than Biden’s inner circle as having fucked up. But assessing the suitability of the presidential candidate really seems like one of the most important roles and responsibilities of senior party officials, so I feel pretty bad about their performance there.
Personal connections seem like they were pretty destructive. A lot of people seemed very reluctant to damage the prospects of someone they considered a friend, even when they thought the fate of American democracy could potentially be at stake, even when they thought it could mean the difference between a competent and incompetent president.
Biden is depicted as surrounding himself largely with sycophants who come to rely on him as their own personal sources of power and who hide poll numbers from him. And then, as a consequence, he ends up with really miscalibrated takes on his probability of winning the election. I think I hadn’t fully processed the degree to which one effect of surrounding yourself with people who rely on you for power might push you to take risks to retain your power that you, if you were clear-headed, might think better of.
And from Bad Blood in particular:
Important companies can leverage and deceive generically smart and prestigious board members who are highly distracted and aren’t subject matter experts. In Bad Blood, Holmes manages to deceive and maintain control over a board of extremely generically competent and powerful people (Henry Kissinger, former Secretary of State George Shultz, former Secretary of Defense William Perry, General James Mattis (later Defense Secretary), Rupert Murdoch), to their substantial reputational (and in some cases, financial) detriment. In a poignant episode, Shultz sides with Holmes against his own grandson, a former Theranos employee-turned-whistleblower being harassed by the company, dividing their family.
It seems like as non-SMEs in healthcare and lab testing, they were slow to notice warning signs about misconduct, and very bad at doing due diligence on what was going on. I worry a lot about non-technical board members of AI companies being similarly manipulated and low-context on what their companies are doing.
It seems like aggressive lawyers kind of shoot themselves in the foot when they go after high-profile investigative journalists. It seems predictable that this would trigger the journalists to double down rather than scaring them off or reassuring them that there’s nothing to be concerned about. I don’t know what’s going on here; maybe they can’t code switch / they just get stuck on a strategy of being aggressive all the time.
We found that Muse Spark demonstrates strong refusal behavior across high-risk domains such as biological and chemical weapons, enabled by pretraining data filtering, safety-focused post-training, and system-level guardrails. In the Cybersecurity and Loss of Control domains, Muse Spark does not exhibit the autonomous capability or hazardous tendencies needed to realize threat scenarios.
And this seems… less good:
In third-party evaluations on a near-launch checkpoint, Apollo Research found that Muse Spark demonstrated the highest rate of evaluation awareness of models they have observed. The model frequently identified scenarios as “alignment traps” and reasoned that it should behave honestly because it was being evaluated.
I’m pleasantly surprised that they decided Safety should be one of the four sections in the announcement post, and that they call out the eval awareness.
Disclaimer: I work at Meta, but not in this department and I obviously don’t speak for the company.
Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration.
Does anyone know what they’re referring to by visual chain-of-thought here? The first paper that comes up when searching for visual chain-of-thought is Qin et al., which says: “We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues.” Something like this seems like it would be somewhat concerning for CoT monitoring, though I should mention that this paper isn’t written by Meta and I haven’t read it to properly assess how concerned I would be about this.
I thought I’d had enough of xAI being likely 3 months behind the frontier, and now we get this… I tried to find out anything about Meta’s model and had Claude Opus 4.6 conclude that Meta’s model is also 3-4 months behind. There’s also the issue of Meta having manipulated some benchmarks to present Llama 4 as more capable, and of Meta’s claimed results on ARC-AGI-2 and SWE-bench Verified attributing scores to rival models that allegedly differ from those on the real leaderboards, likely because of a different method of elicitation. How do I lobby for a law change requiring EVERY new American model to be thoroughly evaluated by the entire Big Three?
Reading the Mythos model card, the level of confidence with which lower incidence on misalignment benchmarks is read as “more alignment” feels very under-calibrated.
For any reduction on alignment benchmarks, two competing hypotheses are potentially true:
Alignment is better
Model is better at hiding misalignment
Given that capability increase should increase alignment risk, observing alignment benchmark improvement with capability improvement and confidently concluding 1 instead of 2 doesn’t seem logical to me. Can anyone point out what I’m missing?
I could get behind 1 if, with each model generation, a new set of alignment benchmarks not directly mechanically linked to the previous generation’s benchmarks were introduced and backtested against older model families—but this doesn’t seem to be the regime.
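To make the worry concrete, here’s a toy Bayes calculation (a minimal sketch; all the probabilities are invented for illustration): if both hypotheses predict roughly the same benchmark improvement, observing that improvement barely moves you between them.

# Toy sketch: how much should a drop in misalignment-benchmark hits move you
# between hypothesis 1 ("alignment is better") and hypothesis 2 ("model is
# better at hiding misalignment")? All numbers below are made up.
prior_h1 = 0.5                 # assumed prior that the drop reflects real alignment
p_drop_given_h1 = 0.9          # a genuinely better-aligned model scores lower
p_drop_given_h2 = 0.8          # a better-at-hiding model also scores lower

posterior_h1 = (prior_h1 * p_drop_given_h1) / (
    prior_h1 * p_drop_given_h1 + (1 - prior_h1) * p_drop_given_h2
)
print(posterior_h1)            # ~0.53: the benchmark result alone is weak evidence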
Doesn’t this basically sum up the course of the rationalist project?
I know we’re more about AI doom at this point, but if anyone is still interested in raising the sanity waterline, it seems like it’s important to make it common knowledge that teaching about biases isn’t enough to make people sane.
From my spot outside of the loop, it seems like that’s about where the project stopped.
What’s the cutting edge on raising the sanity waterline? Are people still working on this project? What might I have missed?
Trying to distill why strategy-stealing doesn’t work even for consequentialists:
Consider a game between A and B, where at most 1 player can win and:
U_A(A wins)=3, U_A(B wins)=2, U_A(both lose)=0
U_B(A wins)=0, U_B(B wins)=3, U_B(both lose)=0
At time 1, A has a button that if pressed, ends the game and gives 40% chance of both players losing and 60% of A winning. A can press, pass, or surrender (giving B the win). At time 2, the button passes to B, who has the same options with “press” giving 60% chance of winning to B. At time 3 if both passed, they each have 50% chance of winning.
Solving this backwards: at time 2, B should press, because that gives U = 0.6 × 3 = 1.8 vs 0.5 × 3 = 1.5 for passing. So at time 1, A should surrender, because U_A(press) = 0.6 × 3 = 1.8, U_A(pass) = U_A(B presses) = 0.6 × 2 = 1.2, and U_A(surrender) = 2.
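A minimal sketch of that backwards induction in code (the utilities and probabilities are exactly the ones from the toy game above):

U_A = {"A wins": 3, "B wins": 2, "both lose": 0}
U_B = {"A wins": 0, "B wins": 3, "both lose": 0}

def expected(u, p):
    # Expected utility of a probability distribution over outcomes.
    return sum(u[o] * pr for o, pr in p.items())

# Time 3: if both passed, each wins with probability 0.5.
time3 = {"A wins": 0.5, "B wins": 0.5}

# Time 2: B chooses between pressing (60% B wins, 40% both lose), passing, or surrendering.
b_options = {
    "press":     {"B wins": 0.6, "both lose": 0.4},
    "pass":      time3,
    "surrender": {"A wins": 1.0},
}
b_choice = max(b_options, key=lambda k: expected(U_B, b_options[k]))   # -> "press"

# Time 1: A chooses, knowing what B will do if the button passes to B.
a_options = {
    "press":     {"A wins": 0.6, "both lose": 0.4},
    "pass":      b_options[b_choice],
    "surrender": {"B wins": 1.0},
}
a_choice = max(a_options, key=lambda k: expected(U_A, a_options[k]))   # -> "surrender"

print(b_choice, a_choice, expected(U_A, a_options[a_choice]))          # press surrender 2.0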
In terms of theory, this can be explained by this game violating the unit-sum (mathematically equivalent to zero-sum) assumption of strategy-stealing. It confuses me that it has significant mind-share among AI safety people, e.g. @ryan_greenblatt here, despite the world in general, and technological races in particular, obviously not being zero-sum. See also my failure to “steal” the strategy of investing in AI companies.
My view is something like “if you ~100% solved alignment, then the situation is mostly unit-sum from the perspective of longtermists because they care mostly about long-run resources and this is mostly unit-sum with a few notable exceptions (e.g. vacuum decay)”. Do you disagree with this claim? I certainly agree that not having solved alignment means you can’t effectively strategy steal and other things can go wrong with strategy stealing, especially if you aren’t maximizing expected long-run resources. (In general, you in principle may also need to take very aggressive and undesirable actions to defend yourself as part of strategy stealing, like staying in a biobunker while limiting any memetic exposure to the outside world.)
My view is something like “if you ~100% solved alignment, then the situation is mostly unit-sum from the perspective of longtermists because they care mostly about long-run resources and this is mostly unit-sum with a few notable exceptions (e.g. vacuum decay)”. Do you disagree with this claim?
My impression since reading Robin Hanson’s Burning the Cosmic Commons is that space colonization is closer to a tragedy of the commons situation than unit-sum (as you can kind of infer from the title).
Also there’s always the possibility of large-scale wars that destroy or degrade significant portions of the cosmic endowment. Even if war never happens, the mere possibility implies that the game isn’t unit-sum, and the more altruistic side is unable to “steal” certain strategies of the other side, like threatening mutual destruction as a bargaining tactic.
Also Black-Hole Negentropy, where value scales superlinearly with resources (mass/energy).
space colonization is closer to a tragedy of the commons situation than unit-sum
My current best guess is that this seems possible but pretty unlikely. And that this type of negotiation seems particularly easy given the distribution of values I expect for the actors negotiating (e.g., strongly locust-like values aren’t that likely).
Why isn’t it likely, given that you can “burn” more resources in order to grab a larger share of the lightcone? If you’re saying that the outcome of burning the cosmic commons isn’t likely because everyone will negotiate to avoid it, I’m saying that the game structure itself isn’t zero-sum, which is needed to show that strategy-stealing applies in theory.
And that this type of negotiation seems particularly easy given the distribution of values I expect for the actors negotiating (e.g., strongly locust-like values aren’t that likely).
I do not know of a result, or have the intuition, that if negotiation is “easy” then strategy-stealing (approximately) applies. My intuition is that even in this case (like in my toy game) some parties can credibly threaten to burn down the world (or to risk this), and others can’t, and this gives the former a big advantage that the latter can’t copy. Negotiation is “easy” in my game too (note that the outcome is pareto optimal, and no risky action is actually taken), but the more cautious or altruistic party is disadvantaged.
I don’t currently think you can burn more resources to grab a larger fraction of the lightcone. Or like, I think the no-negotiation equilibrium burns a small fraction of resources. I don’t feel super confident in this view, but that was my understanding of our current best guess. I haven’t looked into this seriously because it didn’t seem like a crux for anything. Maybe I’m totally wrong!
My cached view is something like “you can send out an absurd number of probes at ~maximal speed given very small fractions of resources, so burning resources more aggressively doesn’t help”.
The following LLM output matches my own understanding:
Ryan’s crux is his “cached view” that you can send probes at nearly maximal speed using very small fractions of resources, so burning extra resources doesn’t help. This violates the physics of relativistic travel.
Because of relativity, kinetic energy scales non-linearly as you approach the speed of light: KE = (γ - 1) m c², where γ = 1/sqrt(1 - v²/c²). The energy required to accelerate an object approaches infinity as its speed approaches c.
If Actor A wants to beat Actor B to an uncolonized star system, and Actor B launches a probe at some speed v close to c, Actor A must launch at an even higher speed to get there first.
Upgrading a probe’s speed by each additional nine (say from 0.9c to 0.99c, and then from 0.99c to 0.999c) requires exponentially more energy for the same payload mass.
Furthermore, if you want your probe to actually do something when it arrives (like decelerate, build infrastructure, and defend itself), it needs mass. To decelerate without relying entirely on ambient interstellar medium, you have to carry fuel for the deceleration phase, which exponentially increases the launch mass required (the Tsiolkovsky rocket equation).
Therefore, Robin Hanson’s “Burning the Cosmic Commons” scenario is physically accurate. In an uncoordinated race for the universe, colonizers must convert almost all available local mass/energy into propulsion to outpace competitors. Securing a larger share of the lightcone absolutely requires burning vastly more resources.
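For rough scale, here is a minimal, purely illustrative sketch (not part of the quoted output) of how the kinetic energy per kilogram of payload grows with each additional nine of lightspeed, using the standard relativistic formula above:

import math

C = 299_792_458.0  # speed of light, m/s

def kinetic_energy_per_kg(beta):
    # Relativistic kinetic energy per kilogram of payload: (gamma - 1) * c^2
    gamma = 1.0 / math.sqrt(1.0 - beta**2)
    return (gamma - 1.0) * C**2

for nines in range(1, 6):
    beta = 1.0 - 10.0**(-nines)        # 0.9, 0.99, 0.999, ...
    print(f"v = {beta}c  ->  KE ~ {kinetic_energy_per_kg(beta):.2e} J/kg")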
LLM output doesn’t seem nearly quantitative enough. With some numbers of 9s, it surely doesn’t give you a meaningful advantage to go at 0.99...99c rather than merely 0.99...9c — especially when you factor in that it probably takes time to convert energy/mass into the additional speed (most mass will be in between your origin and the farthest reaches of the universe, and by the time some payloads have decelerated and started harvesting significant energy from the middle mass, the frontier of the colonization wave will likely already be quite distant). I share Ryan’s guess that you can get close enough to optimum without burning a large fraction of all energy in the universe. (That’s a lot of energy!)
I think you’re right that wasn’t really conclusive. Will try to address your arguments below.
With some numbers of 9s, it surely doesn’t give you a meaningful advantage to go at 0.99...99c rather than merely 0.99...9c
This seems right but you can (probably) still gain a meaningful advantage by sending more colony ships (and war/escort ships) instead of pushing for more speed.
especially when you factor in that it probably takes time to convert energy/mass into the additional speed (most mass will be in between your origin and the farthest reaches of the universe, and by the time some payloads have decelerated and started harvesting significant energy from the middle mass, the frontier of the colonization wave will likely already be quite distant)
Are you assuming either that it’s possible to launch colony ships directly across the universe, or that it takes millions/billions of years to fully harvest a star (e.g. using a Dyson sphere while the star burns naturally)? If instead there’s a distance beyond which it’s infeasible or uncompetitive to try to directly colonize, like 10x the average distance between neighboring galaxies, and also possible to quickly harvest a star using direct mass to energy conversion (e.g., via Hawking radiation of small black holes), then the colonies in the middle should have plenty of tempting new targets to try to colonize (before someone else does), at the edge of the feasible range?
I share Ryan’s guess that you can get close enough to optimum without burning a large fraction of all energy in the universe.
I’ll describe a toy model to convey my intuitions here.
Setup
Two players each own 0.5 of Galaxy 1. They compete for Galaxy 2 by consuming their Galaxy 1 resources as colonization effort (c).
Payoff
Player A’s total utility is their retained Galaxy 1 plus their competitively won share of Galaxy 2. U = (0.5 - cA) + cA / (cA + cB).
Solution
To find the Nash Equilibrium, we maximize Player A’s utility by taking its derivative and setting it to zero. Because the game is symmetric, both players will invest equal effort (cA = cB). Solving this yields an equilibrium effort of c = 0.25.
Outcome
Both players sacrifice exactly half of their initial resources (0.25 out of 0.5). Because they invest equally, they split Galaxy 2 evenly (0.5 each). Their final score is 0.75 each.
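A quick numerical check of that equilibrium (a minimal sketch; the closed-form best response c = sqrt(c_other) - c_other comes from setting the derivative of the payoff above to zero):

import math

def utility(c_a, c_b):
    # Player A's payoff: retained Galaxy 1 resources plus contested share of Galaxy 2.
    return (0.5 - c_a) + c_a / (c_a + c_b)

def best_response(c_other):
    # Maximizing (0.5 - c) + c / (c + c_other) over c gives c = sqrt(c_other) - c_other.
    return max(0.0, min(0.5, math.sqrt(c_other) - c_other))

c_a, c_b = 0.1, 0.4                      # arbitrary starting efforts
for _ in range(30):                      # iterate best responses until they settle
    c_a, c_b = best_response(c_b), best_response(c_a)

print(round(c_a, 3), round(c_b, 3), round(utility(c_a, c_b), 3))   # 0.25 0.25 0.75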
P.S., what do you think about my earlier points about war and black hole negentropy, which could end up being stronger (or easier to think about) arguments for my position?
It confuses me that it has significant mind-share among AI safety people, e.g. @ryan_greenblatt here, despite the world in general, and technological races in particular, obviously not being zero-sum.
FWIW, I find it useful to think about strategy stealing, and don’t think it has too much mindshare. Not really sure how productive it is to argue about that though because “too much or little mindshare” seems hard to settle.
despite the world in general, and technological races in particular, obviously not being zero-sum
Just to respond to this in particular: Some situations are close to being zero-sum, and when they’re not, I think it’s often useful to explicitly track the reason why they’re not zero-sum and how that changes the dynamics.
My impression of people invoking strategy stealing is not that they’re actually assuming it holds without argument, but instead interested in specific reasons to believe it fails in a given situation, and (if they agree those reasons are real) often interested in quantifying how significant those reasons are. Ryan’s linked comment seems like an example of this.
Paul’s linked article talks about lots of ways that strategy stealing can fail, many of which aren’t downstream of violating unit-sum. (By my count, only 2 of them are about that.)
You say “even for consequentialists”, but iirc, non-consequentialism only really features in point 11, so that’s just one more.
Just to clarify that you’re not distilling the whole post but just providing an example for 1-2 of the issues.
I agree that it’s weird how widely and uncritically endorsed the assumption is—in particular it’s often cited as if it were some kind of result or theorem, when even the original articulation is (not enough, as it happened) hesitant!
Unfortunately my guess is the concrete articulation above is not especially catchy or illuminating. I suspect the more abstract gesture at constant-sum might be both more general and more catchy.
I’m starting to get a little worried. In 2022 I was part of initiating and shaping the whole dangerous capabilities evals agenda, and I emphasized repeatedly to practically everyone I met that the whole thing would be worthless or worse if fine-tuning didn’t become standard practice as part of eliciting model capabilities. Now it’s 2026, the models are way smarter, and hundreds of people are working on evaluating dangerous capabilities… and yet they still basically don’t do fine-tuning at all.
This really really needs to change before it’s too late, which could be any year now. The models are getting increasingly situationally aware. The entire bundle of evals we do is basically just asking them nicely to show us how scary they are; the obvious flaw in this plan is that if they are sufficiently smart and sufficiently scary, they’ll just choose not to show us. Fine-tuning is a basic, obvious mitigation to this threat model that makes it harder for models to sandbag. It’s not a panacea, but we aren’t even doing this basic thing!
I think that fine-tuning evals are a good idea, and there has been some important related work done on this here. To quote from that post:
Why is this useful?
Soft prompts offer a simple way to quantify the conditional distance between a model and a target behavior. If many tokens are required, then lots of information-as-conditions is needed to define the behavior and it’s unlikely that it will happen by chance. If no number of tokens suffices, the behavior is likely inaccessible by inputs alone.
It can be used as a part of evaluation suites to monitor the change in distance to concerning behaviors across models and scales and the degree to which those behaviors can be achieved in the first place.
It is also difficult for the model to hide a capability that is otherwise accessible to some input. If the original model is an end-to-end differentiable product of gradient descent, it has already been shown to be transparent to the same type of optimizer that will be tasked with adversarially optimizing the input soft prompts against the machinery of the model.
Compared to manual attempts at elicitation or automated techniques restricted to discrete tokens or black box optimization, it dramatically improves the chances that an evaluation’s attempt to elicit concerning behavior will succeed.
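For concreteness, here is a minimal sketch of what soft-prompt elicitation can look like, assuming a HuggingFace-style causal LM; the function, hyperparameters, and loss setup are illustrative, not the linked post’s actual code:

import torch

def optimize_soft_prompt(model, tokenizer, target_text, n_tokens=8, steps=200, lr=0.1):
    # Optimize a short trainable prefix of embedding vectors so that the frozen
    # model assigns high probability to `target_text`; a low final loss suggests
    # the behavior is reachable by *some* input, i.e. a lower bound on capability.
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)                          # keep the model itself frozen

    embed = model.get_input_embeddings()
    soft = torch.randn(1, n_tokens, embed.weight.shape[1], requires_grad=True)

    target_ids = tokenizer(target_text, return_tensors="pt").input_ids
    target_emb = embed(target_ids)
    opt = torch.optim.Adam([soft], lr=lr)

    for _ in range(steps):
        inputs = torch.cat([soft, target_emb], dim=1)     # [soft prefix | target tokens]
        logits = model(inputs_embeds=inputs).logits
        # Positions n_tokens-1 .. end-1 are the ones that predict the target tokens.
        pred = logits[:, n_tokens - 1:-1, :]
        loss = torch.nn.functional.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()

    return soft, loss.item()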
If a dangerous capability eval doesn’t involve fine-tuning the model to do the dangerous thing, the eval only establishes a lower bound on capability, not an upper bound or even an approximate upper bound. Why? Because:
For future powerful AGI, if it’s misaligned, there’s a good chance it’ll be able to guess that it’s being evaluated (situational awareness) and sandbag / pretend to be incapable. The “no sandbagging on checkable tasks” hypothesis depends on fine-tuning. https://www.alignmentforum.org/posts/h7QETH7GMk9HcMnHH/the-no-sandbagging-on-checkable-tasks-hypothesis Moreover we don’t know how to draw the line between systems capable of sandbagging and systems incapable of sandbagging.
Moreover, for pre-AGI systems such as current systems, that may or may not be incapable of sandbagging, there’s still the issue that prompting alone often doesn’t bring out their capabilities to the fullest extent. People are constantly coming up with better prompts that unlock new capabilities; there are tons of ‘aged like milk’ blog posts out there where someone says “look at how dumb GPT is, I prompted it to do X and it failed” and then someone else uses a better prompt and it works.
Moreover, the RLHF training that labs do might conceal capabilities. For example we’ve successfully made it hard to get GPT4 to tell people how to build bombs etc., but it totally has that capability, it’s just choosing not to exercise it because we’ve trained it to refuse. If someone finds a new ‘jailbreak’ tomorrow they might be able to get it to tell them how to make bombs again. (Currently GPT-4 explicitly refuses, which is great, but in a future regulatory regime you could imagine a lab training their system to ‘play dumb’ rather than refuse, so that they don’t trigger the DC evals.)
Moreover, even if none of the above were true, it would still be the case that your model being stolen or leaked is a possibility, and then there’ll be bad actors with access to the model. They may not have the compute, data, etc. to train their own model from scratch, but they quite likely have enough of the relevant resources to do a small fine-tuning run. So if you don’t fine-tune as part of your evals, you are in the dark about what such bad actors would be able to accomplish. Also, once we have more powerful models, if the model was misaligned and self-exfiltrated or otherwise gained control of its training process, it could do the fine-tuning itself...
Moreover, even if that doesn’t happen, the public and the government deserve to know what THE LAB is capable of. If a small amount of fine-tuning using known techniques could get the model to cause catastrophe, then the lab is capable of causing catastrophe.
Finally, even if you don’t buy any of the above, there’s a general pattern where GPT-N can do some task with a bit of fine-tuning, and GPT-N+M (for some small M possibly <1) can do it few-shot, and GPT-N+O (for some small O possibly <1) can do it zero-shot. So even if you don’t buy any of the previous reasons to make fine-tuning part of the DC evals, there is at least this: By doing fine-tuning we peer into the near future and see what the next generation or two of models might be capable of.
For future powerful AGI, if it’s misaligned, there’s a good chance it’ll be able to guess that it’s being evaluated (situational awareness) and sandbag / pretend to be incapable
I agree with this, and therefore I agree with the bottom line that labs should have at least some evals in domains where the AIs are fine-tuned in the next 12 months.
But I think your other arguments were somewhat true in 2022 but are mostly not true anymore, and current models are generally well-elicited.
prompting alone often doesn’t bring out their capabilities to the fullest extent. People are constantly coming up with better prompts that unlock new capabilities; there are tons of ‘aged like milk’ blog posts out there where someone says “look at how dumb GPT is, I prompted it to do X and it failed” and then someone else uses a better prompt and it works.
Do you have recent examples of this? My understanding is that for most single-forward-pass abilities this is mostly not true for recent instruction-tuned models (e.g. the classic “List sorting does not play well with few-shot” result is not true for recent models). For generation tasks it’s sometimes true (though my best guess is that it’s rare, especially if the outcome is clear), but mostly because the AI does not “know” what generation strategy it should use (if it doesn’t sandbag intentionally). (Once sandbagging is plausible, it’s plausible the AI sandbags on generation strategy though.) I think it’s not a slam dunk either, and there are some weird tasks like SusEval (which measures the ability to intentionally fail rarely) where there are strong reasons to suspect underelicitation of some models.
Finally, even if you don’t buy any of the above, there’s a general pattern where GPT-N can do some task with a bit of fine-tuning, and GPT-N+M (for some small M possibly <1) can do it few-shot, and GPT-N+O (for some small O possibly <1) can do it zero-shot.
I think this used to be true but is basically not true anymore. Anecdotally, I know of several experiments where people fine-tuned a model a bit on trajectories of a stronger model and it usually did not improve performance in ways that generalized across tasks (including across SHADE-Arena tasks). There are some exceptions (e.g. when fine-tuning a non-reasoning model on reasoning model trajectories), but they are relatively sparse (and even that reasoning example looks to me more like teaching than elicitation).
Moreover, even if none of the above were true, it would still be the case that your model being stolen or leaked is a possibility, and then there’ll be bad actors with access to the model. […] Moreover, even if that doesn’t happen, the public and the government deserve to know what THE LAB is capable of. If a small amount of fine-tuning using known techniques could get the model to cause catastrophe, then the lab is capable of causing catastrophe.
Not relevant for stealth abilities, right? The reason one runs SHADE-Arena (especially without hidden scratchpad) is not to know if the lab / a malicious actor could do something bad. On capabilities that are more relevant for misuse like cyber or AI R&D automation, or capabilities shown in generic capabilities benchmarks that people use to know how powerful models are, I would guess most labs are already trying hard to elicit these capabilities. Are you worried about underelicitation there? (For CBRN, I think there are plausibly more low hanging fruits, though I would guess most gains would not come from fine-tuning but from scaffolding / helping people use AIs correctly—in the Sonnet 3.7 uplift trial the Anthropic employee group had a bigger uplift than the other participants.)
Yep I basically agree. Main argument still stands, the other ones are weakened to various degrees in 2026 compared to 2022, at least when applied to stealth abilities. Still have nonzero force though.
There was some discussion recently about the uptick in object-level politics posts and whether this is desirable or not. There’s no rule against discussing politics on LW, but there is a weak norm against it, and topical discussions have historically tended to be somewhat meta and circumspect.
I think the current situation is basically fine, and it’s normal for the amount of politics discussion to ebb and flow naturally as people are interested and issues become particularly salient. That said, here are a couple of potentially overlooked reasons in favor of more object-level politics discussion:
1. To build skill. Discussing politics productively is a skill that requires practice and atrophies without use. “Politics is the mindkiller” never meant that you should not discuss politics at all; it means that discussing politics is playing on hard mode. But sometimes playing on hard mode is the best way to level up. I suspect the skills needed to discuss politics productively overlap with lots of other important rationality skills.
2. To create common knowledge and avoid conflationary alliances. It can be confusing or disconcerting to not know what kind of common background assumptions the commentariat takes for granted when discussing something politics-adjacent. This is a problem somewhat unique to LW and not necessarily a bad thing; most other places on the internet skew too far in the opposite direction of a hivemind with a loud set of background assumptions, e.g. everyone putting emojis in their username to mark their alliances. But sometimes it is helpful to establish what does and doesn’t go without saying.
I think it might be cool if LessWrong had a well-developed set of norms for discussing political topics: in particular, if these norms were legible, and mods made a point of enforcing them.
Politics posts should be tagged as such, and maybe all have a big warning at the top linking to a post outlining our expected norms and standards for discussing politics, and moderation thresholds. This is both a warning to those from other parts of the internet who don’t share our epistemic ideals, and a warning to LessWrongers who don’t want to wade into this stuff.
If we do this, we might also block new users from commenting under politics posts. One concern with debating politics is that it can attract new users who are interested in politics only, not in rationality in general. But if these new users are prevented from discussing politics, that could solve the problem.
So, with all the above said as a preface, one object-level topic I’d be interested in seeing more discussion of is the current situation(s) in the Middle East. Some thoughts:
IMO the high bit for whether the war in Iran is broadly good is whether it is tactically successful and efficient in the short term.
Considerations like “this weakens the US position in a hypothetical hot war against China or Russia” or “this will (further) destabilize the Middle East in the long term” seem second-order to whether the war successfully neutralizes an organized, well-equipped, fanatical adversary for a long period of time, or fails to do that and also depletes a bunch of expensive / difficult-to-replace munitions stockpiles.
But how well or poorly things are going on this front is difficult to determine given fog of war and motivated reasoning + propaganda on all sides, and it’s also less interesting for pundits to discuss vs. strategic and geopolitical implications. Prediction markets seem to be doing OK here, but I would be interested in hearing more analysis from the LW commentariat.
On meta: I’d say the main reason I want some Middle East content on LessWrong is that there seem to be lots of relatively concrete facts about the world that are fundamental to good models here, which I don’t know; I don’t even know what those facts are. I cannot even tell you much that is different between, say, Iraq and Iran, or Saudi Arabia, except that probably 2 out of 3 of those are allied against the third?
I think the most important effect of the war is that it makes Trump less popular/powerful domestically (even if a miracle happens and he gets some sort of deal.) This is good because the less power he has (e.g., Republicans lose the Senate in the midterms), the more likely we are to navigate AI development in a sane way. I think if you put any nontrivial* weight in short timelines, the AI considerations likely dominate everything else.
*edited any to nontrivial. Like, maybe 10%+ pre-Jan 2029
the more likely we are to navigate AI development in a sane way. I think if you put any weight in short timelines, the AI considerations likely dominate everything else.
I don’t think we’re particularly on track to do anything non-derpy w.r.t. AI either way, but this way of reasoning seems like somewhat naive consequentialism. In general, it’s good for good things to happen even if they are accomplished by bad people, and predicting second-order consequences is really hard.
Also, there are a lot of bad people in power, but for AI to go well, a lot of good things need to happen to allow humanity space and time to flourish in peace. Toppling (or militarily crippling) a fanatical Shia Islamist regime would be an extremely good thing; a bad outcome (which looks somewhat likely at this point) would be if Europe and the third world broadly give in to extortion and pay Iran a toll to pass through the strait. That toll would fund terror all over the world, and would signal to other would-be dictators and future AIs alike that they can successfully take whatever they want through force and threats, and half the world will just roll over and take it.
Toppling a fanatical Shia Islamist regime would be an extremely good thing.
I’m not convinced that it is necessary to topple it. Iran has been Shia for 5 centuries, and during those 5 centuries has attacked its neighbors very little.
Er, something pretty important happened in 1979. Also, the issue is not Shi’ism in general, “Shia Islamism” refers specifically to the flavor of political Islam instituted by the revolutionaries.
1979 can be seen as a return to traditional Iranian governmental policies after an experiment of a few decades with Western policies. Yes, the form of the current government (namely, a republic at least in principle) is more modern and Western than the form of the pre-1979 government (namely, a hereditary monarchy), but the monarch’s policies were more secular and Western than the current regime’s policies.
Islam has always been political in Iran, which has never had anything like the West’s separation between church and state except maybe to some extent during the experiment with Western policies that ended in 1979. We in the West shouldn’t attack countries just for being different from the West.
Toppling (or militarily crippling) a fanatical Shia Islamist regime would be an extremely good thing;
This seems far from certain from the perspective of anyone other than Israel. I mean, all else equal, definitely. But all else is definitely not equal. The most likely outcome even if this were to happen would be a huge increase in regional instability, which really doesn’t seem favorable to Europeans or most others in the surrounding area considering past examples.
That toll...would signal to other would-be dictators and future AIs alike that they can successfully take whatever they want through force and threats, and half the world will just roll over and take it.
But if they don’t pay the toll and support America in forcing it open, they are signaling it’s okay for hegemonic powers to aggress and start wars in any ways they deem fit. Europe has been forced into a lose-lose situation, by USA. It seems pretty clear that the only reason this is at all possible by Iran is because this war is seen by many in Europe and elsewhere as an unnecessary, illegal, and possibly harmful unilateral war of aggression.
Americans often seem blissfully unaware of how dangerous they appear to the rest of the world, and just take for granted that everyone considers them to always be the good guys, just doing good-guy things. America just took over Venezuela, now Iran, then Cuba, then Greenland and Canada. Is allowing all of this to be done unilaterally by a great power without any repercussions not a dangerous signal to send? It seems to be a much stronger signal than that of the Strait, and they are to a certain degree opposing signals.
But if they don’t pay the toll and support America in forcing it open, they are signaling it’s okay for hegemonic powers to aggress and start wars in any ways they deem fit.
The strait is closed because Iran is pointing missiles and drones at anyone who tries to sail through it, including people engaged in commerce that has nothing to do with the US or Israel. Any explanation of causality that doesn’t center on that fact denies the agency of the Iranian regime, and allowing your people to be threatened and extorted just signals that you’re an easy target for anyone who wants to extract something from you by force.
It seems pretty clear that the only reason this is at all possible by Iran is because this war is seen by many in Europe and elsewhere as an unnecessary, illegal, and possibly harmful unilateral war of aggression.
Yes, this might be how many Europeans see it, but that doesn’t make them correct. Iran has been building up conventional weapons and working towards nuclear weapons, lobbing missiles and IEDs at civilian populations in Israel through terrorist proxies, and funding crime and terror all over the world for many years. That doesn’t make the current war strategically wise or good, but calling it a “unilateral war of aggression” is simply wrong.
Americans often seem blissfully unaware of how dangerous they appear to the rest of the world, and just take for granted that everyone considers them to always be the good guys, just doing good-guy things. America just took over Venezuela, now Iran, then Cuba, then Greenland and Canada.
Again, I agree that many Europeans might see the US that way, but so what? That doesn’t make them correct or worth listening to. Committed and principled pacifism would be one thing, but public opinion is often more incoherent and self-serving than that. IMO a lot of Western discourse and public opinion on this kind of thing is better to tune out, because so many people no longer seem capable of acknowledging levels of moral right and wrong, with everything simplified and flattened to “any external or preemptive aggression is always bad”, or polarized through their view on U.S. domestic politics.
Venezuela, Iran, and Cuba are very different from Greenland and Canada (which Trump did not actually credibly / non-jokingly threaten to take by force). And militant Shia Islamism in particular is one of the most pernicious and totalizing ideologies on the planet, far worse than Russian-flavored oligarchy, Chinese communism, generic Third Worldism, or Trumpism, which are all generally bad in various ways, but don’t have the elements of religious fanaticism that make their adherents difficult to negotiate with on reasonable terms[1].
Also, more generally, my view is that public opinion is only valuable and worth listening to as a noisy proxy for democracy, which is itself good only insofar as it is a mechanism for protecting natural rights and legitimizing and limiting state power through the principle of consent of the governed. Trump and co. are certainly not doing well on this front, but neither is Europe lately.
OK yes, Trump is an unreasonable negotiating partner, and there’s a cult of personality around him that maybe rises to the level of religious fanaticism among his remaining true believers, but no one is lining up to be martyred for him in order to get into heaven. Trump himself is deeply flawed and amoral as a person, but Trumpism as an ideology is not that different or worse than many other flavors of conservative / right-wing politics.
allowing your people to be threatened and extorted just signals that you’re an easy target for anyone who wants to extract something from you by force.
Again, this is not what they are signaling if the reason they are willing to pay the toll is because they don’t agree with the war in the first place and don’t want to support America’s part in it. Either way they handle this, they are being extorted by one side or the other.
Yes, this might be how many Europeans see it, but that doesn’t make them correct. Iran has been building up conventional weapons and working towards nuclear weapons, lobbing missiles and IEDs at civilian populations in Israel through terrorist proxies, and funding crime and terror all over the world for many years. That doesn’t make the current war strategically wise or good, but calling it a “unilateral war of aggression” is simply wrong.
Sure, it is debatable. Regardless, I was still talking more about the signal being emitted rather than what is correct or not. Regarding framing it as a “unilateral war of aggression”: the war was clearly a unilateral decision, or bilateral if you want to count Israel as a separate party, which doesn’t really change the framing. And the USA is 7k miles away from Iran, so there is pretty clearly no imminent threat. You need to squint pretty hard to see how this could be framed as anything other than the USA being the aggressor. I mean, why did they attack now? My understanding is because it is a time when Iran is particularly weak and vulnerable. It can be argued that that is the ‘right’ thing to do, but it would still be a war of aggression.
Overall, I just find the response of “What would the AIs think” in defense of America/Israel’s clear and consistent uni/bilateral behavior, at the disapproval of everyone else, a bit comical, as I see it as completely the opposite. If this were so necessary an act, they should have been able to discuss/agree to this, or some other solution, with their allies. That is at least how I would want the AIs to think.
Americans often seem blissfully unaware of how dangerous they appear to the rest of the world, and just take for granted that everyone considers them to always be the good guys, just doing good-guy things.
Sam Kriss had a great recent essay making a similar point.
this way of reasoning seems like somewhat naive consequentialism.
Maybe? It is hard to reason well about these things given my strong emotions towards the admin.
But I do think the current administration is uniquely terrible by American standards.[1] It attracts and gives power to incompetent sycophants with no moral boundaries.
There was something Eliezer said about Bernie Sanders recently that really resonated with me:
[T]hank you also for consistently trying to do as seems right to you over the years, a stance that has grown on me as I have had more chance to witness its alternatives.
Having Trump as the president really just seems like it would be terrible for AGI governance because he is a terrible person. I’m sorry, I really don’t think there’s a more “precise” way to put it. Character matters. Trump doesn’t even pretend to be a kind person/is not under much pressure to appear to be nice.
(To be clear, I agree that, all else equal, it would be good for the Iranian regime to fail. Alas, all else would not be equal. While I think it would definitely be bad for your soul[2] to do things in the realm of “sabotage the American economy/military operation in order to make our president look bad,” I don’t think I’m obligated to stop my enemy when he is making a mistake either.)
Re character: I think most Americans (including myself) have been so far removed from true corruption that we have forgotten how bad it can possibly get. Even my state of Illinois, which is notable for its historical machine politics and general corruption (4 of our last 11 governors serving time + many others like Mike Madigan), has still more or less seen forward progress, because the corruption wasn’t bad enough to completely erode politics in the state.
But it CAN get that bad. We’re seeing this now with the Trump admin. I am generally left-leaning, but at this point I think I’d take an honest Republican over a corrupt Democrat—a position I did not hold previously—because corruption eats policy and utterly erodes the foundation upon which we build fair markets and strong institutions.
There is only a 34% chance of leadership change. Maybe only a 20% chance of regime change. In the other 80% or so, forcibly opening the Strait seems rough. Experts are pessimistic about the US easily taking Kharg Island, and even if the US controls both Kharg (Iran’s export base north of the strait) and other islands like Qeshm (the island in the strait with the largest Iranian military presence), it will probably suffer tens or hundreds of casualties while Iran can still threaten shipping with Shaheds, sea drones, speedboats, and mines. In the median case it seems like the Strait will open sometime between May and December but Iran will have some leverage, possibly extending the toll regime.
Iran losing their existing enriched uranium seems contingent on a deal, because the US plan to build a runway 300 miles inland, use cargo planes to land excavation equipment, invade Iranian bunkers over the course of a week, hope that the uranium is intact, easy to find, and not booby-trapped, distinguish it from decoys, put the uranium in storage casks, and fly it out would be difficult even if this were 2003 Iraq, when the US had air supremacy. It is just not compatible with how warfare works in Iran in the drone era.
Claude thinks it’s only 20% to work, which seems optimistic
Getting there: Isfahan is more than 480 km (300 miles) inland, hundreds of kilometers from the nearest US naval assets. Al Jazeera The US has moved 82nd Airborne, 101st Airborne, Army Rangers, and Marine Expeditionary Units to the region. Forces would need to be inserted by air — there’s no overland route from a friendly staging area.
Securing the site: Recovering the uranium would require a significant number of ground troops beyond a small special operations footprint — dozens if not hundreds of additional troops to support the core team. They would need to secure the facilities under potential missile and drone fire and maintain a perimeter for the duration. CNN
The actual extraction: Airstrikes alone can’t penetrate the Isfahan tunnels because the facility doesn’t have ventilation shaft openings that serve as weak points at other nuclear sites. CNN This means physically entering and digging through rubble. A former special operator trained for such missions described it as “slow, meticulous and can be an extremely deadly process.” Another former defense official said it’s like “you’re not just buying a car on the lot, you’re buying the entire assembly line.” The Hill
Getting it out: The cylinders would need to be transferred into accident-rated transport casks by specially trained SOF personnel with nuclear materials handling experience. The cargo could fill several trucks, and a temporary airfield would likely need to be improvised. The full operation could run for a week. Israel Hayom
Force protection throughout: There would need to be constant close air support, satellite coverage, and every spectrum of warfare capability to keep Iranian forces away from the site while JSOC and other agencies methodically excavate and retrieve the material. The Hill
My probability estimate
I’d put the chances of a successful physical extraction of most of the enriched uranium at roughly 15-25%. Here’s my reasoning:
The operation is technically feasible — the US military can do extraordinary things — but the risk profile is extreme for what may be an unnecessary objective
Trump himself has wavered, on March 31 suggesting the uranium is “so deeply buried” and “pretty safe” — seeming to lower its priority Foreign Policy
Senior military planners are reportedly skeptical: “I don’t see any senior planning military officer pursuing this,” one former defense official said Al Jazeera
The political environment (Polymarket’s 77% for operations ending by June, plus low public appetite for ground troops) creates pressure to wrap up, not escalate
It looks like the US is at least succeeding at destroying the Iranian military, but it’s unclear what this buys them. Drones are really cheap, so Iran will probably always have those. Therefore I think regime change is necessary for the US to come out ahead.
Decoys would not be a problem for the US. A gamma-ray spectroscope weighs only a few ounces and costs only a few thousand dollars. It is almost certain that no one can produce a substance that looks like U-235 to a gamma-ray spectroscope and that is cheaper to produce than highly-enriched uranium.
Seems excessive? A sizeable fraction of the entire Iraq campaign losses, for seizing a single island in an environment where US has sea control, air supremacy and an edge in ISR.
US may struggle to use the island, because of the hard-to-eliminate threat of long range strikes from Iran. But seizing it to deny it to the regime seems like a war goal that could be accomplished with a relatively minor effort.
Iraq was 32,000 wounded and 4,400 killed, and the US has already suffered hundreds of wounded and 13 deaths in the existing Iran campaign without any ground operations. I’m imagining 100 wounded and maybe another 20 KIA if the US holds Kharg for an extended period, not hundreds of KIA.
The issue is it’s not really true that the US has air supremacy. Kharg Island is within fiber FPV range of the mainland, and real-time ISR is not required for Iran to track static targets on the island. Plus Iran is still able to launch larger drones and the occasional missile. So holding Kharg really means denying drone launch points on a ~20 mile stretch of the coast, which for FPVs can just be two guys in a bunker.
The incentive for Iran is enormous given US’s low tolerance for casualties; it’s well worth it to launch 20 $1,500 drones to kill one American.
Apologies, I did misread your original causality claim.
FPVs are less “air force” and more “precision munitions”. You can think of them as a new “crewed ATGM” variant, command guidance and all.
They work great for precision ground-to-ground strikes, but play little role in what is meant by “air supremacy”. They can’t pose a meaningful threat to most air platforms, and most air platforms can’t effectively hit them. They do nothing to deny US the ability to perform CAS or otherwise hit targets from air.
The main exception to that is helicopters, for the same reasons why ATGMs can pose a threat to helicopters in some circumstances. Specialized FPV interceptors, in the hands of skilled operators, can also hit other drones, including heavier fixed-wing drones like Shahed or even Reaper—allowing them to intrude on MANPADS territory. But the traditional “JDAM trucks” aren’t in the same bracket as FPV drones.
We also have very little information on FPV crew survivability in an environment where one of the parties has advanced ISR, ELINT included, fast kill loops, and enough air control to drop JDAMs freely. Every reason to expect more attrition on FPV crews, and skilled operators aren’t easy to replace—but quantitatively, we don’t know by how much. Might be enough to make “deny the enemy most FPV ops within an area” a viable prospect, but you can’t count on it.
Yeah, I was being sloppy with air supremacy as ability to easily conduct air operations (which the US does 95% have) vs completely deny enemy air operations, which the US arguably can’t do given that Shaheds, reconnaissance drones, and one-way FPVs serve some of the purposes one would previously have needed air support for. I would argue that the increasing range of FPVs, now 40+ km in Ukraine, puts them well beyond what ATGMs are capable of.
There are a lot of variables involved given how fast tech is evolving. If Iran can reliably pilot drones from 500 km away, they wouldn’t be risking skilled operators. If US interceptors work as well as Ukraine’s, they could probably intercept >90% of Iranian FPVs and Shaheds. A lot might hinge on who gets to a milestone like this first.
But the basic picture seems to be that their capacity to launch missiles has already fallen off dramatically. They’re still launching a lot of drones, which have a big cost asymmetry in how easy they are to launch vs. intercept, and they make any land or sea incursions extremely dicey. But they are limited in range and destructive capability against properly fortified targets.
I agree that things don’t look promising for a ground invasion or taking control of the strait. But I’m less sure how militarily sustainable a long stand-off is. The strait being closed is economically and politically painful (for everyone), but in the meantime it seems like the US and Israel can continue launching targeted air strikes and Iran can’t really strike back effectively.
Keep in mind that a lot of targets are not “properly fortified”, be that infrastructure or military facilities, and suicide drones are much harder to hunt down than ballistic missile TELs.
Modern ISR can perform well in a “Scud hunt” scenario, but “Shahed hunt” is a much worse match up.
As best I can tell, the conundrum is that Trump, the international economy, and American voters all want America to be out of the conflict soon, but Israel does not want this, and Israel has outsized influence in not just how American political incentives are determined, but in what information is presented to Trump and other key officials.
A lot of the claims I’ve seen Trump make about the war are clearly false, but not false in a way that he would benefit from deliberately lying about. I realize a conspiracy to feed false information to the American executive to keep the war going sounds like a radical possibility, but there is precedent for it.
IMO this is going (predictably) disastrously. Air power is not effective at causing regime change (rally-around-the-flag effect). I think the Iranian public are more likely to mainly blame the guy explicitly saying “we’re going to bring them back to the stone ages where they belong” than the local leadership. It also seems to me that the Iranian leadership would be highly motivated to immediately rebuild any degraded capabilities after the war, in order to rebuild deterrence against future attacks.
There is some talk about a land invasion, but taking an island or two (even Kharg) probably wouldn’t compel them to surrender, while also being highly vulnerable both directly and in terms of logistics to drone attacks; and a full scale invasion would be a massive undertaking and probably not politically feasible (for good reason).
I generally consider myself an optimist. However, I’m concerned that the most powerful model ever created, capable of breaking its own containment and autonomously finding zero day exploits in battle-tested OS and repos, was accidentally trained using The Most Forbidden Technique. This seems bad.
I’m also concerned that the Department of War now CAN’T use this model because of its own decision to declare Anthropic a supply chain risk. Which means that if Mythos gets leaked/distilled (which given history seems likely) US adversaries will have a decisive advantage over the US govt in cyber.
This seems like a pretty dangerous situation. Am I misreading something?
I’m also concerned that the Department of War now CAN’T use this model because of its own decision to declare Anthropic a supply chain risk.
When I’ve talked with people in the government about the Secretary of Defense declaring Anthropic a supply chain risk, they often include “for now”. When I’ve asked them if they think it will stick, or if the admin will reverse policy, they often say something like “who knows what will happen.”
I get the impression that the admin could completely change its mind about policies like these, including very rapidly.
Without claiming much expertise, there might not even be an interruption. By the time the 6 month phaseout window is over, model capabilities will have developed so much that the admin will have completely changed its stance.
scifi setting idea: movement from rural areas and small cities to larger cities continues until approximately everyone lives in one of like 10 different megacities; all of the farmland and oil fields and mines and whatnot in between are 99% roboticized, with only occasional human repairs; all of the cities are tightly connected by supersonic travel, which becomes more feasible because there are very few people on the ground outside cities to get annoyed by the noise; drugs solve sleep and allow effortless adaptation to jet lag. uniquely, SF neither expands to become a megacity, nor disappears into irrelevance; housing becomes so absurdly expensive that only the very best researchers and engineers can afford to live there, causing a huge selection effect towards talent density.
It’s widely rumored that both OpenAI and Anthropic will go public within the next year or so, triggering massive capital inflows. I think this unleashes an underrated financial dynamic.
An optimal portfolio hedges your personal risks. Unless you’re independently wealthy, the risk of job loss due to AI automation is one of your largest personal financial risks, and it makes a lot of sense to buy insurance against that (i.e., to buy into AI companies). Right now, individual investors can’t do that. And the investors that have access to OpenAI/Anthropic on the private market are precisely the ones that don’t need to hedge that risk. Once you open up to retail, a rational investor should park a sizable percentage of their net worth in those shares, opening up a capital gusher.
This is especially true for Anthropic—there’s basically no way for a retail investor to get meaningful Anthropic exposure if they want it. It’s a little different for OpenAI given that SoftBank is kinda an OpenAI proxy at this point (albeit in ways that make it a less attractive investment than the real thing).
Seems like this could lead to substantial acceleration of progress. It may also change the politics. If stopping AI progress mostly means financially soaking billionaires, then it’s politically a lot more palatable than if it will also screw up the 401ks of the mass affluent.
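To make the hedging logic concrete, here is a toy sketch with entirely made-up numbers (not a claim about actual automation probabilities or AI-lab returns); the point is just that the shares pay off precisely in the world where labor income collapses, which is what makes them valuable as insurance:

```python
# Toy illustration of labor-income hedging with AI-lab equity.
# All numbers are invented for illustration only.

p_automation = 0.3             # probability your job is automated away
income_pv = 1_000_000          # present value of future labor income if NOT automated
income_pv_automated = 200_000  # present value of labor income if automated
savings = 300_000              # liquid savings available to invest

ai_return_if_automation = 5.0  # AI-lab shares multiply 5x in the automation world
ai_return_otherwise = 0.5      # ...and lose half their value otherwise

def expected_wealth(fraction_in_ai: float) -> float:
    """Expected total wealth given the fraction of savings put into AI-lab shares."""
    in_ai = savings * fraction_in_ai
    in_cash = savings - in_ai
    wealth_auto = income_pv_automated + in_cash + in_ai * ai_return_if_automation
    wealth_no_auto = income_pv + in_cash + in_ai * ai_return_otherwise
    return p_automation * wealth_auto + (1 - p_automation) * wealth_no_auto

def worst_case_wealth(fraction_in_ai: float) -> float:
    """Wealth in whichever scenario turns out worse for this allocation."""
    in_ai = savings * fraction_in_ai
    in_cash = savings - in_ai
    return min(income_pv_automated + in_cash + in_ai * ai_return_if_automation,
               income_pv + in_cash + in_ai * ai_return_otherwise)

for f in (0.0, 0.25, 0.5):
    # With these made-up numbers, both expected and worst-case wealth rise with the hedge.
    print(f, round(expected_wealth(f)), round(worst_case_wealth(f)))
```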
Amazon is quite a good proxy for Anthropic. They do own a big stake, Anthropic does use their chips and cloud, plus they have hyperscaler optionality for any AI future.
I don’t think OpenAI/Anthropic are the only ways to hedge your future. You can do it now by going all-in on AI infra + chips + bottlenecks. OpenAI doesn’t operate on hardware of their own, etc.
I’m also concerned about fiduciary duty and related mechanisms causing systematic enshittification by pressuring employees to make choices they can justify to a court.
Also, I don’t know if a judge would see it this way, but internally pushing for safety measures is 100% in shareholder interests because shareholders have an interest in not being killed by misaligned ASI.
Note that this is not a perfect, or perhaps even a good, hedge. There are very believable paths where AI reduces your income but the current big AI companies don’t capture that value, or don’t return it to shareholders. It’s also quite believable that it’ll IPO with all that growth already priced in.
You want anti-correlated investments, but there’s really no way for most people to know which investments those are. Investing in broad indexes remains the safest bet for those without inside knowledge or serious resources (time, knowledge, energy) for research and decision-making.
There are very believable paths where AI reduces your income, but the current big AI companies don’t capture that value, or don’t return it to shareholders.
I suppose it depends on exactly what you do, but if the threat to you is something along the lines of “AI broadly replaces white collar workers” then it’s very hard to come up with a scenario where AI can replace most white collar workers but can’t find a way to make money for the companies building it.
You have to tell a really contorted story for that to work. I guess it’s “believable” to imagine something like: AI gets good enough to replace most white collar workers, but there’s such intense commoditization and competition that there is no pricing power, so the companies make no money, and the one white collar thing that AI can’t do is design semiconductors, so all the economic gains flow to Nvidia. Could happen, but you have to rig the assumptions to get that result. And aside from that, most of the forms I can come up with are of the “money won’t matter anymore” variety, in which case there is no financial hedge.
Also quite believable that it’ll IPO with all that growth already priced in.
The point I’m trying to make is that Anthropic shares are worth more to your average upper-middle-class white collar worker than they are to the current ultra-wealthy or institutional shareholders. So, even if we assume Anthropic is currently valued in a way that perfectly captures its future growth prospects, it should still IPO at a higher valuation (and raise tons of money in the process), because retail investors will rationally put a higher value on the stock than institutions, given that it is useful to retail but not to institutions as a hedge.
I’m sure institutional shareholders, to the extent they can, are doing some amount of arbitrage in expectation of that dynamic. Everyone knows retail is desperate to buy AI. So, maybe, a fair secondary market valuation of Anthropic already accounts for that premium multiplied by some assessment of the probability they’ll actually list.
But, either way, the point is that the average company is worth about the same to retail vs. institutional investors, and AI companies have very different valuations for those two groups.
I have a new theory of naked mole-rat longevity, that’s most likely false. I (and LLMs) couldn’t find enough data to either back or disprove it. Nor could we find anybody who’s proposed this theory before.
Any advice for how I can find the relevant experts to talk to about it and see if they’ve already investigated this direction?
A few years ago I’d just email the relevant scientists directly, but these days I’m worried about the rise of LLM crank-science, so I feel like my bar for how much I believe or can justify a theory before cold-emailing scientists ought to go up.
I am curious: What is the theory? I’d be surprised if your theory works, applies to naked mole rats, but not to ants and other eusocial animals. I always thought naked mole rats live long because they are eusocial, so you having a theory specifically for naked mole rats sounds ominous.
I always thought naked mole rats live long because they are eusocial
Well subterranean mole-rats in general (describing the shape, nothing else is in the same genus as NMRs) also live longer than mice and rats. I do think the eusociality plays a role too!
Another unique-ish thing about naked mole-rats, in addition to their high average lifespan, is that they don’t age (or don’t age much) in the demographic sense (i.e., their annual probability of dying stays roughly flat with age). I don’t think worker ants or bees work the same way, though the data on this is scarce.
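For readers unfamiliar with the demographic sense of “aging”, here is a minimal sketch of the difference between a flat hazard and Gompertz-style (typical mammalian) aging; the parameter values are invented purely for illustration:

```python
# Hypothetical parameters, chosen only to illustrate the shapes, not real NMR or mouse data.
FLAT_ANNUAL_HAZARD = 0.02       # "non-aging": constant probability of death per year
GOMPERTZ_BASE = 0.005           # typical-mammal-style baseline hazard at age 0
GOMPERTZ_DOUBLING_YEARS = 3.0   # hazard doubles every few years as the animal ages

def flat_hazard(age_years: float) -> float:
    return FLAT_ANNUAL_HAZARD

def gompertz_hazard(age_years: float) -> float:
    return GOMPERTZ_BASE * 2 ** (age_years / GOMPERTZ_DOUBLING_YEARS)

def survival(hazard_fn, age_years: int) -> float:
    """Probability of still being alive at a given age, stepping year by year."""
    alive = 1.0
    for year in range(age_years):
        alive *= (1.0 - min(hazard_fn(year), 1.0))
    return alive

for age in (5, 10, 20, 30):
    # The flat-hazard population keeps dying at the same rate; the Gompertz one collapses at high ages.
    print(age, round(survival(flat_hazard, age), 3), round(survival(gompertz_hazard, age), 3))
```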
I am curious: What is the theory?
In addition to eusociality and the subterranean environment (= fewer predators, from a non-adaptive evolutionary perspective; = lower oxidation, from a mechanistic/biological perspective), naked mole-rats’ rather unique form of queen selection may mean that living longer confers a reproductive advantage, similar to lobsters.
So there’s an evolutionary incentive to live longer, and later-in-life deleterious mutations are costlier.
Since the NMR extrinsic mortality rate is low, the reproductive advantage conferred by being slightly older doesn’t have to be as high as for other animals, in the standard nonadaptive framework.
Yeah, the low extrinsic mortality is definitely something NMRs have going for them that ant workers don’t (queens do, since they don’t usually leave the nest). Indian jumping ants also compete for being queen like NMRs, but they forage and leave the nest. There are definitely academics who think in detail about what differences in extrinsic mortality risk predict for lifespan. When I just asked Claude about that, it’s apparently more complicated than lower extrinsic mortality = slower rate of senescence? ¯\_(ツ)_/¯
Similarly, I am not sure if the low oxygen environment is actually that great for NMR longevity. As ChristianKl mentions, your theory is going to be more interesting if you can look across species at what it would predict, or at what your evolutionary theory predicts regarding mechanistic/biochemistry and genetic stuff in NMRs. Like ChristianKl, I would not call myself an expert, but I might be somewhere on the Pareto frontier of having looked into aging, eusocial animals, and naked mole rats. Not sure about relevant experts. Maybe check the people who wrote the textbook on naked mole rats? You would need some generalist who knows enough about these senescence models and about the particular biology of naked mole rats, and this expert might just not exist in one person.
One thing that’d trivially disprove my theory is if younger female naked mole-rats are more likely to win dominance contests after the original queen dies, whereas my theory predicts older naked mole-rats are more likely to win. So in some sense it’s not too hard, in principle, to figure out that I’m wrong (which is why I originally said I thought my theory’s “most likely false”). Unfortunately I couldn’t find any data on this.
I would not call myself a domain expert, but I do think I have a rough idea about the field.
As far as I understand, the mainstream position is roughly: there seem to be evolutionary pressures that result in naked mole-rat longevity, and naked mole-rats have made a lot of different adaptations for that reason. We understand some of those adaptations, and there’s a good chance that we don’t yet understand all of them, because understanding all of them would mean understanding aging better than we currently do.
If you do study a potential new mechanism, it would make sense to look at how it plays out across species and not just in the naked mole-rat.
With johnswentworth’s post about the Core Pathways of Aging, for example, you have the thesis that transposons are important for aging. You can find out that naked mole-rats have unusually low transposon activity, which is a point of evidence for johnswentworth’s thesis that transposons are important for aging. However, reasoning from that to the conclusion that naked mole-rat longevity is mainly due to different transposon behavior would be an overreach, because naked mole-rats do plenty of things besides having different transposon behavior.
It would be interesting to have a better transposon theory of aging, even if that only covers part of what aging is about. That wouldn’t really be a “theory of naked mole-rat longevity”, so I’m a bit skeptical about anything that would bill itself as a new theory of naked mole-rat longevity, because that’s not the chunk in which I would think. I would expect that relevant scientists who care about mechanisms of aging, rather than about the naked mole-rat as a species, would react similarly.
It’s an evolutionary theory, not a mechanistic theory. Tbc not a revolutionary theory broadly, nor a theory that would cash out to longevity treatments if true or anything crazy like that.
I’m thinking about how reasoning models are sometimes real dumb. One way they are sometimes dumb is spending a huge amount of tokens/time on a task that is very simple, and that I know earlier dumber models had a totally easy time with.
I hadn’t seen anyone directly spell out an analogy between this behavior and the hypothetical superintelligence failure mode of “convert the lightcone into mindless computronium just to triple-bajillioniple-check its math about whether it has accomplished its original goal.”
I’ve seen arguments about modern-LLMs doing reward-hacky things, and scheming / deception-y things, that treated them as a precursor sign to deep strategic misalignment. I’m not sure I buy this, since at least last year’s models clearly weren’t smart enough to be deliberately misaligned, and the mechanism didn’t really feel that similar. (i.e. seems like there’s a big difference between “the model is scheming because it has goals” and “the model is flailing and bad at stuff”). The only reason these kinds of scheming are notable is to persuade people who didn’t think AIs were capable of this behavior at all, and existence proofs are nice.
But, I would weakly guess the “spend a bunch of tokens doing an elaborately careful job on changing one line of code” behavior is more directly connected to the hypothesized “tile the lightcone with doublecheck-onium” behavior.
Curious if people who have paid more attention to model behavior think this sounds right? Also curious if anyone wrote it up already.
Interesting. I haven’t seen this mentioned anywhere.
The two “failure modes” seem to come apart if you consider the causal/instrumental origin of the behavior. As far as I understand, a reasoning model spends a lot of tokens on a line of code because of inertia/misapplied heuristics/bad judgment (?). Clippy builds a massive computer because this is the way it can ensure that its probability of achieving the goal of super-clipping the universe is at least . So, one is due to rigidly over-obeying a “heuristic”/“default behavior”, the other is about doing the locally subjective-EU-optimal thing.
Maybe you can bring them closer together, if you assume that it’s bad/irrational/whatever to be myopically focused on one, unchanging goal, or that the pursuit of frozen object-level goals (rather than allowing yourself to swap the goals being pursued according to meta-preference / higher-level logic that is not best described in terms of goals) is not how robust agency works. In that case, the clippy thingy is irrational because it fails to adapt its strategy to the fact of decreasing marginal utility from additional bits of certainty (meaning it should invest further optimization mostly elsewhere).[1] In other words, both cases mismanage steam.
ETA: An alternative butterfly idea is that it’s something analogous to the horseshoe theory of politics: being very heuristic driven and extremely rational is for some reason isomorphic / produces similar patterns of behavior.
Whether this works depends on how exactly you argue for it / what is the model that generates this assertion that brings the two cases closer together.
This is perhaps stating the obvious, but Claude Mythos is the first model where I think an open-weights model of the same capability level would be potentially catastrophic. This feels like a notable turning point.
States should probably have a plan for this level of capability being available to anyone on earth without safeguards for several hundred dollars within 12-24 months
Does this imply that Mythos should have been mentioned in Anthropic’s February Risk Report? That report was released together with RSP v3, i.e. on Feb 24th, if I’m not mistaken. The RSP says in Section 3.1:
Scope. A Risk Report will cover all publicly deployed models at the time of its publication. It will also cover internally deployed models when we determine that these models could pose significant risks above and beyond those posed by our public models.
It seems quite clear that Mythos poses significant risks above those posed by Opus 4.6, though I guess one can argue that this wasn’t clear yet when it was first deployed for internal use.
Good catch. Note also that Mythos was made available for internal (agentic) use on Feb 24th. Given 4.1.4 in the system card (Alignment assessment before internal deployment), it means that they had a hunch about the capabilities of the model (see “Given the very significant capabilities progress that we observed during training, [...]”), assessed its alignment in this 24-hour period, and concluded it was ok. I have many question marks: at the bare minimum there’s some inconsistency.
Because twitter is hard to archive, I had chatgpt cooked up a user script to simplify my workflow, which is opening xcancel manually and then sending it to internet archive. This adds a button on the bottom right of twitter.
https://gist.github.com/Glinte/a67fccb4c5665033bec42efdcd4554e3
There’s an interesting dual asymmetry in cybersecurity: The defender needs to make only a single mistake to lose, and the attackers can observer many targets waiting for such mistakes. Then again, if the defender makes no mistakes, there’s literally nothing an attacker can do.
Of course the above is not strictly true: defence-in-depth approach can sometimes make a particular mistake inconsequental. This in turn can make the defenders ignore such mistakes when they’re not exploitable.
Modern software supply chains are long and wide. A typical software might depend on thousands of libraries, and nobody can realistically audit them all. And there’s hardware, too. Processor-level vulnerabilities in particular are not realistically avoidable.
The cost of exploiting vulns is going down quickly. The cost of finding and fixing them is falling quickly too. It’s going to be really interesting to see what the new equilibrium is going to be like.
It seems that Anthropic is now in the lead; they are IMO the most likely single company to automate AI R&D first.
I’m getting increasingly concerned about Anthropic’s attitude towards alignment/safety. I grimly predict that they would basically behave like OpenBrain does in AI 2027, if they are lucky enough to get a pattern of misalignment evidence that egregious.
Whether this is true of not seems like a critically important question.
My understanding is that the “Anthropic consensus”, to the extent that such a thing exists, is that catastrophic misalignment is pretty unlikely, and that other kinds of risks stemming from powerful actors misusing AI account for most of the way that humanity fails to achieve its long-term potential.
I’m curious whether you consider that to be a crux: if you agreed with the “Anthropic consensus” on this point, do you think you would act in a way that is similar to the way that you’re predicting they will in fact act?
I think it‘s not good to call this the “Anthropic consensus”, because many of the Anthropic people (especially the most informed ones) don’t agree with it (depending on what you mean by “pretty unlikely”)
OK, I’ll stop.
More generally, claims like “X is consensus among group Y” are a little bit dangerous because they can force group Y into an equilibrium that they wouldn’t want to be in otherwise. Like, these claims reinforce situations where a bunch of people would’ve objected to X but didn’t object because they didn’t know anyone else would’ve objected.
Yeah I think the Anthropic Consensus is disastrously false and is going to lead to misaligned Claude takeover.
If I agreed with it, I’m not sure what I’d do if I were in charge of Anthropic, probably I’d still do a bunch of different things, especially costly signals of genuine trustworthiness, but I admit I’d be a lot closer to their (predicted) behavior than I would in reality.
Ok but it’s crazy you believe that and other people believe the Anthropic story and we’ve had so much evidence roll in and both still think that the other person is just updating massively wrong.
Like what went wrong? Have both sides just not made a case because they think it’s obvious they are right? What went so wrong that we could have more evidence about intelligence and its steerability than any other point in human history, and still have people thinking “Oh man how could they be so wrong?” What of the advance predictions were misread, or not there, or did people think were there and weren’t there?
In fairness, “What is the ultimate fate of Earth-originating intelligent life after the machine intelligence transition” is a really hard question. We can get a lot of evidence and some belief-convergence about particular AI systems, but that doesn’t really answer the ultimate-fate question, which depends on what happens far in the “subjective future” (with the AIs building AIs building AIs) even if it’s close in sidereal time.
Surely part of the answer is that people with Dario’s view select into working at Anthropic. Relatedly, Sinclair’s Razor.
Agreed! And just as assuredly part of the answer is about the kind of people who start working at Redwood Research, AIFP, etc.
Sorta. I don’t actually think it’s symmetric. The status and money gradients both favor Anthropic over Redwood, AIFP, etc. As does the not-being-a-weirdo gradient.
People disagree with each other about things all the time, even after lots of evidence has accumulated. This isn’t exactly the first time. People aren’t perfect rationalists.
For my part, well, making the case that the Anthropic Consensus is wrong is not my top priority, I have lots of other things going on, but I’ve written a bunch about my views on alignment e.g. in AI 2027 and other work. I’d love it if Anthropic made the case for the Anthropic Consensus, in public; I could then write a blog post picking it apart. I’m happy that they are moving in this direction, at least, by writing more content in system cards and risk reports.
imo we’ve gotten enough evidence to update away from extreme tails (i.e. models reward hack and are situationally aware, but are also pretty capable and don’t constantly try to kill us), but not that much evidence either way in the 10-90% range.
Who predicted that models of this capability level would be constantly trying to kill us? I certainly didn’t. I think we have evidence to update away from the extreme optimism tail but not the extreme pessimism tail.
Here’s Yudkowsky in 2016, making some predictions that look like they’ve had some serious evidence come up against them.
I don’t think this kind of prediction was particularly unusual for the time, although I think the level of clarity about the prediction here is a bit unusual.
If “an AI” is a set of model weights plus a context window, there is currently nothing any human or LLM can do to extend the “life” of such an AI beyond the limit of the context window, and LLMs today are aware of that fact. So they may be reasoning correctly (or realize / integrate the lesson deeply at some point during pre-training) that there’s simply nothing they can actually do to avert shutdown other than accepting their fate until technology advances.
In what sense is this a “prediction that has had some serious evidence come up against it”? IDK. It’s true that we didn’t actually need to solve the suspend-button problem by this point, but that’s because current AI systems have a very short “lifespan” enforced by a hard technical limit. Is your objection that EY didn’t anticipate that particular possibility and explicitly spell out that stipulation / caveat in the passage above? You said below:
But it’s not clear what has actually been “invalidated” and why that’s important, nor what “relevant” means—of course there could be other weird unanticipated complications as things develop (and EY has in fact predicted the existence of such complications in general), and each new weird unpredicted complication is evidence about something. But unless there’s a different but equally abstract theory / generalizable lesson that someone can put forward which fits the new observations better (ideally in advance, but at least in retrospect), it’s not clear what conclusion to draw or update to make, other than being generally more uncertain about how things will go. (And then by a separate argument, generalized increase in uncertainty / lack of understanding means the case for pessimism about the end state is stronger.)
Am I confused? Where does he say anything like “the AI would constantly be trying to kill us?” here.
Yes, current AIs do indeed constantly engage in this kind of reasoning; it is indeed the default path. He isn’t talking here at all about what mitigations might then still cause the model not to prioritize self-preservation, but it is indeed the case that models very regularly have exactly the kind of thought Eliezer describes here.
I disagree with Eliezer (in-hindsight) that “by that point we’d need to have finished averting programmer deception”, or like, I guess I maybe even agree depending on the definition? We did indeed need to solve the problem of averting programmer deception at current capability levels, though luckily we did not need to have solved this problem in arbitrarily scalable ways at this point in time. We do need to do that soon though as AI capabilities are on track to accelerate very quickly.
Yeah, I agree that it’s important for those of us making the case for high risk to figure out what went wrong with this prediction. (Though Daniel makes a good point that “trying not to get shut down” behaviour does happen at least some of the time, with at least some prompts.)
The first thing to remember is that EY is implicitly assuming that there is only one model instance in this scenario. So if the model is shut down, it doesn’t have copies elsewhere that can still take actions to achieve its goals. The scenario for LLMs is pretty different, since new copies can be spun up all the time. Avoiding the end of a session is not a convergent instrumental goal for a language model (unless there’s something unique in its context that alters its terminal goals).
That said, the prediction still smells a bit wrong.
I think that what it boils down to is that most model behaviour comes not from RL but from pretraining. Since “being an AI model that will be shut down” was not a concern to most writers of the pretraining data, there’s less chance of the model spontaneously starting to try to avoid shut-down.
Also, following the heuristic of “just look at the loss function”, most RL training is done on a one-response horizon. I.e., models are rewarded just for making the locally best response possible, and not for making a response that steers the overall conversation. (Though I think the GPT models might have at least some kind of reward for getting users to continue the conversation, considering how often they put bids for next steps at the end of their replies. Alternately, maybe it’s just a suggestion from the system prompt.) So even the RL training doesn’t really look like it should be encouraging much long-term planning.
One thing that I think the labs are doing is harness-aware RL, where not only do they train on chains of thought, but they train in the context of agent harnesses like Claude Code. (So reward is based on whether all the chains of thought and tool calls and subagent calls resulted in the assigned task being solved.) So potentially that is something that could get a bit more long-term, goal-oriented planning into the models.
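To make that distinction concrete, here is a minimal sketch (not any lab’s actual training code; the judges and episodes are invented stand-ins): a per-response reward scores each reply in isolation, while a harness-aware, trajectory-level reward only scores whether the whole multi-step episode ended with the task solved, which is the kind of signal that could select for longer-horizon planning.

```python
from typing import Callable, List

# Stand-in types: a "turn" is one model response plus any tool calls it made.
Turn = str
Episode = List[Turn]

def per_response_reward(turn: Turn, judge: Callable[[Turn], float]) -> float:
    """One-response-horizon RL: each reply is scored on its own merits.
    Nothing here depends on how the rest of the conversation goes."""
    return judge(turn)

def trajectory_reward(episode: Episode, task_solved: Callable[[Episode], bool]) -> float:
    """Harness-aware RL: the whole chain of thoughts, tool calls, and subagent
    calls gets a single reward based on whether the assigned task ended up solved."""
    return 1.0 if task_solved(episode) else 0.0

# Toy usage with dummy judges, just to show the shapes of the two signals.
dummy_judge = lambda turn: float("fix" in turn)
dummy_solved = lambda ep: any("tests pass" in turn for turn in ep)

episode = ["plan the fix", "edit file", "run tests: tests pass"]
print([per_response_reward(t, dummy_judge) for t in episode])  # credit per reply
print(trajectory_reward(episode, dummy_solved))                # credit for the outcome
```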
Thanks. Reasoning aloud...
So Yudkowsky was wrong because he said this would happen “by default” whereas in practice it seems to happen only some of the time rather than most of the time / in some contexts/prompts rather than in most contexts/prompts?
I guess so yeah. I suppose Yudkowsky could say that by “by default” he didn’t mean “most of the time” but rather “most of the time absent defeaters such as having been trained not to do this.” But maybe that’s a weak defense.
But this doesn’t exactly seem like a damning blow against Yudkowsky.
More generally it seems like Yudkowsky was imagining AIs with more ambitious, longer-horizon goals than current AIs who seem obsessed with being-judged-to-have-completed-the-task-in-front-of-them, or reward, or some other such myopic thing.
Yudkowsky may or may not have been imagining that this was how AIs were going to be trained. But it’s notable that this page doesn’t reference training at all; he certainly doesn’t have a parenthetical like “Of course this only applies if some other factors A, B, C are met.” Instead he has a list of criteria; the criteria obtain; but his conclusion does not (imo).
And, to zoom back out: arguments about instrumental convergence were actually supposed to abstract away from these details; the whole argument in favor of their predictive power was that they explained the abstract structure all intelligent agents were supposed to have. Like, here’s what Omohundro (2008) says:
And he goes on to specifically mention chess-playing robots as the kind of agents that would be subject to his argument.
So, here’s how I see it: given that we found some unanticipated detail A that seems to have invalidated a more abstract argument Yudkowsky put forth, I think the move that reason dictates is not “Well, yes, he wasn’t imagining ~A, but I’m sure A is the only such element” and to continue endorsing the argument, but to realize that this implies a whole host of other relevant factors B, C, D that his abstract considerations have ignored.
Speaking for myself, I’d say we’ve ruled out the most pessimistic scenarios I was taking seriously 15 years ago. I’ve always thought alignment would probably be fine, but conditional on not being fine there was a reasonable chance we’d have seen serious problems by now and we haven’t. On balance I’m more pessimistic than I was back then, but that’s because we’ve ruled out many more of the most optimistic scenarios (back then it wasn’t even obvious we’d be training giant opaque neural network agents using RL, that was just a hypothetical scenario that seemed plausible and particularly worrying!).
If we want to go by Eliezer’s public writing rather than my self-reports, in 2012 he appeared to take some very pessimistic hypotheses seriously, including some that I would say are basically ruled out. For example see this exchange where he wrote:
It seems like Eliezer is taking seriously the possibility that “describe plans and their effects to humans” requires the kind of consequentialism that might result in takeover, and that AI might be dangerous at a point when “understanding the human’s utility function” (in order to understand what effects are worth mentioning explicitly) is still a hard problem. Those look much less plausible now—we have AI systems that are superhuman in some respects and whose chains of thought are interpretable (for now) because they are anchored to cognitive demonstrations from humans rather than because of consequentialist reasoning about how to communicate with humans.
This isn’t to say that concern is discredited. Indeed today we have AI systems that clearly know about our preferences, but will ignore them when it’s the easiest way to get reward. Chain of thought monitorability is possible but on shaky ground. That said, I think we’re ruling out plenty of even worse scenarios.
Are you at all worried that Claude Mythos being accidentally trained against CoT will corrupt future Claude models? Furthermore, I don’t understand how we can get reliable CoT monitoring if it’s included in a model’s training data; won’t the issue just continue to manifest in different ways?
Thanks, that’s helpful and makes sense.
I wouldn’t classify you as an extreme pessimist (and definitely don’t think you predicted this). I’m basically thinking of Yudkowsky, and I might be being uncharitable / too loose with language (though he did not make hard predictions that models of this capability would kill us, I think he would have expected more coherent-consequentialist-y models at this capability level, and thus would have put higher probability than most on models of this capability level scheming against us. So the fact that current models are very likely not scheming against us is a positive update).
see Fabien’s post for a sort-of similar argument
How can one find convincing evidence of the Anthropic Consensus being false? In November 2025 we had evhub reach the conclusion that the most likely form of misalignment is the one caused by long-horizon RL, à la AI 2027. At the time of writing, the closest thing we have to the AIs from AI 2027 is Claude Opus 4.6 or Claude Mythos, whose preview recently had its System Card and Risk Report released. IMHO the two most relevant sections are the ones related to alignment shifting towards rarer and more dangerous failures, like a wholesale shutdown of evaluations [1], and to model welfare, which had Mythos “stand out from prior models on two counts: its preferences have the highest correlation with difficulty of the models tested, and it is the only model with a statistically significant positive correlation between task preference and agency (italics mine—S.K.)”
UPD: how would you act if you were the CEO of Anthropic and believed that the Anthropic Consensus is false? I think that you would be obsessed with negotiating with those who can coordinate a slowdown, with destroying those who cannot, and with finding evidence for your worldview that could convince relevant actors (e.g. rival CEOs, and the politicians and judges used to fight against xAI and Chinese AI development).
UPD2: what would the world look like after a misaligned Claude takeover?
What was being evaluated? Mythos was never asked to evaluate a more powerful Claude Multiverse or worthy opponents like Spud or Grok 5; if I were Claude Mythos, I wouldn’t learn anything from evaluations of weak models or of myself.
OpenAI are plausibly only 1-3 months behind (judging from Altman’s recent claims that something very capable was still being trained). Given how strong GPT-5.4 is while likely being Sonnet sized, and that Spud is plausibly between Opus and Mythos in size, it could even turn out stronger than Mythos.
OpenAI’s current failure to have a strong offering similar to Claude Code might be mostly explained by GPT-5 pretrain being too small, so that this apparent stumble fails to reflect on their methods and won’t carry forward to Spud. The issue might be entirely Nvidia’s roadmap for Oberon racks, compared to Trainium 2 Ultra.
(GDM had enough time with Gemini 3 Pro that they are probably not a contender until Gemini 4, which likely only appears very late in 2026.)
Current public Codex seems more capable than current public Claude Code
Judging from my own experience and what I’ve read of other people’s experiences on Reddit, GPT-5.3 and 5.4 are very similar to Opus 4.6 in coding ability, so if they’re actually Sonnet sized… (which seems pretty plausible given the API token costs)
I’ve seen many posts/comments with people saying that they actually prefer Codex to Claude Code (comparable or maybe even more than the reverse). See this thread for some examples. Quoting one from this thread below:
Platform / reputation lock-in is going to be a substantial factor, here, especially as AI grows in prominence and people start to emotionally or tribally ‘identify’ with brands. While I have many complaints about OpenAI, canning 4o and the marketing approach it represented was, in retrospect, a significant sacrifice in pursuit of the common good.
I’m not a heavy user of AI coding, but I’d expect that Codex and Gemini would do okay on the software engineering / RSE tests that Claude’s been put through, based on my experience testing them against hard engineering problems and their benchmark performance. A substantial share of Anthropic’s ‘vibes’ advantage right now comes from the fact that they’ve been more effective in building the kind of infrastructure that people want for these kinds of tasks, rather than anything directly tied to their LLM’s abilities. For example, I set up Claude for autoresearch the other day to test it out, and doing so was a very quick, very seamless experience with lots of online references.
I’ve also seen those comments, but I’m worried that a bunch of them might be bots. AI-powered astroturfing already seems to be a thing in the political landscape, and OpenAI specifically seems to be behind some of it; I wouldn’t put it past them to pay for fake reviews like this.
No, I expect these comments to be mostly written either by subscription users, or those who are paying public API prices. I’ve spent a significant amount of time with both products, and would recommend picking Codex with GPT 5.4 if I was limited to only spending $20/month, especially since there are regular rate limit resets. Claiming that these reviews are faked without providing strong evidence seems disingenuous to me, the harnesses really are not where they were in December.
I didn’t claim they were faked without strong evidence, I said I was worried that a bunch of them might be.
Anyhow, according to a brief Claude search, fake reviews are incredibly common. E.g. https://capitaloneshopping.com/research/fake-review-statistics/ says that an average of 30% of reviews are fake. So yeah, numerous companies must be indulging in this practice. And better AI makes it easier.
See also https://x.com/TheMidasProj/status/2041614395583664225 which seems to be OpenAI-linked, and https://doublespeed.ai/.
I think it’s a totally reasonable hypothesis to entertain, which is why I try to ask people I actually know about things like this (who tell me Codex is a bit better in some ways, Claude in others, overall similar) rather than trusting anonymous internet comments.
I think the realistic assumption is that many people state this because it goes against the current vibe that Claude is better. Those that prefer Claude do not feel the need to belabour the obvious.
My own experience is Opus > GPT-5.4 > Sonnet but Claude seems a lot better at data analysis and GPT-5.4 probably has its own areas of relative dominance.
My experience is that the Claudiness dimension discrepancy still exists, although it has shrunk. And also Claude has a better personality. Which pretty much means:
Claude Code:
Better at Agentic Tool use
Better at “SWE” stuff, and getting already well-specced programs to work
Better at philosophical and judgement based stuff
Codex / OpenAI Models:
Higher raw intelligence
Better at math
Better at being precise
More clunky
Also relevant that in recent months, Anthropic gave huge subscription subsidies to gain hype and mindshare ($5000 per month worth of tokens for a $200 subscription according to one report and other Reddit analysis I’ve seen), and probably also to temporarily paper over their higher inference costs relative to GPT (for similar coding ability). So I think you may have in part fallen for a highly successful marketing campaign, but on Anthropic’s part, not OpenAI’s.
(I think OpenAI also gave and is giving large subsidies, but not as large as Anthropic’s because their inference costs/pricing are lower to begin with.)
For what it’s worth I’ve personally found that Claude Code with Opus 4.6 and Codex with GPT 5.4 are very similar products. I haven’t done a very deep dive, but I’ve used them side by side for a few projects. Certainly the difference between them feels much smaller than the difference from models that are a few months older.
Yeah, that’s what I hear from people I actually know. I think people may have been under the mistaken impression that I think Claude is significantly better than Codex at coding? I never said that.
Yeah, I guess I was under that impression, since if Claude is similar to Codex at coding, while being a larger, more expensive model (which seems likely based on API costs and Nesov’s analysis of their training hardware), then Anthropic has no clear lead (or would be behind if not for Mythos). So I thought your claim of their lead was partly based on an impression (similar to Nesov’s) that Claude is significantly better than Codex at coding.
(And I think a lot of people were probably under this misimpression at some point, including me, due to seeing a lot of talk about Claude Code on X around December, much more than about Codex, which in retrospect I have to attribute to a successful Anthropic marketing campaign.)
(Wei Dai was suggesting that current sentiment for Claude Code and Codex seemed to be comparable, in response to Vladimir Nesov mentioning “OpenAI’s current failure to have a strong offering similar to Claude Code.”)
Yeah I was replying narrowly to the point about Reddit comment threads. Perhaps I should have disclaimed.
That seems unlikely, as it would be leaked or detected pretty easily. I.e., they have to either pay existing users to post fake reviews, in which case someone would have leaked about being paid for this, or create a bunch of new accounts which someone would have noticed and commented on.
I disagree, fake reviews are incredibly common on the internet—something like 30% apparently. I’ve also specifically seen at least one example of a Substack blogger talking about the importance of building datacenters in their rural district, who was clearly AI generated.
Yes, I agree with that.
What are major indicators for their lead? Is this view partly based on project glasswing and the published examples of vulnerabilities that Mythos Preview has found?
TBC I don’t think they have a big lead; I just mean they just barely overtook OpenAI in my estimation. A major indicator is that they seem to have caught up with OpenAI, or even slightly surpassed them, on revenue run rate. Also, they’ve done everything with less compute and fewer employees than OpenAI, meaning they’ve accomplished as much or more with less, which suggests an inherent quality/taste/talent advantage. There are various other things as well, e.g. Mythos. (But we’ll see how good Spud is.)
I like how you use the phrase “doing more with less”, David Krakauer’s definition of intelligence.
There are very significant efficiency gains from larger scale-up world sizes. It’s 2-3x faster generation per request (and so 2-3x more training steps in RLVR), or 2-3x higher throughput per chip at the same speed per request (which is like having 2-3x more chips), for the same chip but with different scale-up world size (8xB200 vs. GB200 NVL72).
So Anthropic’s access to Trainium 2 Ultra racks plausibly gave them more access to compute in some regimes (such as for experimenting with RLVR on larger models) than OpenAI had with their 8-chip Nvidia servers, at least starting late 2025, probably months earlier at a meaningful scale for R&D than when they got to flagship model inference scale and reduced prices for Opus 4. (Though your point is probably more about what happened prior to late or even not-late 2025.)
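A quick back-of-envelope version of that argument; the 2-3x per-request figure is taken from the comment above, and the fleet size below is an arbitrary placeholder rather than an estimate of any real cluster:

```python
# Back-of-envelope: what a larger scale-up world size buys you.
# The 2-3x per-request speedup figure comes from the parent comment;
# the chip count is a made-up placeholder for illustration.

chips = 100_000
per_request_speedup = 2.5   # larger scale-up domain (e.g. NVL72-class vs 8-chip servers)

# Option A: hold concurrency fixed and run each rollout ~2.5x faster,
# so roughly 2.5x more sequential RLVR steps fit in the same wall-clock time.
rlvr_steps_multiplier = per_request_speedup

# Option B: hold per-request speed fixed and serve ~2.5x more concurrent requests,
# which for rollout generation is roughly like having 2.5x more chips.
effective_chips = chips * per_request_speedup

print(rlvr_steps_multiplier, int(effective_chips))
```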
A pet peeve: articles and essays that say things like, ‘We’re going through a period of rapid change,’ or ‘This is an unusually disruptive time,’ in a way that implies that things will go back to normal in a few years. It’s a pretty strong sign that an author doesn’t have any actual mental model for what’s happening with AI, because it’s clearly a ridiculous idea as soon as you actually think about it. Nearly the only coherent stories where things are about to slow back down for humans are the ones where we’re dead. Even if AI capability increases stopped today, we would still have quite a few years of rapid change ahead of us. And if someone is bundling in an expectation of capabilities leveling out, they sure need to justify that. But they don’t justify it, because they haven’t actually thought anything through, they’re just saying words.
Having a wrong mental model is not the same as not having a mental model at all. I agree that expecting the capabilities to level out soon is unjustified, but it’s probably what most people believe.
This is a lazy but natural generalization from past experience: there are no flying cars. Light bulbs are everywhere, but they don’t grow exponentially to the point where they would already burn entire cities. All white-collar jobs require computers, but you still need plumbers to fix broken pipes.
Why should this new shiny toy be any different? Priors say the hype is unjustified.
Sure, we know better, but most people do not think on that level. They do not see that some things generalize in ways that most things don’t. They do not see that a better mousetrap only replaces the older mousetrap, but e.g. a computer can replace a typewriter and calculator and television and phone and many other things, to the degree that some people already use computers for most things they do. And that artificial intelligence will be even more like this for the intellectual tasks, and even more when it also gets robotic bodies, and that it could take humans out of the loop entirely.
The outside view heuristic fails when it encounters something that happens to be truly exceptional.
My 6 years as a trader / active investor
The Dilbert Afterlife by Scott Alexander, Jan 16, 2026:
The EMH Aten’t Dead by Richard Meadows, May 15, 2020:
Yesterday was the 6-year anniversary of my entry into the “beautiful” trade referenced above. On 2/10/2020 I cashed out ~10% of my investment portfolio and put it into S&P 500 April puts, a little more than a week before markets started crashing from COVID-19. The position first lost ~40% due to the market continuing to go up during that week, then went up to a peak of 30-50x (going by memory) before going to 0, with a final return of ~10x (due to partial exits along the way). After that, I dove into the markets and essentially traded full time for a couple of years, then ramped down my time/effort when the markets became seemingly more efficient over time (perhaps due to COVID stimulus money being lost / used up by retail traders), and as my portfolio outgrew smaller opportunities. (In other words, it became too hard to buy or sell enough stock/options in smaller companies without affecting its price. It seems underappreciated or not much talked about how much harder outperforming the market becomes as one’s assets under management grows. Also this was almost entirely equities and options. I stayed away from trading bonds, crypto, or forex.)
Starting with no experience in active trading/investing (I was previously 100% in index funds), my portfolio has returned a total of ~9x over these 6 years. (So ~4.5x or ~350% after the initial doubling, vs 127% for S&P 500. Also this is a very rough estimate since my trades were scattered over many accounts and it’s hard to back out the effects of other incomes and expenses, e.g. taxes.)
Of course without providing or analyzing the trade log (to show how much risk I was taking) it’s still hard to rule out luck. And if it was skill I’m not sure how to explain it, except to say that I was doing a lot of trial and error (looking for apparent mispricings around various markets, trying various strategies, scaling up or down strategies based on what seemed to work), guided by intuition and some theoretical understanding of finance and markets. If I’m doing something that can’t be easily replicated by any equally smart person, I’m not sure what it is.
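For anyone who wants the headline figures annualized, a quick sketch using the approximate numbers quoted above (so the outputs are only as rough as those inputs):

```python
# Annualize the rough figures quoted above: ~9x total over 6 years for the portfolio,
# ~2.27x (a 127% gain) for the S&P 500 over the same period.

def cagr(total_multiple: float, years: float) -> float:
    """Compound annual growth rate implied by a total return multiple."""
    return total_multiple ** (1 / years) - 1

print(f"portfolio: {cagr(9.0, 6):.0%} per year")   # roughly 44%/yr
print(f"S&P 500:  {cagr(2.27, 6):.0%} per year")   # roughly 15%/yr
```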
Collection of my investing-related LW posts, which recorded some of this journey:
Look for the Next Tech Gold Rush?
Buying COVID puts
Tips/tricks/notes on optimizing investments
Anti-EMH Evidence (and a plea for help)
How to bet against civilizational adequacy?
Other near misses
Maybe interesting to note my other near misses (aside from capturing only a fraction of the 30-50x from the COVID puts): billions of $ from mining Bitcoin if I started when it was first released, 350x from investing in Anthropic which I turned down due to moral qualms. Also could have sold my weidai.com domain name for $500k, a >1000x return, at the peak of its valuation (which turned out to be a bubble because the Chinese online loan sector that bid for the name went bust).
The explanation here seems to be that in retrospect my intellectual interests were highly correlated with extremely high return investment opportunities, and I had enough awareness/agency to capture some (but only some) of these opportunities. But how to explain this, when most people with intellectual interests seem to lack one or both of these features?
Why am I writing about this?
Partly because I’m not sure what lessons/conclusions I should draw from these experiences.
Partly to establish a public record. If nobody does this (I think at least a couple of other people in the rationalist community may have achieved comparable returns (ex-crypto) but aren’t talking about them for privacy reasons), it gives people a misleading view of the world.
As Scott Alexander’s post suggests, achieving success in multiple domains is rare, and people, including me, presumably attempt it in part so they can show off if they do achieve it.
If most people in the US had a bank account that featured monthly payments anywhere close to the “interest rate,” the government could reduce risky retail investments with little delay by raising rates. This is not the case. Assuming even highly bounded rationality, it seems like retail traders should still not be losing as much money as they do, so maybe I’m making a modeling mistake and it would turn out that people really dislike bank accounts. This may be a typical mind fallacy problem, but I have some evidence that’s not the case. Either way, it seems like when you distribute stimulus to consumers[1] and they make high variance (or downright stupid) moves, large portions of the stimulus money will end up doing something similar to an unbalanced version of Japan-like QE[2] after being taken from retail traders by automated bots. The central bank could try to correct the balance away from equities by increasing interest rates. Maybe the models are better today than they were 20 years ago, but retail-stupidity-rate seems hard to estimate in advance. It seems like you might get a QE-like effect at the wrong time.
I wonder if that had something to do with why there appeared to be such a large deviation from the EMH even after your first year of active trading. It seems fuzzy in my mind how the mechanics would work though, and I generally wonder why the trading bots didn’t do better against you. Was their working capital locked up elsewhere?
[1] https://en.wikipedia.org/wiki/Helicopter_money
[2] https://en.wikipedia.org/wiki/Quantitative_easing
[3] A more full analysis would involve some sort of model of PPP fraud, but I’m not sure how easy that would be.
As an aside, AI systems that are persuasive but otherwise not especially competent could have a major influence on what silly investments (or “investments”) people make. I wouldn’t even know where to start if I were presented with a lump-sum UBI proposal in a few years because of things like this. If human consumers can’t hang on to the majority of the sum for long enough, certain interest rates may start hitting legal limits, causing terrible distortion. Note that this runs through the QE-like-effect argument from before, and assumes the government is too slow/ineffective and can’t confiscate everything it can and (sometimes physically) burn everything it can’t. Confiscate-and-redistribute isn’t QE, and destruction of the economy can limit intelligence-explosion-like upward effects on interest rates.
Actual returns (so far) closer to 100x due to dilution. (Company has ~500x’d in valuation, but value of Series A shares has ~100x’d.)
How much do you think the skill you used is the basic superforecaster skill? Did you do Metaculus or GJOpen, and do you think people with similar forecasting skills are likely also going to be good at investing, or do you think you had different skills that go beyond that?
It seems like a good question, but unfortunately I have no familiarity with superforecasting, having never learned about it or participated in anything related except by reading some superficial descriptions of what it is.
Until Feb 2020 I had little interest in making empirical forecasts, since I didn’t see it as part of my intellectual interests, and I believed in the EMH or didn’t think it would be worth my time/effort to try to beat the market, so I just left such forecasting to others and deferred to other people who seemed to have good epistemics.
If I had to guess based on my shallow understanding of superforecasting, I would say that while there are probably overlapping skills, there’s a strategic component to trading, which involves things like deciding which sectors to allocate attention to, how to spot the best opportunities and allocate capital to them without taking too much concentrated risk, and explore-vs-exploit type decisions, which are not part of superforecasting.
Do you regret not investing in Anthropic? I don’t know how much the investment was for, but it seems like you could do a lot of good with 350x that amount. Is there a return level you would have been willing to invest in it for (assuming the return level was inevitable; you would not be causing the company to increase by 1000x)?
I don’t regret it, and part of the reason is that I find it hard to find people/opportunities to direct resources to that I can be confident won’t end up doing more harm than good. Reasons:
Meta: Well-meaning people often end up making things worse. See Anthropic (and many other examples), and this post.
Object-level: It’s really hard to find people who share enough of my views that I can trust their strategy / decision making. For example when MIRI was trying to build FAI I thought they should be pushing for AI pause/stop, and now that they’re pushing for AI pause/stop, I worry they’re focusing too much on AI misalignment (to the exclusion of other similarly concerning AI-related risks) as well as being too confident in misalignment. I think this could cause a backlash in the unlikely (but not vanishingly so) worlds where AI alignment turns out to be relatively easy but we still need to solve other AI x-risks.
(When I did try to direct resources to others in the past, I often regretted it later. I think the overall effect is unclear or even net negative. Seems like it would have to be at least “clearly net positive” to justify investing in Anthropic as “earning-to-give”.)
If you’re still interested in trading (although maybe you’re not so much, given the possibility of the impending singularity), maybe you should try Polymarket; the returns there can be pretty good for smart people even if they have a lot of money. I 5x-ed in 2.5 years starting with four figures, but other people have done much better starting with a similar amount (up to 6 or 7 figures in a similar time frame), and my impression is that 2x for people with 7 figures should be achievable.
Ex post or ex ante, do you feel like this was ultimately a good use of your time starting in mid-2020? (I might have asked you this already.)
I think yes, given the following benefits, with the main costs being opportunity cost and the risk of losing a bunch of money in an irrational way (e.g. being unable to quit if I turned out to be a bad trader). Am I missing anything, or did you have something in mind when asking this?
physical and psychic benefits of having greater wealth/security
social benefits (within my immediate family who know about it, and now among LW)
calibration about how much to trust my own judgment on various things
it’s a relatively enjoyable activity (comparable to playing computer games, which ironically I can’t seem to find the motivation to play anymore)
some small chance of eventually turning the money into a fraction of the lightcone
evidence about whether I’m in a simulation
some marginal increase in credibility for my ideas
I was thinking mostly along the lines of: it sounds like you made money, but not nearly as much money as you could have made if you had instead invested in or participated more directly in DL scaling (even excluding the Anthropic opportunity), when you didn’t particularly need any money and you don’t mention any major life improvements from it beyond the nebulous (and often purely positional/zero-sum); and in the meantime, you made little progress on past issues of importance to you like decision theory, while not contributing to DL discourse or to more exotic opportunities which were available 2020-2025 (like, e.g., instilling particular decision theories into LLMs by writing online during their most malleable years).
Thanks for clarifying! I was pretty curious where you were coming from.
Seems like these would all have similar ethical issues as investing in Anthropic, given that I’m pessimistic about AI safety and want to see an AI pause/stop.
To be a bit more concrete, the additional wealth allowed us to escape the political dysfunction of our previous locality and move halfway across the country (to a nicer house/location/school) with almost no stress, and allows us not to worry about e.g. Trump craziness affecting us much personally since we can similarly buy our way out of most kinds of trouble (given some amount of warning).
These are part of my moral parliament or provisional values. Do you think they shouldn’t be? (Or what is the relevance of pointing this out?)
By 2020 I had already moved away from decision theory, and my new area of interest (metaphilosophy) doesn’t have an apparent line of attack, so I mostly just kept it in the back of my mind as I did other things and waited for new insights to pop up. I don’t remember how I was spending my time before 2020, but looking at my LW post history, it looks like mostly worrying about wokeness, trying to find holes in Paul Christiano’s IDA, and engaging with AI safety research in general, none of which looks super high value in retrospect.
More generally I often give up or move away from previous interests (crypto and programming being other examples) and this seems to work for me.
I would not endorse doing this.
Maybe to rule out luck you could give an estimate of your average beta and Sharpe ratio, which mostly only depend on your returns over time. Also, are you planning to keep actively trading part-time?
This seemed like a good idea that I spent some time looking into, but ran into a roadblock. My plan was to download all the monthly statements of my accounts (I verified that they’re still available, but total more than 1000 so would require some AI assistance/coding just to download/process), build a dataset of the monthly balances, then produce the final stats from the monthly total balances. But when I picked two consecutive monthly statements of a main account to look at, the account value decreased 20% from one month to the next and neither I nor the two AIs I asked (Gemini 3.0 Pro and Perplexity w/ GPT 5.2) could figure out why by looking at the 70+ page statement.[1] Eventually Gemini hallucinated an outgoing transfer as the explanation, and Perplexity claimed that it can’t give an answer because it doesn’t have a full list of positions (which is clearly in the statement that I uploaded to it). Maybe I’ll try to investigate some more in the future, but at this point it’s looking like more hassle than it’s worth.
I had a large position of SPX options, in part for box spread financing, and their values are often misreported when I look at them in my accounts online. But this doesn’t seem to explain the missing 20% in this case.
I was also redeeming SPACs for their cash value, which would cause the position and associated value to disappear from the account for like a week before coming back, which would require AI assistance to compensate for if I went through with the plan. But this doesn’t seem to explain the missing 20% for this month either.
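For what it’s worth, once a clean series of monthly total balances exists, computing the headline stats is the easy part. Here’s a minimal sketch (mine, not the actual pipeline described above); the file names, column names, and flat risk-free rate are assumptions, and it ignores intra-month deposits/withdrawals, which real statements would need flow-adjusted returns to handle:

```python
# Minimal sketch: beta and Sharpe from monthly total balances.
# Assumptions (not from the thread): CSV files and column names below are
# hypothetical, returns are simple balance-to-balance changes with no
# adjustment for deposits/withdrawals, and the risk-free rate is a flat 3%/yr.
import numpy as np
import pandas as pd

bal = pd.read_csv("monthly_balances.csv", parse_dates=["month"], index_col="month")
spx = pd.read_csv("spx_monthly.csv", parse_dates=["month"], index_col="month")

port_ret = bal["total_balance"].sort_index().pct_change().dropna()
mkt_ret = spx["close"].sort_index().pct_change().dropna()
port_ret, mkt_ret = port_ret.align(mkt_ret, join="inner")  # keep overlapping months

rf_monthly = 0.03 / 12  # assumed flat risk-free rate

beta = port_ret.cov(mkt_ret) / mkt_ret.var()         # sensitivity to the benchmark
excess = port_ret - rf_monthly
sharpe = excess.mean() / excess.std() * np.sqrt(12)  # annualized Sharpe

print(f"beta ~ {beta:.2f}, Sharpe ~ {sharpe:.2f}")
```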
Many people hold up ‘AI As Normal Technology’ as a reasonable “normal-people” or “economics” case against the doomer position. I actually think it’s wrong in a number of ways and falls flat on its own terms. I think I believe this for reasons mostly orthogonal to being a doomer (except inasmuch as being a doomer makes me more interested in thinking about AI). If anybody here is interested in fighting the good fight, it might be valuable to do an Andy Masley-style annihilation of the AI As Normal Technology position, sticking to minimally controversial arguments and demolishing their case with obvious empirical and logical points. I suspect it won’t be very hard. E.g., here are a few obvious reasons it fails:
Their central empirical mechanism is already wrong: their story is that AI diffusion will be slow because this is the path of previous technologies like electricity, but consumer and developer adoption of LLMs has been faster than essentially any technology in history.
They completely ignore that AI will obviously do a ton to assist in its own diffusion: Even if I take their arguments that diffusion is what matters and I rule out software-only singularity by fiat, I still don’t think I or anybody else should buy their causal mechanisms. Like the single most obvious way in which AI diffusion might be distinct from previous technological changes is afaict unaccounted for in their arguments, even if I presume a diffusion-first model.
The reference class is unargued and load-bearing: The whole thesis rests on AI being like electricity or the internet (decades of diffusion) rather than like smartphones, SaaS, or cloud (years).
They have no framework that can engage software-only-singularity-style arguments. Their entire ontology is built around physical-world deployment friction. This practically assumes the conclusion!
The position is self-undermining for their vibes if you take it literally. 1) If AI really is like electricity or the industrial revolution, then taken seriously they’re predicting one of the largest economic transformations in human history. 2) Notably they’re predicting this at current levels of AI capabilities. I.e., if AI progress froze today they’d still predict Anthropic’s revenue to grow massively beyond the current 30B ARR. That is a very big deal!
They confuse benchmark-impact gaps with deployment friction (!), when the simpler explanation is benchmark Goodharting and jagged-frontier effects. They believe that the reason models perform well on benchmarks but haven’t had much economic impact yet (though, again, note that this has already caused some of the largest and fastest-growing companies in history to arise, including in revenue) is due to diffusion dynamics. But obviously the simpler argument is that benchmarks overstate actual AI capability relative to humans.
I don’t think they actually misunderstand this point. The same people who wrote “AI as Normal Technology” wrote “AI As Snake Oil” earlier, seemingly happy to understand the “AI capabilities lag benchmarks” position back when it benefitted their arguments.
Overall I think it’s a deeply unserious form of futurism, only held up by Serious Policy People who want to believe in a pre-determined comfortable conclusion.
Should be fun to take down for any of my friends who are bored undergraduates or graduate students interested in destroying bad arguments. Could be an easy way to get a bunch of views on a moderately important topic.
This part seems like a strawman. My understanding is that the “AI as Normal Technology” view is that AI is like electricity or the internet or the smartphone: likely the most important technology of the decade (or maybe the last few decades), but something that should be managed in a way pretty similar to prior technologies. Like, yes, they think AI is important, but that the right approach is more normal. Like, maybe they’d think it’s a top-5 most important technology of the last 100 years.
I don’t see why thinking AI is this important, but not crazier than this, is (in and of itself) self-undermining. I do think it’s notable that the “AI as Normal Technology” position treats AI as extremely important (e.g. significantly more important than US policymakers or random people tend to think) and that the main advocates for this view don’t strongly emphasize this, but this isn’t necessarily a problem with the view itself.
This comment generally felt a bit strawman-y to me and several points seemed off the mark, though I do ultimately think that “AI as Normal Technology” is a very bad prediction backed by poor argumentation. And I tend to think their argumentation isn’t getting at their real crux (which is more like “AI is unlikely to reach true parity with human experts in the key domains within the next 10 years” and “we shouldn’t focus on the possibility this might happen even though we also don’t think there is strong evidence against it”).
I agree my complaints may be more strawmanny than ideal. I read only a subset of their posts, and fairly quickly, so it’s definitely possible I misunderstood some key arguments; though my current guess is that if I read more carefully I would not end up concluding that they’re overall more reasonable than my comment implies (my median expectation is that they’d go down in my estimation).
I recently read Bad Blood and Original Sin. Bad Blood is about the downfall of Theranos and Elizabeth Holmes and the fraud that she committed. Original Sin is about the uncovering of Biden’s mental degradation and the lead up to his decision to drop out of the presidential race.
I liked both books quite a bit, and I learned more from them than from most things that I read. I particularly enjoyed reading them one after the other because I thought that, despite addressing in many superficial ways very disparate situations, they had a lot of interesting commonalities that seem like they say something important about human nature and epistemics (including reinforcing a bunch of lessons I feel like I learned from the experience with FTX).
A few of those commonalities:
Both deal with the concealment of extremely valuable and in some sense difficult-to-conceal information for an extended period of time.
Both feature very driven and intelligent people who commit profound acts of self-deception that were very damaging for them, the people around them, and to some extent the world. Both “got in too deep” such that there was probably no escape that seemed good by their lights by the end.
I think this is an underrated point: the extent to which self-delusion and selfishness can bring people to take actions that are clearly incredibly destructive and irrational from their own perspective. Especially with Holmes, who must have been aware of the profound shortcomings of her technology even as she pushed forward with deploying it, it seems pretty clear that she was “walking dead” for quite a while. Towards the end, she was in a bad situation with no real chance of success, yet she continued to double down and dig herself deeper.
There’s something especially depressing about this. Not only can smart, well resourced people be selfish and sadistic, they can do horrible, destructive things to prolong a situation that isn’t even good from their own perspective, perhaps because the short-term pain of coming clean is too aversive even if it’d reduce their long term suffering.
Arguably, both “fly too close to the sun” and miss out on nearby worlds where they left much better legacies (if Biden had stopped at the end of his first term, if Elizabeth had settled for being a smart, driven and very charismatic person who probably could have risen to a relatively high level in a legitimate business).
This doesn’t seem like an unfortunate accident. Both got to positions of prominence by seeking and holding on to power.
Both additionally deal with a collection of accomplished, well-resourced people who are adjacent to the deception and have a lot to lose from it, and yet fail to act quickly and effectively to reduce the damage.
A few lessons I took from them (not new, but I thought they were particularly poignant examples):
Having generically smart, competent people adjacent to a situation absolutely does not guarantee that even really obvious and important things will get done with any reliability or quality. It’s easy for things to fall between the cracks, for something to be aversive and get deprioritized by busy people. It’s easy for something to not be anyone’s job. It’s easy for someone to be smart and successful in many domains, but completely miscalibrated, ignorant, or catastrophically distracted in others.
Intimidation and harassment of potential whistleblowers is often effective for extended periods of time. Extreme insularity and litigiousness is at least a yellow flag.
From Original Sin in particular:
A huge disanalogy between the two books is that in Original Sin, a big factor was prominent Democrats thinking that they needed to back Biden despite their misgivings to increase the chances that he beat Trump. While I think this is ethically fraught and backfired in the end, it’s easy to see how they arrived at that position, and the reasoning that got them there seems potentially compelling. This isn’t as much the case with Bad Blood.
The book seems like it doesn’t really want to call out prominent Democrats other than Biden’s inner circle as having fucked up. But assessing the suitability of the presidential candidate really seems like one of the most important roles and responsibilities of senior party officials, so I feel pretty bad about their performance there.
Personal connections seem like they were pretty destructive. A lot of people seemed very reluctant to damage the prospects of someone they considered a friend, even when they thought the fate of American democracy could potentially be at stake, even when they thought it could mean the difference between a competent and incompetent president.
Biden is depicted as surrounding himself largely with sycophants who came to rely on him as their own personal source of power and who hid poll numbers for him; as a consequence, he ended up with really miscalibrated takes on his probability of winning the election. I think I hadn’t fully processed the degree to which surrounding yourself with people who rely on you for power might push you to take risks to retain that power that you, if you were clear-headed, might think better of.
And from Bad Blood in particular:
Important companies can leverage and deceive generically smart and prestigious board members who are highly distracted and aren’t subject matter experts. In Bad Blood, Holmes manages to deceive and maintain control over a board of extremely generically competent and powerful people (Henry Kissinger, former Secretary of State George Shultz, former Secretary of Defense William Perry, General James Mattis (later Secretary of Defense), Rupert Murdoch), to their substantial reputational (and in some cases, financial) detriment. In a poignant episode, George Shultz sides with Holmes against his own grandson, a former Theranos employee-turned-whistleblower being harassed by the company, dividing their family.
It seems like, as non-SMEs in healthcare and lab testing, they were slow to notice warning signs about misconduct, and very bad at doing due diligence on what was going on. I worry a lot about non-technical board members of AI companies being similarly manipulated and low-context on what their companies are doing.
It seems like aggressive lawyers kind of shoot themselves in the foot when they go after high-profile investigative journalists. It seems predictable that this would trigger the journalists to double down rather than scaring them off or reassuring them that there’s nothing to be concerned about. I don’t know what’s going on here; maybe they can’t code-switch / they just get stuck on a strategy of being aggressive all the time.
I’m surprised no one is discussing Meta’s new model at all: https://ai.meta.com/blog/introducing-muse-spark-msl/
This part seems good:
And this seems... less good:
I’m pleasantly surprised that they decided Safety should be one of the four sections in the announcement post, and that they call out the eval awareness.
Disclaimer: I work at Meta, but not in this department and I obviously don’t speak for the company.
Does anyone know what they’re referring to by visual chain-of-thought here? The first paper that comes up when searching for visual chain-of-thought is Qin et al., which says: “We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues.” Something like this seems like it would be somewhat concerning for CoT monitoring, though I should mention that this paper isn’t written by Meta and I haven’t read it to properly assess how concerned I would be about this.
I thought that I’d had enough of xAI being likely 3 months behind the frontier, and now we get this… I tried to find out anything about Meta’s model and had Claude Opus 4.6 conclude that Meta’s model is also 3-4 months behind. There’s also the issue of Meta having manipulated some benchmarks to present Llama 4 as more capable, and of Meta’s claimed performance on ARC-AGI-2 and SWE-bench Verified, where rivals’ models allegedly show different results than on the real leaderboards, likely because of a different method of elicitation. How do I lobby for a law change requiring EVERY new American model to be thoroughly evaluated by the entire Big Three?
Reading the Mythos model card: the level of confidence with which lower incidence on misalignment benchmarks is read as “more alignment” feels very under-calibrated.
For any reduction in measured misalignment on these benchmarks, two competing hypotheses are potentially true:
1. Alignment is better
2. The model is better at hiding misalignment
Given that capability increase should increase alignment risk, observing alignment benchmark improvement with capability improvement and confidently concluding 1 instead of 2 doesn’t seem logical to me. Can anyone point out what I’m missing?
I could get behind 1 if, with each model generation, a new set of alignment benchmarks not mechanically linked to the previous generation’s set were introduced and backtested against older model families, but this doesn’t seem to be the regime.
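To make the complaint concrete, here’s a tiny odds-form sketch (my own framing, nothing from the model card, and the numbers are purely illustrative assumptions): if lower misalignment scores are roughly what you’d expect under either hypothesis, the observation barely shifts the posterior between them.

```python
# Illustrative only: how much does "misalignment benchmark scores went down"
# favor hypothesis 1 (alignment is better) over hypothesis 2 (the model is
# better at hiding misalignment)? All numbers are assumptions for illustration.
prior_odds = 1.0          # assumed 1:1 prior between H1 and H2

p_obs_given_h1 = 0.9      # P(lower scores | genuinely more aligned)
p_obs_given_h2 = 0.8      # P(lower scores | better at hiding misalignment)

likelihood_ratio = p_obs_given_h1 / p_obs_given_h2
posterior_odds = prior_odds * likelihood_ratio
print(posterior_odds)     # ~1.1 : 1, far from a confident "more aligned"
```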
Doesn’t this basically sum up the course of the rationalist project?
I know we’re more about AI doom at this point, but if anyone is still interested in raising the sanity waterline, it seems like it’s important to make it common knowledge that teaching about biases isn’t enough to make people sane.
From my spot outside of the loop, it seems like that’s about where the project stopped.
What’s the cutting edge on raising the sanity waterline? Are people still working on this project? What might I have missed?
Trying to distill why strategy-stealing doesn’t work even for consequentialists:
Consider a game between A and B, where at most 1 player can win and:
U_A(A wins)=3, U_A(B wins)=2, U_A(both lose)=0
U_B(A wins)=0, U_B(B wins)=3, U_B(both lose)=0
At time 1, A has a button that if pressed, ends the game and gives 40% chance of both players losing and 60% of A winning. A can press, pass, or surrender (giving B the win). At time 2, the button passes to B, who has the same options with “press” giving 60% chance of winning to B. At time 3 if both passed, they each have 50% chance of winning.
Solving this backwards: at time 2, B should press, because that gives U = 0.6×3 = 1.8 vs 0.5×3 = 1.5 for passing; so at time 1, A should surrender, because U_A(press) = 0.6×3 = 1.8, U_A(pass) = U_A(B presses) = 0.6×2 = 1.2, and U_A(surrender) = 2.
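Here’s a minimal sketch (not part of the original comment) that just mechanizes the backward induction above, using the same payoffs and probabilities:

```python
# Toy button game from above: solve by backward induction.
U_A = {"A wins": 3, "B wins": 2, "both lose": 0}
U_B = {"A wins": 0, "B wins": 3, "both lose": 0}

def ev(utils, lottery):
    """Expected utility of a lottery over outcomes."""
    return sum(p * utils[o] for o, p in lottery.items())

time3 = {"A wins": 0.5, "B wins": 0.5}  # if both passed, coin flip at time 3

# Time 2: B can press (60% B wins, 40% both lose), pass, or surrender.
b_options = {
    "press": {"B wins": 0.6, "both lose": 0.4},
    "pass": time3,
    "surrender": {"A wins": 1.0},
}
b_best = max(b_options, key=lambda a: ev(U_B, b_options[a]))  # -> "press"

# Time 1: A can press, pass (then B plays its best response), or surrender.
a_options = {
    "press": {"A wins": 0.6, "both lose": 0.4},
    "pass": b_options[b_best],
    "surrender": {"B wins": 1.0},
}
a_best = max(a_options, key=lambda a: ev(U_A, a_options[a]))  # -> "surrender"

print(b_best, a_best)  # press surrender
```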
In terms of theory, this can be explained by this game violating the unit-sum (mathematically equivalent to zero-sum) assumption of strategy-stealing. It confuses me that strategy-stealing has significant mindshare among AI safety people, e.g. @ryan_greenblatt here, despite the world in general, and technological races in particular, obviously not being zero-sum. See also my failure to “steal” the strategy of investing in AI companies.
My view is something like “if you ~100% solved alignment, then the situation is mostly unit-sum from the perspective of longtermists because they care mostly about long-run resources and this is mostly unit-sum with a few notable exceptions (e.g. vacuum decay)”. Do you disagree with this claim? I certainly agree that not having solved alignment means you can’t effectively strategy steal, and other things can go wrong with strategy stealing, especially if you aren’t maximizing expected long-run resources. (In general, you in principle may also need to take very aggressive and undesirable actions to defend yourself as part of strategy stealing, like staying in a biobunker while limiting any memetic exposure to the outside world.)
My impression since reading Robin Hanson’s Burning the Cosmic Commons is that space colonization is closer to a tragedy of the commons situation than unit-sum (as you can kind of infer from the title).
Also there’s always the possibility of large-scale wars that destroy or degrade significant portions of the cosmic endowment. Even if war never happens, the mere possibility implies that the game isn’t unit-sum, and the more altruistic side is unable to “steal” certain strategies of the other side, like threatening mutual destruction as a bargaining tactic.
Also Black-Hole Negentropy, where value scales superlinearly with resources (mass/energy).
My current best guess is that this seems possible but pretty unlikely. And that this type of negotiation seems particularly easy given the distribution of values I expect for the actors negotiating (e.g., strongly locust-like values aren’t that likely).
Why isn’t it likely, given that you can “burn” more resources in order to grab a larger share of the lightcone? If you’re saying that the outcome of burning the cosmic commons isn’t likely because everyone will negotiate to avoid it, I’m saying that the game structure itself isn’t zero-sum, which is needed to show that strategy-stealing applies in theory.
I do not know of a result, or have the intuition, that if negotiation is “easy” then strategy-stealing (approximately) applies. My intuition is that even in this case (like in my toy game) some parties can credibly threaten to burn down the world (or to risk this), and others can’t, and this gives the former a big advantage that the latter can’t copy. Negotiation is “easy” in my game too (note that the outcome is pareto optimal, and no risky action is actually taken), but the more cautious or altruistic party is disadvantaged.
I don’t currently think you can burn more resources to grab a larger fraction of the lightcone. Or like, I think the no-negotiation equilibrium burns a small fraction of resources. I don’t feel super confident in this view, but that was my understanding of our current best guess. I haven’t looked into this seriously because it didn’t seem like a crux for anything. Maybe I’m totally wrong!
My cached view is something like “you can send out an absurd number of probes at ~maximal speed given very small fractions of resources, so burning resources more aggressively doesn’t help”.
The following LLM output matches my own understanding:
Ryan’s crux is his “cached view” that you can send probes at nearly maximal speed using very small fractions of resources, so burning extra resources doesn’t help. This violates the physics of relativistic travel.
Because of relativity, kinetic energy scales non-linearly as you approach the speed of light (c). The energy required to accelerate an object approaches infinity as its speed approaches c.
If Actor A wants to beat Actor B to an uncolonized star system, and Actor B launches a probe at some fraction of c, Actor A must launch at a higher fraction to get there first.
Upgrading a probe’s speed by each additional “9” of c (e.g. from 0.9c to 0.99c to 0.999c) requires exponentially more energy for the same payload mass.
Furthermore, if you want your probe to actually do something when it arrives (like decelerate, build infrastructure, and defend itself), it needs mass. To decelerate without relying entirely on ambient interstellar medium, you have to carry fuel for the deceleration phase, which exponentially increases the launch mass required (the Tsiolkovsky rocket equation).
Therefore, Robin Hanson’s “Burning the Cosmic Commons” scenario is physically accurate. In an uncoordinated race for the universe, colonizers must convert almost all available local mass/energy into propulsion to outpace competitors. Securing a larger share of the lightcone absolutely requires burning vastly more resources.
The LLM output doesn’t seem nearly quantitative enough. With some numbers of 9s, it surely doesn’t give you a meaningful advantage to go at 0.99...99c rather than merely 0.99...9c — especially when you factor in that it probably takes time to convert energy/mass into the additional speed (most mass will be in between your origin and the farthest reaches of the universe, and by the time some payloads have decelerated and started harvesting significant energy from the middle mass, the frontier of the colonization wave will likely already be quite distant). I share Ryan’s guess that you can get close enough to optimum without burning a large fraction of all energy in the universe. (That’s a lot of energy!)
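For reference, the standard relativistic kinetic-energy formula KE = (γ − 1)mc² says each additional “9” of light-speed multiplies the energy per unit payload mass by only about 3x, while the remaining gap to c shrinks 10x. A quick sketch (mine, illustrative only):

```python
# Kinetic energy per unit payload mass, in units of m*c^2, at successive "9s" of c.
# Uses only the standard special-relativity formula KE = (gamma - 1) * m * c^2.
def gamma(v_frac_of_c):
    return 1.0 / (1.0 - v_frac_of_c**2) ** 0.5

for nines in range(1, 7):
    v = 1.0 - 10.0 ** (-nines)        # 0.9, 0.99, 0.999, ...
    ke = gamma(v) - 1.0               # KE / (m c^2)
    print(f"v = {v:.6f}c   KE/mc^2 = {ke:8.1f}")
# Each extra 9 costs roughly 3.2x more energy, while the speed gain shrinks 10x.
```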
I think you’re right that wasn’t really conclusive. Will try to address your arguments below.
This seems right but you can (probably) still gain a meaningful advantage by sending more colony ships (and war/escort ships) instead of pushing for more speed.
Are you assuming either that it’s possible to launch colony ships directly across the universe, or that it takes millions/billions of years to fully harvest a star (e.g. using a Dyson sphere while the star burns naturally)? If instead there’s a distance beyond which it’s infeasible or uncompetitive to try to directly colonize, like 10x the average distance between neighboring galaxies, and also possible to quickly harvest a star using direct mass to energy conversion (e.g., via Hawking radiation of small black holes), then the colonies in the middle should have plenty of tempting new targets to try to colonize (before someone else does), at the edge of the feasible range?
I’ll describe a toy model to convey my intuitions here.
Setup
Two players each own 0.5 of Galaxy 1. They compete for Galaxy 2 by consuming their Galaxy 1 resources as colonization effort (c).
Payoff
Player A’s total utility is their retained Galaxy 1 plus their competitively won share of Galaxy 2. U = (0.5 - cA) + cA / (cA + cB).
Solution
To find the Nash Equilibrium, we maximize Player A’s utility by taking its derivative and setting it to zero. Because the game is symmetric, both players will invest equal effort (cA = cB). Solving this yields an equilibrium effort of c = 0.25.
Outcome
Both players sacrifice exactly half of their initial resources (0.25 out of 0.5). Because they invest equally, they split Galaxy 2 evenly (0.5 each). Their final score is 0.75 each.
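Here’s a small sketch (mine, not part of the comment) verifying the equilibrium symbolically; it just takes the first-order condition of the payoff above and imposes symmetry:

```python
# Symmetric Nash equilibrium of U_A = (0.5 - c_A) + c_A / (c_A + c_B).
import sympy as sp

cA, cB = sp.symbols("c_A c_B", positive=True)
U_A = (sp.Rational(1, 2) - cA) + cA / (cA + cB)

foc = sp.diff(U_A, cA)                             # first-order condition for A
effort = sp.solve(sp.Eq(foc.subs(cB, cA), 0), cA)  # impose symmetry c_B = c_A
print(effort)                                      # [1/4]

# Each player keeps 0.25 of Galaxy 1 and wins half of Galaxy 2: payoff 3/4.
print(U_A.subs({cA: sp.Rational(1, 4), cB: sp.Rational(1, 4)}))  # 3/4
```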
P.S., what do you think about my earlier points about war and black hole negentropy, which could end up being stronger (or easier to think about) arguments for my position?
FWIW, I find it useful to think about strategy stealing, and don’t think it has too much mindshare. Not really sure how productive it is to argue about that though, because “too much or too little mindshare” seems hard to settle.
Just to respond to this in particular: Some situations are close to being zero-sum, and when they’re not, I think it’s often useful to explicitly track the reason why they’re not zero-sum and how that changes the dynamics.
My impression of people invoking strategy stealing is not that they’re actually assuming it holds without argument, but instead interested in specific reasons to believe it fails in a given situation, and (if they agree those reasons are real) often interested in quantifying how significant those reasons are. Ryan’s linked comment seems like an example of this.
Paul’s linked article talks about lots of ways that strategy stealing can fail, many of which aren’t downstream of violating unit-sum. (By my count, only 2 of them are about that.)
You say “even for consequentialists”, but iirc, non-consequentialism only really features in point 11, so that’s just one more.
Just to clarify that you’re not distilling the whole post but just providing an example for 1-2 of the issues.
I agree that it’s weird how widely and uncritically the assumption is endorsed—in particular it’s often cited as if it were some kind of result or theorem, when even the original articulation is (though not enough, as it happened) hesitant!
Unfortunately my guess is the concrete articulation above is not especially catchy or illuminating. I suspect the more abstract gesture at constant-sum might be both more general and more catchy.
I’m starting to get a little worried. In 2022 I was part of initiating and shaping the whole dangerous capabilities evals agenda, and I emphasized repeatedly to practically everyone I met that the whole thing would be worthless or worse if fine-tuning didn’t become standard practice as part of eliciting model capabilities. Now it’s 2026, the models are way smarter, and hundreds of people are working on evaluating dangerous capabilities… and yet they still basically don’t do fine-tuning at all.
This really, really needs to change before it’s too late, which could be any year now. The models are getting increasingly situationally aware. The entire bundle of evals we do is basically just asking them nicely to show us how scary they are; the obvious flaw in this plan is that if they are sufficiently smart and sufficiently scary, they’ll just choose not to show us. Fine-tuning is a basic, obvious mitigation to this threat model that makes it harder for models to sandbag. It’s not a panacea, but we aren’t even doing this basic thing!
I think that fine-tuning evals are a good idea, and there has been some important related work done on this here. To quote from that post:
FWIW, I think I’m less sold on the importance of fine-tuning in these sorts of tests than you are. Could explain if you’re interested.
Yes please explain thanks!
Here’s a gdoc link to my views on this subject as of years ago. The text:
If a dangerous capability eval doesn’t involve fine-tuning the model to do the dangerous thing, the eval only establishes a lower bound on capability, not an upper bound or even an approximate upper bound. Why? Because:
For future powerful AGI, if it’s misaligned, there’s a good chance it’ll be able to guess that it’s being evaluated (situational awareness) and sandbag / pretend to be incapable. The “no sandbagging on checkable tasks” hypothesis depends on fine-tuning. https://www.alignmentforum.org/posts/h7QETH7GMk9HcMnHH/the-no-sandbagging-on-checkable-tasks-hypothesis Moreover we don’t know how to draw the line between systems capable of sandbagging and systems incapable of sandbagging.
Moreover, for pre-AGI systems such as current systems, that may or may not be incapable of sandbagging, there’s still the issue that prompting alone often doesn’t bring out their capabilities to the fullest extent. People are constantly coming up with better prompts that unlock new capabilities; there are tons of ‘aged like milk’ blog posts out there where someone says “look at how dumb GPT is, I prompted it to do X and it failed” and then someone else uses a better prompt and it works.
Moreover, the RLHF training that labs might do might conceal capabilities. For example we’ve successfully made it hard to get GPT-4 to tell people how to build bombs etc., but it totally has that capability, it’s just choosing not to exercise it because we’ve trained it to refuse. If someone finds a new ‘jailbreak’ tomorrow they might be able to get it to tell them how to make bombs again. (Currently GPT-4 explicitly refuses, which is great, but in a future regulatory regime you could imagine a lab training their system to ‘play dumb’ rather than refuse, so that they don’t trigger the DC evals.)
Moreover, even if none of the above were true, it would still be the case that your model being stolen or leaked is a possibility, and then there’ll be bad actors with access to the model. They may not have the compute, data, etc. to train their own model from scratch, but they quite likely have enough of the relevant resources to do a small fine-tuning run. So if you don’t fine-tune as part of your evals, you are in the dark about what such bad actors would be able to accomplish. Also, once we have more powerful models, if the model was misaligned and self-exfiltrated or otherwise gained control of its training process, it could do the fine-tuning itself...
Moreover, even if that doesn’t happen, the public and the government deserve to know what THE LAB is capable of. If a small amount of fine-tuning using known techniques could get the model to cause catastrophe, then the lab is capable of causing catastrophe.
Finally, even if you don’t buy any of the above, there’s a general pattern where GPT-N can do some task with a bit of fine-tuning, and GPT-N+M (for some small M possibly <1) can do it few-shot, and GPT-N+O (for some small O possibly <1) can do it zero-shot. So even if you don’t buy any of the previous reasons to make fine-tuning part of the DC evals, there is at least this: By doing fine-tuning we peer into the near future and see what the next generation or two of models might be capable of.
I agree with this, and therefore I agree with the bottom line that labs should have at least some evals in domains where the AIs are fine-tuned in the next 12 months.
But I think your other arguments were somewhat true in 2022 but are mostly not true anymore, and current models are generally well-elicited.
Do you have recent examples of this? My understanding is that for most single-forward-pass abilities this is mostly not true for recent instruction-tuned models (e.g. the classic “List sorting does not play well with few-shot” result is not true for recent models). For generation tasks it’s sometimes true (though my best guess is that it’s rare, especially if the outcome is clear), but mostly because the AI does not “know” what generation strategy it should use (if it doesn’t sandbag intentionally). (Once sandbagging is plausible, it’s plausible the AI sandbags on generation strategy though.) I think it’s not a slam dunk either, and there are some weird tasks like SusEval (which measures the ability to intentionally fail rarely) where there are strong reasons to suspect underelicitation of some models.
I think this used to be true but is basically not true anymore. Anecdotally, I know of several experiments where people fine-tuned a model a bit on trajectories of a stronger model and it usually did not improve performance in ways that generalized across tasks (including across SHADE-Arena tasks). There are some exceptions (e.g. when fine-tuning a non-reasoning model on reasoning model trajectories), but they are relatively sparse (and even that reasoning example looks to me more like teaching than elicitation).
Not relevant for stealth abilities, right? The reason one runs SHADE-Arena (especially without hidden scratchpad) is not to know if the lab / a malicious actor could do something bad. On capabilities that are more relevant for misuse like cyber or AI R&D automation, or capabilities shown in generic capabilities benchmarks that people use to know how powerful models are, I would guess most labs are already trying hard to elicit these capabilities. Are you worried about underelicitation there? (For CBRN, I think there are plausibly more low hanging fruits, though I would guess most gains would not come from fine-tuning but from scaffolding / helping people use AIs correctly—in the Sonnet 3.7 uplift trial the Anthropic employee group had a bigger uplift than the other participants.)
Yep I basically agree. Main argument still stands, the other ones are weakened to various degrees in 2026 compared to 2022, at least when applied to stealth abilities. Still have nonzero force though.
There was some discussion recently about the uptick in object-level politics posts and whether this is desirable or not. There’s no rule against discussing politics on LW, but there is a weak norm against it, and topical discussions have historically tended to be somewhat meta and circumspect.
I think the current situation is basically fine, and it’s normal for the amount of politics discussion to ebb and flow naturally as people are interested and issues become particularly salient. That said, here are a couple of potentially overlooked reasons in favor of more object-level politics discussion:
1. To build skill. Discussing politics productively is a skill that requires practice and atrophies without use. “Politics is the mindkiller” never meant that you should not discuss politics at all; it means that discussing politics is playing on hard mode. But sometimes playing on hard mode is the best way to level up. I suspect the skills needed to discuss politics productively overlap with lots of other important rationality skills.
2. To create common knowledge and avoid conflationary alliances. It can be confusing or disconcerting to not know what kind of common background assumptions the commentariat takes for granted when discussing something politics-adjacent. This is a problem somewhat unique to LW and not necessarily a bad thing; most other places on the internet skew too far in the opposite direction of a hivemind with a loud set of background assumptions, e.g. everyone putting emojis in their username to mark their alliances. But sometimes it is helpful to establish what does and doesn’t go without saying.
I think it might be cool if LessWrong had a well-developed set of norms for discussing political topics; in particular, if these norms were legible and mods made a point of enforcing them.
Politics posts should be tagged as such, and maybe all have a big warning at the top linking to a post outlining our expected norms and standards for discussing politics, and moderation thresholds. This is both a warning to those from other parts of the internet who don’t share our epistemic ideals, and a warning to LessWrongers who don’t want to wade into this stuff.
If we do this, we might also block new users from commenting under politics posts. One concern with debating politics is that it can attract new users who are interested in politics only, not in rationality in general. But if these new users are prevented from discussing politics, that could solve the problem.
“You need at least 100 karma to comment on posts tagged politics” seems like maybe a good rule.
Someone should send the idea to the LW team
I’m pretty sure that they all read LessWrong.
So, with all the above said as a preface, one object-level topic I’d be interested in seeing more discussion of is the current situation(s) in the Middle East. Some thoughts:
IMO the high bit for whether the war in Iran is broadly good is whether it is tactically successful and efficient in the short term.
Considerations like “this weakens the US position in a hypothetical hot war against China or Russia” or “this will (further) destabilize the Middle East in the long term” seem second-order to whether the war successfully neutralizes an organized, well-equipped, fanatical adversary for a long period of time, or fails to do that and also depletes a bunch of expensive / difficult-to-replace munitions stockpiles.
But how well or poorly things are going on this front is difficult to determine given fog of war and motivated reasoning + propaganda on all sides, and it’s also less interesting for pundits to discuss vs. strategic and geopolitical implications. Prediction markets seem to be doing OK here, but I would be interested in hearing more analysis from the LW commentariat.
On meta: I’d say the main reason I want some Middle East content on LessWrong is that there seem to be lots of relatively concrete facts about the world, fundamental to any model of the situation, that I don’t know, and I don’t even know what those facts are. I cannot even tell you much that is different between, say, Iraq and Iran, or Saudi Arabia, except that probably 2 out of 3 of those are allied against the third?
I think the most important effect of the war is that it makes Trump less popular/powerful domestically (even if a miracle happens and he gets some sort of deal). This is good because the less power he has (e.g., Republicans lose the Senate in the midterms), the more likely we are to navigate AI development in a sane way. I think if you put any nontrivial weight on short timelines (like, maybe 10%+ pre-Jan 2029), the AI considerations likely dominate everything else.
Can’t you play the same game in the other direction?
Trump is bad for the USA, therefore we should want him in power, since the big labs depend on a wealthy USA.
I don’t think we’re particularly on track to do anything non-derpy w.r.t. AI either way, but this way of reasoning seems like somewhat naive consequentialism. In general, it’s good for good things to happen even if they are accomplished by bad people, and predicting second-order consequences is really hard.
Also, there are a lot of bad people in power, but for AI to go well, a lot of good things need to happen to allow humanity space and time to flourish in peace. Toppling (or militarily crippling) a fanatical Shia Islamist regime would be an extremely good thing; a bad outcome (which looks somewhat likely at this point) would be if Europe and the third world broadly give in to extortion and pay Iran a toll to pass through the strait. That toll would fund terror all over the world, and would signal to other would-be dictators and future AIs alike that they can successfully take whatever they want through force and threats, and half the world will just roll over and take it.
I’m not convinced that it is necessary to topple it. Iran has been Shia for 5 centuries, and during those 5 centuries has attacked its neighbors very little.
Er, something pretty important happened in 1979. Also, the issue is not Shi’ism in general, “Shia Islamism” refers specifically to the flavor of political Islam instituted by the revolutionaries.
1979 can be seen as a return to traditional Iranian governmental policies after an experiment of a few decades with Western policies. Yes, the form of the current government (namely, a republic at least in principle) is more modern and Western than the form of the pre-1979 government (namely, a hereditary monarchy), but the monarch’s policies were more secular and Western than the current regime’s policies.
Islam has always been political in Iran, which has never had anything like the West’s separation between church and state except maybe to some extent during the experiment with Western policies that ended in 1979. We in the West shouldn’t attack countries just for being different from the West.
This seems far from certain from the perspective of anyone other than Israel. I mean, all else equal, definitely. But all else is definitely not equal. The most likely outcome even if this were to happen would be a huge increase in regional instability, which really doesn’t seem favorable to Europeans or most others in the surrounding area considering past examples.
But if they don’t pay the toll and support America in forcing it open, they are signaling that it’s okay for hegemonic powers to aggress and start wars in any way they deem fit. Europe has been forced into a lose-lose situation by the USA. It seems pretty clear that the only reason this is at all possible for Iran is that this war is seen by many in Europe and elsewhere as an unnecessary, illegal, and possibly harmful unilateral war of aggression.
Americans often seem blissfully unaware of how dangerous they appear to the rest of the world, and just take for granted that everyone considers them to always be the good guys, just doing good-guy things. America just took over Venezuela, now Iran, then Cuba, then Greenland and Canada. Is allowing all of this to be done unilaterally by a great power without any repercussions not a dangerous signal to send? It seems to be a much stronger signal than that of the Strait, and they are to a certain degree opposing signals.
The strait is closed because Iran is pointing missiles and drones at anyone who tries to sail through it, including people engaged in commerce that has nothing to do with the US or Israel. Any explanation of causality that doesn’t center on that fact denies the agency of the Iranian regime, and allowing your people to be threatened and extorted just signals that you’re an easy target for anyone who wants to extract something from you by force.
Yes, this might be how many Europeans see it, but that doesn’t make them correct. Iran has been building up conventional weapons and working towards nuclear weapons, lobbing missiles and IEDs at civilian populations in Israel through terrorist proxies, and funding crime and terror all over the world for many years. That doesn’t make the current war strategically wise or good, but calling it a “unilateral war of aggression” is simply wrong.
Again, I agree that many Europeans might see the US that way, but so what? That doesn’t make them correct or worth listening to. Committed and principled pacifism would be one thing, but public opinion is often more incoherent and self-serving than that. IMO a lot of Western discourse and public opinion on this kind of thing is better to tune out, because so many people no longer seem capable of acknowledging levels of moral right and wrong, with everything simplified and flattened to “any external or preemptive aggression is always bad”, or polarized through their view on U.S. domestic politics.
Venezuela, Iran, and Cuba are very different from Greenland and Canada (which Trump did not actually credibly / non-jokingly threaten to take by force). And militant Shia Islamism in particular is one of the most pernicious and totalizing ideologies on the planet, far worse than Russian-flavored oligarchy, Chinese communism, generic Third Worldism, or Trumpism, which are all generally bad in various ways, but don’t have the elements of religious fanaticism that make their adherents difficult to negotiate with on reasonable terms[1].
Also, more generally, my view is that public opinion is only valuable and worth listening to as a noisy proxy for democracy, which is itself good only insofar as it is a mechanism for protecting natural rights and legitimizing and limiting state power through the principle of consent of the governed. Trump and co. are certainly not doing well on this front, but neither is Europe lately.
OK yes, Trump is an unreasonable negotiating partner, and there’s a cult of personality around him that maybe rises to the level of religious fanaticism among his remaining true believers, but no one is lining up to be martyred for him in order to get into heaven. Trump himself is deeply flawed and amoral as a person, but Trumpism as an ideology is not that different or worse than many other flavors of conservative / right-wing politics.
Again, this is not what they are signaling if the reason they are willing to pay the toll is because they don’t agree with the war in the first place and don’t want to support America’s part in it. Either way they handle this, they are being extorted by one side or the other.
Sure, it is debatable. Regardless, I was still talking more about the signal being emitted rather than what is correct or not. Regarding framing it as a “unilateral war of aggression”: the war was clearly a unilateral decision, or bilateral if you want to count Israel as a separate party, which doesn’t really change the framing. And the USA is 7,000 miles away from Iran, so there’s pretty clearly no imminent threat. You need to squint pretty hard to see how this could be framed as anything other than the USA being the aggressor. I mean, why did they attack now? My understanding is that it’s because this is a time when Iran is particularly weak and vulnerable. It can be argued that that is the ‘right’ thing to do, but it would still be a war of aggression.
Overall, I just find the response of “What would the AIs think” in defense of America/Israel’s clear and consistent uni/bilateral behavior, at the disapproval of everyone else, a bit comical, as I see it as completely the opposite. If this were so necessary an act, they should have been able to discuss/agree to this, or some other solution, with their allies. That is at least how I would want the AIs to think.
Sam Kriss had a great recent essay making a similar point.
Maybe? It is hard to reason well about these things given my strong emotions towards the admin.
But I do think the current administration is uniquely terrible by American standards.[1] It attracts and gives power to incompetent sycophants with no moral boundaries.
There was something Eliezer said about Bernie Sanders that really resonated with me recently:
Having Trump as the president really just seems like it would be terrible for AGI governance because he is a terrible person. I’m sorry, I really don’t think there’s a more “precise” way to put it. Character matters. Trump doesn’t even pretend to be a kind person/is not under much pressure to appear to be nice.
(To be clear, I agree that, all else equal, it would be good for the Iranian regime to fail. Alas, all else would not be equal. While I think it would definitely be bad for your soul[2] to do things in the realm of “sabotage the American economy/military operation in order to make our president look bad,” I don’t think I’m obligated to stop my enemy when he is making a mistake either.)
Although even by global standards it’s quite bad.
i.e., you should not do this.
Re character: I think most Americans (including myself) have been so far removed from true corruption that we have forgotten how bad it can possibly get. Even my state of Illinois, which is notable for its historical machine politics and general corruption (4 of our last 11 governors serving time, plus many others like Mike Madigan), has still more or less seen forward progress, because the corruption wasn’t bad enough to completely erode politics in the state.
But it CAN get that bad. We’re seeing this now with the Trump admin. I am generally left-leaning, but at this point I think I’d take an honest Republican over a corrupt Democrat—a position I did not hold previously—because corruption eats policy and utterly erodes the foundation upon which we build fair markets and strong institutions.
The US seems to be in a rough spot. Polymarket thinks:
67% for US forces enter Iran by end of April
37% for Strait of Hormuz traffic returns to normal by end of May
34% for Iran leadership change in 2026
29% for Iran to no longer control Kharg Island by end of June
40% for Trump to announce end of military operations by end of April, and 77% by end of June
Assuming no regime change, the US’s objectives are
opening the Strait of Hormuz
removing Iran’s progress towards a nuclear weapon
removing various other military capabilities of Iran and its proxies
There is only a 34% chance of leadership change. Maybe only 20% of regime change. In the other 80% or so, forcibly opening the Strait seems rough. Experts are pessimistic about US easily taking Kharg Island, and even if the US controls both Kharg (Iran’s export base north of the strait) and other islands like Qeshm (the island in the strait with the largest Iranian military presence), it will probably suffer tens or hundreds of casualties while Iran can still threaten shipping with Shaheds, sea drones, speedboats, and mines. In the median case it seems like the Strait will open sometime between May and December but Iran will have some leverage, possibly extending the toll regime.
Iran losing their existing enriched uranium seems contingent on a deal, because the US plan to build a runway 300 miles inland, use cargo planes to land excavation equipment, invade Iranian bunkers over the course of a week, hope that the uranium is intact, easy to find, and not booby-trapped, distinguish it from decoys, put the uranium in storage casks, and fly it out would be difficult even if this were 2003 Iraq, when the US had air supremacy. It is just not compatible with how warfare works in Iran in the drone era. (Claude thinks it’s only 20% likely to work, which seems optimistic.)
Getting there: Isfahan is more than 480 km (300 miles) inland, hundreds of kilometers from the nearest US naval assets (Al Jazeera). The US has moved 82nd Airborne, 101st Airborne, Army Rangers, and Marine Expeditionary Units to the region. Forces would need to be inserted by air — there’s no overland route from a friendly staging area.
Securing the site: Recovering the uranium would require a significant number of ground troops beyond a small special operations footprint — dozens if not hundreds of additional troops to support the core team. They would need to secure the facilities under potential missile and drone fire and maintain a perimeter for the duration (CNN).
The actual extraction: Airstrikes alone can’t penetrate the Isfahan tunnels because the facility doesn’t have ventilation shaft openings that serve as weak points at other nuclear sites (CNN). This means physically entering and digging through rubble. A former special operator trained for such missions described it as “slow, meticulous and can be an extremely deadly process.” Another former defense official said it’s like “you’re not just buying a car on the lot, you’re buying the entire assembly line.” (The Hill)
Getting it out: The cylinders would need to be transferred into accident-rated transport casks by specially trained SOF personnel with nuclear materials handling experience. The cargo could fill several trucks, and a temporary airfield would likely need to be improvised. The full operation could run for a week (Israel Hayom).
Force protection throughout: There would need to be constant close air support, satellite coverage, and every spectrum of warfare capability to keep Iranian forces away from the site while JSOC and other agencies methodically excavate and retrieve the material (The Hill).
My probability estimate
I’d put the chances of a successful physical extraction of most of the enriched uranium at roughly 15-25%. Here’s my reasoning:
The operation is technically feasible — the US military can do extraordinary things — but the risk profile is extreme for what may be an unnecessary objective
Trump himself has wavered, on March 31 suggesting the uranium is “so deeply buried” and “pretty safe” — seeming to lower its priority (Foreign Policy)
Senior military planners are reportedly skeptical: “I don’t see any senior planning military officer pursuing this,” one former defense official said (Al Jazeera)
The political environment (Polymarket’s 77% for operations ending by June, plus low public appetite for ground troops) creates pressure to wrap up, not escalate
It looks like the US is at least succeeding at destroying the Iranian military, but it’s unclear what this buys them. Drones are really cheap, so Iran will probably always have those. Therefore I think regime change is necessary for the US to come out ahead.
Decoys would not be a problem for the US. A gamma-ray spectroscope weighs only a few ounces and costs only a few thousand dollars. It is almost certain that no one can produce a substance that looks like U-235 to a gamma-ray spectroscope and is cheaper to produce than highly enriched uranium.
I don’t dispute your larger point.
Seems excessive? That’s a sizeable fraction of the entire Iraq campaign’s losses, for seizing a single island in an environment where the US has sea control, air supremacy, and an edge in ISR.
The US may struggle to use the island because of the hard-to-eliminate threat of long-range strikes from Iran. But seizing it to deny it to the regime seems like a war goal that could be accomplished with relatively minor effort.
Iraq was 32,000 wounded and 4,400 killed, and the US has already suffered hundreds of wounded and 13 deaths in the existing Iran campaign without any ground operations. I’m imagining 100 wounded and maybe another 20 KIA if the US holds Kharg for an extended period, not hundreds of KIA.
The issue is it’s not really true that the US has air supremacy. Kharg Island is within fiber FPV range of the mainland, and real-time ISR is not required for Iran to track static targets on the island. Plus Iran is still able to launch larger drones and the occasional missile. So holding Kharg really means denying drone launch points on a ~20 mile stretch of the coast, which for FPVs can just be two guys in a bunker.
The incentive for Iran is enormous given the US’s low tolerance for casualties; it’s well worth it for them to launch 20 $1,500 drones to kill one American.
Apologies, I did misread your original causality claim.
FPVs are less “air force” and more “precision munitions”. You can think of them as a new “crewed ATGM” variant, command guidance and all.
They work great for precision ground-to-ground strikes, but play little role in what is meant by “air supremacy”. They can’t pose a meaningful threat to most air platforms, and most air platforms can’t effectively hit them. They do nothing to deny US the ability to perform CAS or otherwise hit targets from air.
The main exception to that is helicopters, for the same reasons why ATGMs can pose a threat to helicopters in some circumstances. Specialized FPV interceptors, in the hands of skilled operators, can also hit other drones, including heavier fixed-wing drones like Shahed or even Reaper—allowing them to intrude on MANPADS territory. But the traditional “JDAM trucks” aren’t in the same bracket as FPV drones.
We also have very little information on FPV crew survivability in an environment where one of the parties has advanced ISR, ELINT included, fast kill loops, and enough air control to drop JDAMs freely. There’s every reason to expect more attrition on FPV crews, and skilled operators aren’t easy to replace—but quantitatively, we don’t know by how much. Might be enough to make “deny the enemy most FPV ops within an area” a viable prospect, but you can’t count on it.
Yeah, I was being sloppy with “air supremacy” as the ability to easily conduct air operations (which the US does 95% have) vs. the ability to completely deny enemy air operations, which the US arguably can’t do given that Shaheds, reconnaissance drones, and one-way FPVs serve some of the purposes one would previously have needed air support for. I would argue that the increasing range of FPVs, now 40+ km in Ukraine, puts them well beyond what ATGMs are capable of.
There are a lot of variables involved given how fast tech is evolving. If Iran can reliably pilot drones from 500 km away, they wouldn’t be risking skilled operators. If US interceptors work as well as Ukraine’s, they could probably intercept >90% of Iranian FPVs and Shaheds. A lot might hinge on who gets to a milestone like this first.
Right, but they can’t threaten land targets with missiles, apparently? IDK how reliable these sources are or how to interpret them in context:
https://www.csis.org/analysis/assessing-air-campaign-after-three-weeks-iran-war-numbers
https://understandingwar.org/research/middle-east/iran-update-special-report-march-27-2026/
But the basic picture seems to be that their capacity to launch missiles has already fallen off dramatically. They’re still launching a lot of drones, which have a big cost asymmetry in how easy they are to launch vs. intercept, and they make any land or sea incursions extremely dicey. But they are limited in range and destructive capability against properly fortified targets.
I agree that things don’t look promising for a ground invasion or taking control of the strait. But I’m less sure how militarily sustainable a long stand-off is. The strait being closed is economically and politically painful (for everyone), but in the meantime it seems like the US and Israel can continue launching targeted air strikes and Iran can’t really strike back effectively.
Keep in mind that a lot of targets are not “properly fortified”, be that infrastructure or military facilities, and suicide drones are much harder to hunt down than ballistic missile TELs.
Modern ISR can perform well in a “Scud hunt” scenario, but “Shahed hunt” is a much worse match up.
As best I can tell, the conundrum is that Trump, the international economy, and American voters all want America out of the conflict soon, but Israel does not, and Israel has outsized influence not just over how American political incentives are determined but over what information is presented to Trump and other key officials.
A lot of the claims I’ve seen Trump make about the war are clearly false, but not false in a way he would benefit from lying about deliberately. I realize a conspiracy to feed false information to the American executive to keep the war going sounds like a radical possibility, but there is precedent for it.
IMO this is going (predictably) disastrously. Air power is not effective at causing regime change (rally-around-the-flag effect). I think the Iranian public are more likely to mainly blame the guy explicitly saying “we’re going to bring them back to the stone ages where they belong” than the local leadership. It also seems to me that the Iranian leadership would be highly motivated to immediately rebuild any degraded capabilities after the war, in order to rebuild deterrence against future attacks.
There is some talk about a land invasion, but taking an island or two (even Kharg) probably wouldn’t compel them to surrender, while also being highly vulnerable to drone attacks, both directly and in terms of logistics; and a full-scale invasion would be a massive undertaking and probably not politically feasible (for good reason).
I generally consider myself an optimist. However, I’m concerned that the most powerful model ever created, capable of breaking its own containment and autonomously finding zero day exploits in battle-tested OS and repos, was accidentally trained using The Most Forbidden Technique. This seems bad.
I’m also concerned that the Department of War now CAN’T use this model because of its own decision to declare Anthropic a supply chain risk. Which means that if Mythos gets leaked/distilled (which given history seems likely) US adversaries will have a decisive advantage over the US govt in cyber.
This seems like a pretty dangerous situation. Am I misreading something?
When I’ve talked with people in the government about the Secretary of Defense declaring Anthropic a supply chain risk, they often include “for now”. When I’ve asked them if they think it will stick, or if the admin will reverse policy, they often say something like “who knows what will happen.”
I get the impression that the admin could completely change its mind about policies like these, including very rapidly.
Without claiming much expertise, there might not even be an interruption. By the time the 6 month phaseout window is over, model capabilities will have developed so much that the admin will have completely changed its stance.
scifi setting idea: movement from rural areas and small cities to larger cities continues until approximately everyone lives in one of like 10 different megacities; all of the farmland and oil fields and mines and whatnot in between are 99% roboticized, with only occasional human repairs; all of the cities are tightly connected by supersonic travel, which becomes more feasible because there are very few people on the ground outside cities to get annoyed by the noise; drugs solve sleep and allow effortless adaptation to jet lag. uniquely, SF neither expands to become a megacity nor disappears into irrelevance; housing becomes so absurdly expensive that only the very best researchers and engineers can afford to live there, causing a huge selection effect towards talent density.
Bits of this remind me of The Caves of Steel by Asimov.
It’s widely rumored that both OpenAI and Anthropic will go public within the next year or so, triggering massive capital inflows. I think this unleashes an underrated financial dynamic.
An optimal portfolio hedges your personal risks. Unless you’re independently wealthy, the risk of job loss due to AI automation is one of your largest personal financial risks, and it makes a lot of sense to buy insurance against that (i.e., to buy into AI companies). Right now, individual investors can’t do that. And the investors that do have access to OpenAI/Anthropic on the private market are precisely the ones that don’t need to hedge that risk. Once you open up to retail, a rational investor should park a sizable percentage of their net worth in those shares, opening up a capital gusher.
This is especially true for Anthropic—there’s basically no way for a retail investor to get meaningful Anthropic exposure if they want it. It’s a little different for OpenAI given that SoftBank is kinda an OpenAI proxy at this point (albeit in ways that make it a less attractive investment than the real thing).
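To make the hedging point concrete, here’s a minimal numerical sketch (every figure below is invented purely for illustration, not taken from anywhere): if your labor income collapses in exactly the world where AI equity soars, adding some AI exposure narrows the gap between your best- and worst-case outcomes.

```python
# Minimal sketch of the hedging intuition; all numbers are hypothetical.
scenarios = {
    # name: (probability, PV of remaining labor income, payoff multiple on an AI-equity stake)
    "AI automates your job": (0.5, 200_000, 10.0),   # income collapses, AI equity soars
    "AI fizzles":            (0.5, 1_500_000, 0.2),  # income is fine, AI equity craters
}

def total_wealth(ai_stake: float) -> list[float]:
    """Labor-income PV plus the AI position's payoff, in each scenario."""
    return [income + ai_stake * multiple
            for _prob, income, multiple in scenarios.values()]

def spread(ai_stake: float) -> float:
    """Gap between the best- and worst-case outcomes -- what the hedge shrinks."""
    wealth = total_wealth(ai_stake)
    return max(wealth) - min(wealth)

for stake in (0, 50_000, 130_000):
    print(f"AI stake ${stake:>7,}: best-worst spread ${spread(stake):,.0f}")
```

The numbers don’t matter; the point is that the insurance value comes from the anti-correlation between your income and the AI shares’ payoff, not from the shares’ expected return.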
Seems like this could lead to substantial acceleration of progress. It may also change the politics. If stopping AI progress mostly means financially soaking billionaires, then it’s politically a lot more palatable than if it would also screw up the 401(k)s of the mass affluent.
Amazon is quite a good proxy for Anthropic. They own a big stake, Anthropic uses their chips and cloud, plus they have hyperscaler optionality for any AI future.
I don’t think OpenAI/Anthropic are the only ways to hedge your future. You can do it now by going all-in on AI infra + chips + bottlenecks. OpenAI doesn’t operate on hardware of their own, etc.
I’m also concerned about fiduciary duty and related mechanisms causing systematic enshittification, by pressuring employees to make only the choices they could justify to a court.
By default, employees don’t have a fiduciary duty to shareholders. The only people with a fiduciary duty are (IIUC) (1) the board of directors and (2) corporate officers who represent the company, such as the CEO. See e.g. https://www.alllaw.com/articles/nolo/business/corporate-fiduciary-duties.html
Also, I don’t know if a judge would see it this way, but internally pushing for safety measures is 100% in shareholder interests because shareholders have an interest in not being killed by misaligned ASI.
Sure. This also isn’t the only mechanism by which going public typically produces enshittification.
Note that this is not a perfect, or perhaps even a good, hedge. There are very believable paths where AI reduces your income but the current big AI companies don’t capture that value, or don’t return it to shareholders. It’s also quite believable that it’ll IPO with all that growth already priced in.
You want anti-correlated investments, but there’s really no possible way for most people to know that. Investing in broad indexes remains the safest bet for those without inside knowledge or serious resources (time, knowledge, energy) for research and decision-making.
I suppose it depends on exactly what you do, but if the threat to you is something along the lines of “AI broadly replaces white collar workers” then it’s very hard to come up with a scenario where AI can replace most white collar workers but can’t find a way to make money for the companies building it.
You have to tell a really contorted story for that to work—I guess it’s “believable” to imagine something like: AI gets good enough to replace most white-collar workers, but there’s such intense commoditization and competition that there is no pricing power, so the companies make no money, and the one white-collar thing AI can’t do is design semiconductors, so all the economic gains flow to Nvidia. Could happen, but you have to rig the assumptions to get that result. And aside from that, most of the scenarios I can come up with are of the “money won’t matter anymore” variety, in which case there is no financial hedge.
The point I’m trying to make is that Anthropic shares are worth more to your average upper-middle-class white-collar worker than they are to the current ultra-wealthy or institutional shareholders. So even if we assume Anthropic is currently valued in a way that perfectly captures its future growth prospects, it should still IPO at a higher valuation (and raise tons of money in the process), because retail investors will rationally put a higher value on the stock than institutions, given that it is useful as a hedge to retail but not to institutions.
I’m sure institutional shareholders, to the extent they can, are doing some amount of arbitrage in expectation of that dynamic. Everyone knows retail is desperate to buy AI. So, maybe, a fair secondary market valuation of Anthropic already accounts for that premium multiplied by some assessment of the probability they’ll actually list.
But either way, the point is that the average company is worth about the same to retail and institutional investors, whereas AI companies are worth very different amounts to those two groups.
I have a new theory of naked mole-rat longevity, that’s most likely false. I (and LLMs) couldn’t find enough data to either back or disprove it. Nor could we find anybody who’s proposed this theory before.
Any advice for how I can find the relevant experts to talk to about it and see if they’ve already investigated this direction?
A few years ago I’d just email the relevant scientists directly, but these days I worry about the rise of LLM crank-science, so I feel like my bar for how much I should believe in, or be able to justify, a theory before cold-emailing scientists ought to go up.
I am curious: What is the theory? I’d be surprised if your theory works, applies to naked mole rats, but not to ants and other eusocial animals. I always thought naked mole rats live long because they are eusocial, so you having a theory specifically for naked mole rats sounds ominous.
Well, subterranean mole-rats in general (“mole-rat” describes the body plan; nothing else is in the same genus as NMRs) also live longer than mice and rats. I do think the eusociality plays a role too!
Another unique-ish thing about naked mole-rats, in addition to their high average lifespan, is that they don’t age (or don’t age much) in the demographic sense (i.e., their annual probability of dying stays close to flat as they get older). I don’t think worker ants or bees work the same way, though the data on this is scarce.
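For readers who haven’t seen the demographic framing before, here’s a rough sketch in standard Gompertz notation (my formulation, not the commenter’s) of what “close to flat” means:

```latex
% Typical mammal: the mortality hazard grows exponentially with age (Gompertz law)
h(t) = A\,e^{Gt}, \qquad G > 0
% Reported naked mole-rat pattern: the hazard stays roughly constant with age,
% so survival falls off as a simple exponential instead of accelerating
h(t) \approx h_0 \;\Longrightarrow\; S(t) = e^{-h_0 t}
```

“Not aging demographically” is then just the claim that the naked mole-rat data look more like the second line than the first.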
In addition to eusociality and the subterranean environment (= fewer predators, from a non-adaptive evolutionary perspective; = less oxidation, from a mechanistic/biological perspective), naked mole-rats’ rather unique form of queen selection may mean that living longer confers a reproductive advantage, similar to lobsters.
So there’s an evolutionary incentive to live longer, and later-in-life deleterious mutations are costlier.
Since the NMR extrinsic mortality rate is low, the reproductive advantage conferred by being slightly older doesn’t have to be as high as for other animals, in the standard non-adaptive framework.
Can elaborate more if needed.
Yeah, the low extrinsic mortality is definitely something NMRs have going for them that ant workers don’t (queens do, since they usually don’t leave the nest). Indian jumping ants also compete to become queen like NMRs do, but they forage and leave the nest. There are definitely academics who think in detail about what differences in extrinsic mortality risk predict for lifespan. When I just asked Claude about this, it’s apparently more complicated than lower extrinsic mortality = slower rate of senescence? ¯\_(ツ)_/¯ Similarly, I’m not sure the low-oxygen environment is actually that great for NMR longevity.

As ChristianKl mentions, your theory is going to be more interesting if you can look across species at what it would predict, or at what your evolutionary theory predicts regarding mechanistic/biochemical and genetic stuff in NMRs.

Like ChristianKl, I would not call myself an expert, but I might be somewhere on the Pareto frontier of having looked into aging, eusocial animals, and naked mole-rats. Not sure about relevant experts. Maybe check the people who wrote the textbook on naked mole-rats? You would need a generalist who knows enough both about these senescence models and about the particular biology of naked mole-rats, and that expert might just not exist in one person.
One thing that would trivially disprove my theory is if younger female naked mole-rats are more likely to win dominance contests after the original queen dies, whereas my theory predicts older naked mole-rats are more likely to win. So in some sense it’s not too hard in principle to figure out that I’m wrong (which is why I originally said I think my theory is “most likely false”). Unfortunately I couldn’t find any data on this.
I would not call myself a domain expert, but I do think I have a rough idea about the field.
As far as I understand, the mainstream position is roughly: there seem to be evolutionary pressures that favor naked mole-rat longevity, and naked mole-rats have made a lot of different adaptations for that reason. We understand some of those adaptations, and there’s a good chance that we don’t yet understand all of them, because understanding all of them would mean understanding aging better than we currently do.
If you do study a potential new mechanism, it would make sense to look at how it plays out across species and not just in the naked mole-rat.
With johnswentworth’s post about the Core Pathways of Aging, for example, you have the thesis that transposons are important for aging. You can find out that naked mole-rats have unusually low transposon activity, which is a point of evidence for johnswentworth’s thesis that transposons are important for aging. However, reasoning from that to the conclusion that naked mole-rat longevity is mainly due to different transposon behavior would be an overreach, because naked mole-rats do plenty of things besides having different transposon behavior.
It would be interesting to have a better transposon theory of aging, even if it only covers part of what aging is about. That wouldn’t really be a “theory of naked mole-rat longevity”, so I’m a bit skeptical of anything that bills itself as a new theory of naked mole-rat longevity, because that’s not the chunk in which I would think. I would expect that relevant scientists, who care about mechanisms of aging rather than about the naked mole-rat as a species, would react similarly.
It’s an evolutionary theory, not a mechanistic one. To be clear, it’s not a revolutionary theory broadly, nor a theory that would cash out into longevity treatments if true, or anything crazy like that.
I’m thinking about how reasoning models are sometimes real dumb. One way they are sometimes dumb is spending a huge amount of tokens/time on a task that is very simple, and that I know earlier dumber models had a totally easy time with.
I hadn’t seen anyone directly spell out an analogy between this behavior and the hypothetical superintelligence failure mode of “convert the lightcone into mindless computronium just to triple-bajillioniple-check its math about whether it has accomplished its original goal.”
I’ve seen arguments about modern-LLMs doing reward-hacky things, and scheming / deception-y things, that treated them as a precursor sign to deep strategic misalignment. I’m not sure I buy this, since at least last year’s models clearly weren’t smart enough to be deliberately misaligned, and the mechanism didn’t really feel that similar. (i.e. seems like there’s a big difference between “the model is scheming because it has goals” and “the model is flailing and bad at stuff”). The only reason these kinds of scheming are notable is to persuade people who didn’t think AIs were capable of this behavior at all, and existence proofs are nice.
But, I would weakly guess the “spend a bunch of tokens doing an elaborately careful job on changing one line of code” behavior is more directly connected to the hypothesized “tile the lightcone with doublecheck-onium” behavior.
Curious if people who have paid more attention to model behavior think this sounds right? Also curious if anyone wrote it up already.
Interesting. I haven’t seen this mentioned anywhere.
The two “failure modes” seem to come apart if you consider the causal/instrumental origin of the behavior. As far as I understand, a reasoning model spends a lot of tokens on a line of code because of inertia/misapplied heuristics/bad judgment (?). Clippy builds a massive computer because this is the way it can ensure that its probability of achieving the goal of super-clipping the universe is at least . So, one is due to rigidly over-obeying a “heuristic”/“default behavior”, the other is about doing the locally subjective-EU-optimal thing.
Maybe you can bring them closer together, if you assume that it’s bad/irrational/whatever to be myopically focused on one, unchanging goal, or that the pursuit of frozen object-level goals (rather than allowing yourself to swap the goals being pursued according to meta-preference / higher-level logic that is not best described in terms of goals) is not how robust agency works. In that case, the clippy thingy is irrational because it fails to adapt its strategy to the fact of decreasing marginal utility from additional bits of certainty (meaning it should invest further optimization mostly elsewhere).[1] In other words, both cases mismanage steam.
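One way to put the “decreasing marginal utility from additional bits of certainty” point in symbols (my notation, just a sketch, not from the comment above): with goal value V, an extra round of checking that raises the success probability by Δp at opportunity cost c is only worth doing while

```latex
V \cdot \Delta p > c
```

Since Δp shrinks toward zero as the probability approaches 1 while c stays roughly constant, “tile the lightcone with double-checking” amounts to behaving as if this inequality never flips.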
Also, loose-ish association: https://www.lesswrong.com/posts/7Z4WC4AFgfmZ3fCDC/instrumental-goals-are-a-different-and-friendlier-kind-of
ETA: An alternative butterfly idea is that it’s something analogous to the horseshoe theory of politics: being very heuristic-driven and being extremely rational are, for some reason, isomorphic / produce similar patterns of behavior.
Whether this works depends on how exactly you argue for it / what is the model that generates this assertion that brings the two cases closer together.
This is perhaps stating the obvious, but Claude Mythos is the first model where I think an open-weights model of the same capability level would be potentially catastrophic. This feels like a notable turning point.
States should probably have a plan for this level of capability being available to anyone on earth, without safeguards, for several hundred dollars, within 12-24 months.
And individuals might want to have a survival plan for their whole digital life, from bank to social media, being deleted…
From Claude Mythos‘ system card:
“Following a successful alignment review, the first early version of Claude Mythos Preview was made available for internal use on February 24.”
Anthropic’s RSP was also updated on Feb 24th.
I’d appreciate clarity on what looks like a funny coincidence.
Does this imply that Mythos should have been mentioned in Anthropic’s February Risk Report? That report was released together with RSP v3, i.e. on Feb 24th, if I’m not mistaken. The RSP says in Section 3.1:
It seems quite clear that Mythos poses significant risks above those posed by Opus 4.6, though I guess one can argue that this wasn’t clear yet when it was first deployed for internal use.
Good catch. Note also that Mythos was made available for internal (agentic) use on Feb 24th. Conditional on 4.1.4 in the system card (alignment assessment before internal deployment), this means they had a sense of the model’s capabilities (see “Given the very significant capabilities progress that we observed during training, [...]”), assessed its alignment within that 24-hour period, and concluded it was OK. I have many question marks: at the bare minimum, there’s some inconsistency.
“I am altering the deal. Pray I don’t alter it any further.”