Things I post should be considered my personal opinions, not those of any employer, unless stated otherwise.
Aaron_Scher
I think your discussion for why humanity could survive a misaligned superintelligence is missing a lot. Here are a couple claims:
When there are ASIs in the world, we will see ~100 years of technological progress in 5 years (or like, what would have taken humanity 100 years in the absence of AI). This will involve the development of many very lethal technologies.
The aligned AIs will fail to defend the world against at least one of those technologies.
Why do I believe point 2? It seems like the burden of proof is really high to say that “nope, every single one of those dangerous technologies is going to be something that it is technically possible for the aligned AIs to defend against, and they will have enough lead time to do so, in every single case”. If you’re assuming we’re in a world with misaligned ASIs, then every single existentially dangerous technology is another disjunctive source of risk. Looking out at the maybe-existentially-dangerous technologies that have been developed previously and that could be developed in the future, e.g., nuclear weapons, biological weapons, mirror bacteria, false vacuum decay, nanobots, I don’t feel particularly hopeful that we will avoid catastrophe. We’ve survived nuclear weapons so far, but with a few very close calls — if you assume other existentially dangerous technologies go like this, then we probably won’t make it past a few of them. Now crunch that all into a few years, and like gosh it seems like a ton of unjustified optimism to think we’ll survive every one of these challenges.
It’s pretty hard to convey my intuition around the vulnerable world hypothesis; I also try to do so here.
I was surprised to see you choose to measure faithfulness using the setup from Chua et al. and Turpin et al. rather than Lanham et al. IMO, the latter is much better, albeit restricted in that you have to do partial pre-filling of model responses (so you might be constrained on which models you can run it on, but it should be possible on QwQ). I would guess this is partially for convenience reasons, as you already have a codebase that works and that you’re familiar with, and partially because you think this is a better setup. Insofar as you think this is a better setup, I would be excited to hear why. Insofar as you might do follow-up work, I am excited to see the tests from Lanham et al. applied here.
I would happily give more thoughts on why I like the measurement methods from Lanham et al., if that’s useful.
I like this blog post. I think this plan has a few problems, which you mention, e.g., Potential Problem 1, getting the will and oversight to enact this domestically, and getting the will and oversight/verification to enact this internationally.
There’s a sense in which any plan like this that coordinates AI development and deployment to a slower-than-ludicrous rate seems like it reduces risk substantially. To me it seems like most of the challenge comes from getting to a place of political will from some authority to actually do that (and in the international context there could be substantial trust/verification needs). But nevertheless, it is interesting and useful to think through what some of the details might be of such a coordinated-slow-down regime. And I think this post does a good job explaining an interesting idea in that space.
I would like Anthropic to prepare for a world where the core business model of scaling to higher AI capabilities is no longer viable because pausing is needed. This looks like having a comprehensive plan to Pause (actually stop pushing the capabilities frontier for an extended period of time, if this is needed). I would like many parts of this plan to be public. This plan would ideally cover many aspects, such as the institutional/governance side (who makes this decision and on what basis, e.g., on the basis of the RSP), the operational side (what happens), and the business side (how does this work financially).
To speak to the business side: Currently, the AI industry is relying on large expected future profits to generate investment. This is not a business model which is amenable to pausing for a significant period of time. I would like there to be minimal friction to pausing. One way to solve this problem is to invest heavily (and have a plan to invest more if a pause is imminent or ongoing) in revenue streams which are orthogonal to catastrophic risk, or at least not strongly positively correlated. As an initial brainstorm, these streams might include:
Making really cheap weak models.
AI integration in low-stakes domains or narrow AI systems (ideally combined with other security measures such as unlearning).
Selling AI safety solutions to other AI companies.
A plan for the business side of things should also include something about “what do we do about all the expected equity that employees lose if we pause, and how do we align incentives despite this”. It should probably also include a commitment to ensure all investors and business partners understand that a long-term pause may be necessary for safety and are okay with that risk (maybe this is sufficiently covered under the current corporate structure, I’m not sure, but those sure can change).
It’s all well and good to have an RSP that says “if X we will pause”, but the situation is probably going to be very messy, with ambiguous evidence, crazy race pressures, crazy business pressures from external investors, etc. Investing in other revenue streams could reduce some of this pressure, and (if shared) potentially it could enable a wider pause. E.g., all AI companies see a viable path to profit if they just serve early AGIs for cheap, and nobody has intense business pressure to go to superintelligence.
Second, I would like Anthropic to invest in its ability to make credible commitments about internal activities and model properties. There is more about this in Miles Brundage’s blog post and my paper, as well as FlexHEGs. This might include things like:
cryptographically secured audit trails (version control for models). I find it kinda crazy that AI companies sometimes use external pre-deployment testers and then change a model in completely unverifiable ways and release it to users. Wouldn’t it be so cool if OpenAI couldn’t do that, and instead when their system card comes out there are certificates verifying which model was evaluated and how the model was changed from evaluation to deployment? That would be awesome!
whistleblower programs, declaring and allowing external auditing of what compute is used for (e.g., differentiating training vs. inference clusters in a clear and relatively unspoofable way),
using TEEs and certificates to attest that the same model is evaluated as being deployed to users (a toy sketch of the fingerprint part of this idea is below), and more.
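To gesture at the minimal technical core of the audit-trail/certificate idea, here is a toy sketch (my own illustration, not any existing company’s scheme; the file path is hypothetical): just a content fingerprint of the evaluated checkpoint that an external evaluator could sign, and that anyone could later check against what is actually deployed.

```python
import hashlib

def model_fingerprint(weights_path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a weights file and return a SHA-256 digest identifying exactly that checkpoint."""
    h = hashlib.sha256()
    with open(weights_path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical flow: the external evaluator computes and signs this digest for the model they
# tested; at deployment time, the digest is recomputed and checked to match (or a documented
# chain of changes from the evaluated checkpoint to the deployed one is published).
# print(model_fingerprint("checkpoints/evaluated-model.bin"))
```

A real version would need the hash computed and signed inside trusted hardware (the TEE point above) plus some certificate infrastructure, but the basic artifact is that simple.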
I think investment/adoption in this from a major AI company could be a significant counterfactual shift in the likelihood of national or international regulation that includes verification. Many of these are also good for being-a-nice-company reasons; like, I think it would be pretty cool if claims like Zero Data Retention were backed by actual technical guarantees rather than just trust (which seems to be the status quo).
I believe this is standard/acceptable for presenting log-axis data, but I’m not sure. This is a graph from the Kaplan paper:
It is certainly frustrating that they don’t label the x-axis. Here’s a quick conversation where I asked GPT-4o to explain. You are correct that a quick look at this graph (where you don’t notice the log scale) would imply (highly surprising and very strong) linear scaling trends. Scaling laws are generally very sub-linear, in particular often following a power law. I don’t think they were trying to mislead about this; log-scaled axes are super common in this domain and don’t invalidate the results in any way.
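To spell out why a power law looks like a straight line on those axes (generic form only; the subscripted constants below are placeholders rather than the paper’s fitted values):

```latex
% Kaplan-style scaling law in compute C, with a small exponent \alpha_C \ll 1:
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
% Taking logs of both sides:
\log L = \alpha_C \left(\log C_c - \log C\right)
% i.e., loss is linear in \log C (a straight line on log-log axes),
% even though the returns to raw compute are very sub-linear.
```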
From the o1 blog post (evidence about the methodology for presenting results but not necessarily the same):
o1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples.
What do people mean when they say that o1 and o3 have “opened up new scaling laws” and that inference-time compute will be really exciting?
The standard scaling law people talk about is for pretraining, shown in the Kaplan and Hoffmann (Chinchilla) papers.
It is also the case that various post-training (i.e., finetuning) techniques improve performance (though I don’t think there is as clean a scaling law here; I’m unsure). See, e.g., this paper which I just found via googling “fine-tuning scaling laws”. See also the Tülu 3 paper, Figure 4.
We have also already seen scaling law-type trends for inference compute, e.g., this paper:
The o1 blog post points out that they are observing two scaling trends: predictable scaling w.r.t. post-training (RL) compute, and predictable scaling w.r.t. inference compute:
The paragraph before this image says: “We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.” That is, the left graph is about post-training compute.
Following from the graph on the left, the o1 paradigm gives us models that are better for a fixed inference compute budget (which is basically what it means to train a model for longer, or to train a better model of the same size by using better algorithms — the method is new but not the trend). Following from the graph on the right, performance seems to scale well with the inference compute budget. I’m not sure there’s sufficient public data to compare that right-hand graph against other inference-compute scaling methods, but my guess is the returns are better.
What is o3 doing that you couldn’t do by running o1 on more computers for longer?
I mean, if you replace “o1” in this sentence with “monkeys typing Shakespeare with ground truth verification,” it’s true, right? But o3 is actually a smarter mind in some sense, so it takes [presumably much] less inference compute to get similar performance. For instance, see this graph about o3-mini:
The performance-per-dollar frontier is pushed up by the o3-mini models. It would be somewhat interesting to know how much it would cost for o1 to reach o3 performance here, but my guess is that it’s a huge amount and practically impossible. That is, there are some performance levels that are practically unobtainable for o1, the same way the monkeys won’t actually complete Shakespeare.
Hope that clears things up some!
The ARC-AGI page (which I think has been updated) currently says:
At OpenAI’s direction, we tested at two levels of compute with variable sample sizes: 6 (high-efficiency) and 1024 (low-efficiency, 172x compute).
Regarding whether this is a new base model, we have the following evidence:
o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute. Way faster than pretraining paradigm of new model every 1-2 years
o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model is very, very impressive. (2/n)
The prices leaked by the ARC-AGI people indicate $60/million output tokens, which is also the current o1 pricing: 33m total tokens and a cost of $2,012.
Notably, the codeforces graph with pricing puts o3 about 3x higher than o1 (though maybe it’s secretly a log scale), and the ARC-AGI graph has the cost of o3 being 10-20x that of o1-preview. Maybe this indicates it does a bunch more test-time reasoning. That’s corroborated by the ARC-AGI numbers, which work out to an average of ~55k tokens per solution[1], which seems like a ton.
I think this evidence indicates this is likely the same base model as o1, and I would be at like 65% sure, so not super confident.
- ^
edit to add because the phrasing is odd: this is the data being used for the estimate, and the estimate is 33m tokens / (100 tasks * 6 samples per task) = ~55k tokens per sample. I called this “solution” because I expect these are basically 6 independent attempts at answering the prompt, but somebody else might interpret things differently. The last column is “Time/Task (mins)”.
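For what it’s worth, here is that arithmetic as a quick sanity check (all figures are the ones quoted above from ARC-AGI and the o1 pricing, not any official cost breakdown):

```python
# Back-of-envelope check of the figures quoted above (ARC-AGI high-efficiency run).
total_tokens = 33_000_000            # ~33m total tokens
tasks, samples_per_task = 100, 6     # 100 tasks, 6 samples per task
price_per_million_output = 60        # $/million output tokens, same as current o1 pricing

tokens_per_sample = total_tokens / (tasks * samples_per_task)
implied_cost = (total_tokens / 1_000_000) * price_per_million_output

print(f"tokens per sample: ~{tokens_per_sample:,.0f}")   # ~55,000
print(f"implied cost: ~${implied_cost:,.0f}")            # ~$1,980, vs. the reported $2,012
```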
Thanks for your continued engagement.
I appreciate your point about compelling experimental evidence, and I think it’s important that we’re currently at a point with very little of that evidence. I still feel a lot of uncertainty here, and I expect the evidence to basically always be super murky and for interpretations to be varied/controversial, but I do feel more optimistic than before reading your comment.
You could find a way of proving to the world that your AI is aligned, which other labs can’t replicate, giving you economic advantage.
I don’t expect this to be a very large effect. It feels similar to an argument like “company A will be better on ESG dimensions and therefore more customers will switch to using it”. Doing a quick review of the literature on that, it seems like there’s a small but notable change in consumer behavior for ESG-labeled products. In the AI space, it doesn’t seem to me like any customers care about OpenAI’s safety team disappearing (except a few folks in the AI safety world). In this particular case, I expect the technical argument needed to demonstrate that some family of AI systems is aligned while others are not to be really complicated; I expect fewer than 500 people would be able to actually verify such an argument (or the initial “scalable alignment solution”), maybe zero people. I realize this is a bit of a nit because you were just gesturing toward one of many ways it could be good to have an alignment solution.
I endorse arguing for alternative perspectives and appreciate you doing it. And I disagree with your synthesis here.
This is important work, keep it up!
I agree it’s plausible. I continue to think that defensive strategies are harder than offensive ones, except the ones that basically look like centralized control over AGI development. For example,
Provide compelling experimental evidence that standard training methods lead to misaligned power-seeking AI by default
Then what? The government steps in and stops other companies from scaling capabilities until big safety improvements have been made? That’s centralization along many axes. Or maybe all the other key decision makers in AGI projects get convinced by evidence and reason and this buys you 1-3 years until open source / many other actors reach this level of capabilities.
Sharing an alignment solution involves companies handing over valuable IP to their competitors. I don’t want to say it’s impossible, but I have definitely gotten less optimistic about this in the last year. I think in the last year we have not seen a race to the top on safety, in any way. We have not seen much sharing of safety research that is relevant to products (or like, applied alignment research). We have instead mostly seen research without direct applications: interp, model organisms, weak-to-strong / scalable oversight (which is probably the closest to product relevance). Now sure, the stakes are way higher with AGI/ASI so there’s a bigger incentive to share, but I don’t want to be staking the future on these companies voluntarily giving up a bunch of secrets, which would be basically a 180 from their current strategy.
I fail to see how developing and sharing best practices for RSPs will shift the game board, except insofar as it involves key insights on technical problems (e.g., alignment research that is critical for scaling), which hits the IP problem. I don’t think we’ve seen a race to the top on making good RSPs, but we have definitely seen pressure to publish any RSP. Not enough pressure; the RSPs are quite weak IMO and some frontier AI developers (Meta, xAI, maybe various Chinese orgs count) have none.
I agree that it’s plausible that “one good apple saves the bunch”, but I don’t think it’s super likely if you condition on not centralization.
Do you believe that each of the 3 things you mentioned would change the game board? I think that they are like 75%, 30%, and 20% likely to meaningfully change catastrophic risk, conditional on happening.
Training as it’s currently done needs to happen within a single cluster
I think that’s probably wrong, or at least effectively wrong. Gemini 1.0, trained a year ago, has the following info in the technical report:
TPUv4 accelerators are deployed in “SuperPods” of 4096 chips...
TPU accelerators primarily communicate over the high speed inter-chip-interconnect, but at Gemini Ultra scale, we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network (Poutievski et al., 2022; Wetherall et al., 2023; yao Hong et al., 2018). Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.

As you note, public distributed training methods have advanced beyond basic data parallelism (though they have not been publicly shown at large model scales because nobody has really tried yet).
While writing, I realized that this sounds a bit similar to the unilateralist’s curse. It’s not the same, but it has parallels. I’ll discuss that briefly because it’s relevant to other aspects of the situation. The unilateralist’s curse does not occur specifically due to multiple samplings; it occurs because different actors have different beliefs about the value/disvalue, and this variance in beliefs makes it more likely that one of those actors has a belief above the “do it” threshold. If each draw from the AGI urn had the same outcome, this would look a lot like a unilateralist’s curse situation where we care about variance in the actors’ beliefs. But I instead think that draws from the AGI urn are somewhat independent, and the problem is just that we should incur e.g., a 5% misalignment risk as few times as we have to.
Interestingly, a similar look at variance is part of what makes the infosecurity situation much worse for multiple projects compared to centralized AGI project: variance is bad here. I expect a single government AGI project to care about and invest in security at least as much as the average AGI company. The AGI companies have some variance in their caring and investment in security, and the lower ones will be easier to steal from. If you assume these multiple projects have similar AGI capabilities (this is a bad assumption but is basically the reason to like multiple projects for Power Concentration reasons so worth assuming here; if the different projects don’t have similar capabilities, power is not very balanced), you might then think that any of the companies getting their models stolen is similarly bad to the centralized project getting its models stolen (with a time lag I suppose, because the centralized project got to that level of capability faster).
If you are hacking a centralized AGI project, say you have a 50% chance of success. If you are hacking 3 different AGI projects, you have 3 different/independent 50% chances of success. They’re different because these projects have different security measures in place. Now sure, as indicated by one of the points in this blog post, maybe less effort goes into hacking each of the 3 projects (because you have to split your resources, and because there’s less overall interest in stealing model weights), and maybe that pushes each of these down to 33%. These numbers are obviously made up, but with them the attacker would have a 1 − (0.67^3) ≈ 70% chance of success against at least one project.
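A minimal sketch of that compounding, using the made-up numbers above (50% for a single centralized project, 33% per project when attacker attention is split across three):

```python
# Probability that at least one of n independently secured projects is compromised,
# given a per-project chance of a successful hack.
def p_any_compromise(p_per_project: float, n_projects: int) -> float:
    return 1 - (1 - p_per_project) ** n_projects

print(p_any_compromise(0.50, 1))  # centralized project: 0.50
print(p_any_compromise(0.33, 3))  # three projects at 33% each: ~0.70
```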
Unilateralist’s curse is about variance in beliefs about the value of some action. The parent comment is about taking multiple independent actions that each have a risk of very bad outcomes.
Thanks for writing this, I think it’s an important topic which deserves more attention. This post covers many arguments, a few of which I think are much weaker than you all state. But more importantly, I think you all are missing at least one important argument. I’ve been meaning to write this up, and I’ll use this as my excuse.
TL;DR: More independent AGI efforts means more risky “draws” from a pool of potential good and bad AIs; since a single bad draw could be catastrophic (a key claim about offense/defense), we need fewer, more controlled projects to minimize that danger.
The argument is basically an application of the Vulnerable World Hypothesis to AI development. You capture part of this argument in the discussion of Racing, but not the whole thing. So the setup is that building any particular AGI is drawing a ball from the urn of potential AIs. Some of these AIs are aligned, some are misaligned — we probably disagree about the proportions here but that’s not crucial, and note that the proportion depends on a bunch of other aspects of the world, such as how good our AGI alignment research is. More AGI projects means more draws from the urn and a higher likelihood of pulling out misaligned AI systems. Importantly, I think that pulling out a misaligned AGI system is more bad than pulling out an aligned AGI system is good. I think this because some key features of the world are offense-favored.
Key assumption/claim: human extinction and human loss of control are offense-favored — if there were similarly resourced actors trying to destroy humanity as to protect it, humanity would be destroyed. I have a bunch of intuitions for why this is true, to give some sense:
Humans are flesh bags that die easily and never come back to life. AIs will not be like this.
Humans care a lot about not dying, their friends and families not dying, etc., I expect extorting a small number of humans in order to gain control would simply work if one could successfully make the relevant threats.
Terrorists or others who seek to cause harm often succeed. There are many mass shootings. 8% of US presidents were assassinated in office. I don’t actually know what the average death count per attempted terrorist is; I would intuitively guess it’s between 0.5 and 10 (This Wikipedia article indicates it’s ~10, but I think you should include attempts that totally fail, even though these are not typically counted). Terrorism is very heavy tailed, which I think probably means that more capable terrorists (i.e., AIs that are at least as good as human experts, AGI+) will have high fatality rates.
There are some emerging technologies that so far seem more offense-favored to me. Maybe not 1000:1, but definitely not 1:1. Bio tech and engineered pandemics seem like this; autonomous weapons seem like this.
The strategy-stealing assumption seems false to me, partially for reasons listed in the linked post. I note that the linked post includes Paul listing a bunch of convincing-to-me ways in which strategy-stealing is false and then concluding that it’s basically true. The claim that offense is easier than defense is sorta just a version of the strategy-stealing claim, so this bullet point isn’t actually another distinct argument, just an excuse to point toward previous thinking and the various arguments there.
A couple caveats: I think killing all of humanity with current tech is pretty hard; as noted however, I think this is too high a bar because probably things like extortion are sufficient for grabbing power. Also, I think there are some defensive strategies that would actually totally work at reducing the threat from misaligned AGI systems. Most of these strategies look a lot like “centralization of AGI development”, e.g., destroying advanced computing infrastructure, controlling who uses advanced computing infrastructure and how they use it, a global treaty banning advanced AI development (which might be democratically controlled but has the effect of exercising central decision making).
So circling back to the urn: if you pull out an aligned AI system, and 3 months later somebody else pulls out a misaligned AI system, I don’t think pulling out the aligned AI system a little in advance buys you that much. The correct response to this situation is to try to make the proportion of balls weighted heavily toward aligned, AND to pull out as few as you can.
More AGI development projects means more draws from the urn because there are more actors doing this and no coordinated decision process to stop. You mention that maybe government can regulate AI developers to reduce racing. This seems like it will go poorly, and in the worlds where it goes well, I think you should maybe just call them “centralization” because they involve a central decision process deciding who can train what models when with what methods. That is, extremely involved regulations seem to effectively be centralization.
Notably, this is related but not the same as the effects from racing. More AGI projects leads to racing which leads to cutting corners on safety (higher proportion of misaligned AIs in the urn), and racing leads to more draws from the urn because of fear of losing to a competitor. But even without racing, more AGI projects means more draws from the urn.
The thing I would like to happen instead is that there is a very controlled process for drawing from the urn, where each ball is carefully inspected, and if we draw aligned AIs, we use them to do AI alignment research, i.e., increase the proportion of aligned AIs in the urn. And we don’t take more draws from the urn until we’re quite confident we’re not going to pull out a misaligned AI. Again, this is both about reducing the risk of catastrophe each time you take a risky action, and about decreasing the number of times you have to take risky actions.
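To make the two levers explicit with an illustrative equation (the 5% per-draw risk is just the example figure from above, not an estimate I’d defend):

```latex
% Probability of at least one catastrophic (misaligned) draw across n draws,
% where draw i carries misalignment risk p_i:
P(\text{catastrophe}) = 1 - \prod_{i=1}^{n} (1 - p_i)
% With a constant illustrative p = 0.05:
%   n = 3  \Rightarrow 1 - 0.95^{3}  \approx 0.14
%   n = 20 \Rightarrow 1 - 0.95^{20} \approx 0.64
% So you want both to shrink each p_i (better alignment research) and to keep n small.
```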
Summarizing: If you are operating in a domain where losses are very bad, you want to take fewer gambles. I think AGI and ASI development are such domains, and decentralized AGI development means more gambles are taken.
Noting that I spent a couple minutes pondering the quoted passage which I don’t think was a good use of time (I basically would have immediately dismissed it if I knew Claude wrote it, and I only thought about it because my prior on Buck saying true things is way higher), and I would have preferred the text not have this.
I don’t see anybody having mentioned it yet, but the recent paper about LLM Introspection seems pretty relevant. I would say that a model which performs very well at introspection (as defined there) would be able to effectively guess which jailbreak strategies were attempted.
There is now some work in that direction: https://forum.effectivealtruism.org/posts/47RH47AyLnHqCQRCD/soft-nationalization-how-the-us-government-will-control-ai
A Brief Explanation of AI Control
Sounds like a very successful hackathon! Nice work to everybody involved!
Hm, sorry, I did not mean to imply that the defense/offense ratio is infinite. It’s hard to know, but I expect it’s finite for the vast majority of dangerous technologies[1]. I do think there are times where the amount of resources and intelligence needed to do defense is too high and a civilization cannot muster them. If an asteroid were headed for Earth 200 years ago, we simply would not have been able to do anything to stop it. Asteroid defense is not impossible in principle — the defensive resources and intelligence needed are not infinite — but they are certainly above what 1825 humanity could have mustered in a few years. It’s not impossible in principle, but it’s impossible for 1825 humanity.
While defense/offense ratios are relevant, I was more so trying to make the points that these are disjunctive threats, that some might be hard to defend against (i.e., have a high defense-offense ratio), and that we’ll have to do that on a super short time frame. I think this argument goes through unless one is fairly optimistic about the defense-offense ratio for all the technologies that get developed rapidly. I think the argumentative/evidential burden for being on net optimistic about this situation is thus pretty high and, per the public arguments I have seen, has not been met.
(I think it’s possible I’ve made some heinous reasoning error that places too much burden on the optimism case, if that’s true, somebody please point it out)
To be clear, it certainly seems plausible that some technologies have a defense/offense ratio which is basically unachievable with conventional defense, and that you need to do something like mass surveillance to deal with these. E.g., triggering vacuum decay seems like the type of thing where there may not be technological responses that avert catastrophe once the decay has started; instead, the only effective defenses are ones that stop anybody from doing the thing to begin with.