Things I post should be considered my personal opinions, not those of any employer, unless stated otherwise.
Aaron_Scher
This is important work, keep it up!
I agree it’s plausible. I continue to think that defensive strategies are harder than offensive ones, except the ones that basically look like centralized control over AGI development. For example,
Provide compelling experimental evidence that standard training methods lead to misaligned power-seeking AI by default
Then what? The government steps in and stops other companies from scaling capabilities until big safety improvements have been made? That’s centralization along many axes. Or maybe all the other key decision makers in AGI projects get convinced by evidence and reason and this buys you 1-3 years until open source / many other actors reach this level of capabilities.
Sharing an alignment solution involves companies handing over valuable IP to their competitors. I don’t want to say it’s impossible, but I have definitely gotten less optimistic about this in the last year. I think in the last year we have not seen a race to the top on safety, in any way. We have not seen much sharing of safety research that is relevant to products (or like, applied alignment research). We have instead mostly seen research without direct applications: interp, model organisms, weak-to-strong generalization / scalable oversight (which is probably the closest to product relevance). Now sure, the stakes are way higher with AGI/ASI so there’s a bigger incentive to share, but I don’t want to be staking the future on these companies voluntarily giving up a bunch of secrets, which would be basically a 180 from their current strategy.
I fail to see how developing and sharing best practices for RSPs will shift the game board, except insofar as it involves key insights on technical problems (e.g., alignment research that is critical for scaling), which runs into the IP problem above. I don’t think we’ve seen a race to the top on making good RSPs, but we have definitely seen pressure to publish an RSP at all. Not enough pressure, though: the RSPs are quite weak IMO, and some frontier AI developers (Meta, xAI, maybe various Chinese orgs count) have none.
I agree that it’s plausible that “one good apple saves the bunch”, but I don’t think it’s super likely, conditional on no centralization.
Do you believe that each of the 3 things you mentioned would change the game board? I think that they are like 75%, 30%, and 20% likely to meaningfully change catastrophic risk, conditional on happening.
Training as it’s currently done needs to happen within a single cluster
I think that’s probably wrong, or at least effectively wrong. Gemini 1.0, trained about a year ago, has the following info in its technical report:
TPUv4 accelerators are deployed in “SuperPods” of 4096 chips...
TPU accelerators primarily communicate over the high speed inter-chip-interconnect, but at
Gemini Ultra scale, we combine SuperPods in multiple datacenters using Google’s intra-cluster and
inter-cluster network (Poutievski et al., 2022; Wetherall et al., 2023; yao Hong et al., 2018). Google’s
network latencies and bandwidths are sufficient to support the commonly used synchronous training
paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.

As you note, public distributed training methods have advanced beyond basic data parallelism (though they have not been publicly shown at large model scales, because nobody has really tried yet).
While writing, I realized that this sounds a bit similar to the unilateralist’s curse. It’s not the same, but it has parallels, so I’ll discuss it briefly because it’s relevant to other aspects of the situation. The unilateralist’s curse does not occur specifically due to multiple samplings; it occurs because different actors have different beliefs about the value/disvalue of the action, and this variance in beliefs makes it more likely that one of those actors has a belief above the “do it” threshold. If each draw from the AGI urn had the same outcome, this would look a lot like a unilateralist’s curse situation where we care about variance in the actors’ beliefs. But I instead think that draws from the AGI urn are somewhat independent, and the problem is just that we should incur, e.g., a 5% misalignment risk as few times as we have to.
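To make the variance point concrete, here’s a toy simulation (my own sketch, with made-up numbers): each actor gets a noisy estimate of an action’s true (negative) value and acts if their estimate looks positive. More actors and noisier beliefs make it more likely that somebody acts.

```python
import random

def p_someone_acts(n_actors, belief_sd, true_value=-1.0, trials=100_000):
    # An actor "does it" if their noisy estimate of the value is positive.
    count = 0
    for _ in range(trials):
        estimates = [random.gauss(true_value, belief_sd) for _ in range(n_actors)]
        if max(estimates) > 0:
            count += 1
    return count / trials

for n in [1, 3, 10]:
    for sd in [0.5, 1.0, 2.0]:
        print(n, sd, p_someone_acts(n, sd))
```

In the urn framing, by contrast, even identical beliefs don’t help: each extra draw adds an independent chance of a bad outcome.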
Interestingly, a similar look at variance is part of what makes the infosecurity situation much worse for multiple projects compared to a centralized AGI project: variance is bad here. I expect a single government AGI project to care about and invest in security at least as much as the average AGI company. The AGI companies have some variance in how much they care about and invest in security, and the lower ones will be easier to steal from. If you assume these multiple projects have similar AGI capabilities (a bad assumption, but it’s basically the reason to like multiple projects for power-concentration reasons, so worth assuming here; if the different projects don’t have similar capabilities, power is not very balanced), you might then think that any one of the companies getting its models stolen is similarly bad to the centralized project getting its models stolen (with a time lag, I suppose, because the centralized project got to that level of capability faster).
If you are hacking a centralized AGI project, say you have a 50% chance of success. If you are hacking 3 different AGI projects, you have 3 different/independent 50% chances of success; they’re different because these projects have different security measures in place. Now sure, as indicated by one of the points in this blog post, maybe less effort goes into hacking each of the 3 projects (because you have to split your resources, and because there’s less overall interest in stealing model weights), and maybe that pushes each of these down to 33%. These numbers are obviously made up, but they would still give a 1 − (0.67^3) ≈ 70% chance of success.
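The general arithmetic behind those made-up numbers (a quick sketch, not a real threat model):

```python
def p_any_hack_succeeds(per_project_odds):
    # Chance that at least one of several independent hacks succeeds.
    p_all_fail = 1.0
    for p in per_project_odds:
        p_all_fail *= 1 - p
    return 1 - p_all_fail

print(p_any_hack_succeeds([0.5]))               # one centralized project: 0.50
print(p_any_hack_succeeds([0.33, 0.33, 0.33]))  # three projects: ~0.70
```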
Unilateralist’s curse is about variance in beliefs about the value of some action. The parent comment is about taking multiple independent actions that each have a risk of very bad outcomes.
Thanks for writing this, I think it’s an important topic which deserves more attention. This post covers many arguments, a few of which I think are much weaker than you all state. But more importantly, I think you all are missing at least one important argument. I’ve been meaning to write this up, and I’ll use this as my excuse.
TL;DR: More independent AGI efforts means more risky “draws” from a pool of potential good and bad AIs; since a single bad draw could be catastrophic (a key claim about offense/defense), we need fewer, more controlled projects to minimize that danger.
The argument is basically an application of the Vulnerable World Hypothesis to AI development. You capture part of this argument in the discussion of Racing, but not the whole thing. So the setup is that building any particular AGI is drawing a ball from the urn of potential AIs. Some of these AIs are aligned, some are misaligned. We probably disagree about the proportions here, but that’s not crucial; note also that the proportion depends on a bunch of other aspects of the world, such as how good our AGI alignment research is. More AGI projects means more draws from the urn and a higher likelihood of pulling out misaligned AI systems. Importantly, I think that pulling out a misaligned AGI system is more bad than pulling out an aligned AGI system is good, because some of the key features of the world are offense-favored.
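In its most stripped-down form (my framing, not a formal model), the urn picture is just:

$$P(\text{at least one misaligned AGI}) = 1 - (1 - p)^n$$

where $p$ is the per-draw chance of a misaligned system and $n$ is the number of independent draws. Holding $p$ fixed, risk climbs with $n$: e.g., with $p = 0.05$, one draw is a 5% risk, while ten draws is roughly a 40% risk ($1 - 0.95^{10} \approx 0.40$).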
Key assumption/claim: human extinction and human loss of control are offense-favored — if there were similarly resourced actors trying to destroy humanity as to protect it, humanity would be destroyed. I have a bunch of intuitions for why this is true, to give some sense:
Humans are flesh bags that die easily and never come back to life. AIs will not be like this.
Humans care a lot about not dying, about their friends and families not dying, etc. I expect extorting a small number of humans in order to gain control would simply work, if one could successfully make the relevant threats.
Terrorists or others who seek to cause harm often succeed. There are many mass shootings. 8% of US presidents were assassinated in office. I don’t actually know what the average death count per attempted terrorist attack is; I would intuitively guess it’s between 0.5 and 10 (this Wikipedia article indicates it’s ~10, but I think you should include attempts that totally fail, even though these are not typically counted). Terrorism is very heavy-tailed, which I think probably means that more capable terrorists (i.e., AIs that are at least as good as human experts, AGI+) will have high fatality rates.
There are some emerging technologies that so far seem more offense-favored to me. Maybe not 1000:1, but definitely not 1:1. Bio tech and engineered pandemics seem like this; autonomous weapons seem like this.
The strategy-stealing assumption seems false to me, partially for reasons listed in the linked post. I note that the linked post includes Paul listing a bunch of convincing-to-me ways in which strategy-stealing is false and then concluding that it’s basically true. The claim that offense is easier than defense is sorta just a version of the strategy-stealing claim, so this bullet point isn’t actually another distinct argument, just an excuse to point toward previous thinking and the various arguments there.
A couple caveats: I think killing all of humanity with current tech is pretty hard; as noted however, I think this is too high a bar because probably things like extortion are sufficient for grabbing power. Also, I think there are some defensive strategies that would actually totally work at reducing the threat from misaligned AGI systems. Most of these strategies look a lot like “centralization of AGI development”, e.g., destroying advanced computing infrastructure, controlling who uses advanced computing infrastructure and how they use it, a global treaty banning advanced AI development (which might be democratically controlled but has the effect of exercising central decision making).
So circling back to the urn: if you pull out an aligned AI system, and 3 months later somebody else pulls out a misaligned AI system, I don’t think pulling out the aligned AI system a little in advance buys you that much. The correct strategy in this situation is to try to make the proportion of balls weighted heavily toward aligned, AND to pull out as few as you can.
More AGI development projects means more draws from the urn because there are more actors doing this and no coordinated decision process to stop. You mention that maybe government can regulate AI developers to reduce racing. This seems like it will go poorly, and in the worlds where it goes well, I think you should maybe just call them “centralization” because they involve a central decision process deciding who can train what models when with what methods. That is, extremely involved regulations seem to effectively be centralization.
Notably, this is related to, but not the same as, the effects of racing. More AGI projects leads to racing, which leads to cutting corners on safety (a higher proportion of misaligned AIs in the urn), and racing leads to more draws from the urn because of fear of losing to a competitor. But even without racing, more AGI projects means more draws from the urn.
The thing I would like to happen instead is that there is a very controlled process for drawing from the urn, where each ball is carefully inspected, and if we draw aligned AIs, we use them to do AI alignment research, i.e., to increase the proportion of aligned AIs in the urn. And we don’t take more draws from the urn until we’re quite confident we’re not going to pull out a misaligned AI. Again, this is both about reducing the risk of catastrophe each time you take a risky action, and about decreasing the number of times you have to take risky actions.
Summarizing: if you are operating in a domain where losses are very bad, you want to take fewer gambles. I think AGI and ASI development are such domains, and decentralized AGI development means more gambles are taken.
Noting that I spent a couple minutes pondering the quoted passage, which I don’t think was a good use of time (I basically would have immediately dismissed it if I knew Claude wrote it, and I only thought about it because my prior on Buck saying true things is way higher), and I would have preferred the text not have this.
I don’t see anybody having mentioned it yet, but the recent paper about LLM Introspection seems pretty relevant. I would say that a model which performs very well at introspection (as defined there) would be able to effectively guess which jailbreak strategies were attempted.
There is now some work in that direction: https://forum.effectivealtruism.org/posts/47RH47AyLnHqCQRCD/soft-nationalization-how-the-us-government-will-control-ai
Sounds like a very successful hackathon! Nice work to everybody involved!
Some prompts I found interesting when brainstorming LLM startups
I spent a little time thinking about making an AI startup. I generally think it would be great if more people were trying to build useful companies that directly add value, rather than racing to build AGI. Here are some of the prompts I found interesting to think about, perhaps they will be useful to other people/AI agents interested in building a startup:
What are the situations where people will benefit from easy and cheap access to expert knowledge? You’re leveraging the fact that human expert labor is hard to scale to many situations (especially when experts are rare, needs are specific, it’s awkward, or it’s too expensive, including both raw cost and the cost of finding/trusting/onboarding an expert). What are all the things you occasionally pay somebody to do, but which require them coming in person? What are problems people know they have but don’t seek out existing solutions for (because of perceived cost, awkwardness, or not knowing how)? E.g., dating profile feedback, an outfit designer.
Solve a problem that exists due to technological development, e.g., preventing the social isolation from social media, reducing various catastrophic risks during and after intelligence explosion.
Some other problem attack surfaces opened up by LLMs:
Cheaply carry out simple straightforward tasks.
Analyze data at scale.
Do tasks that there was no previous market for (e.g., tasks that provide $5 of value but take an hour, and you can’t hire people for $5/hour because they don’t want to work for that little and the overhead is high). Reasons for lack of a market: not enough money to be made, can’t trust somebody (not worth the time needed to grow trust, or substantial privacy concerns), communication cost too high (specifying the task), other overhead too high (travel, finding a person), training cost too high compared to salary (imagine it took 8 years to become a barber).
Provide cheap second opinions, potentially many of them (e.g., reviewing a low-importance piece of writing).
Some other desiderata I had (for prompting LLMs):
I want to have a clear and direct story for making people’s lives better or solving problems they have. So I have a slight preference for B2C over B2B, unless there’s a clear story for how we’re significantly helping the business in an industry that benefits people.
We don’t want to be obsoleted by the predictable products coming out of AI development companies; for instance a product that just takes ChatGPT and adds a convenient voice feature is not a good idea because that niche is likely to be met by existing developers fairly soon.
We don’t want to work on something that other well-resourced efforts are working on. Our edge is having good ideas and creative implementations, not being able to outcompete others on resource investment. We should play to our strengths and not get into a losing battle with strong existing products.
I mainly don’t want to be directly competing with existing products or services, instead I want to be creating a large amount of counterfactual value by solving a problem that nobody else has solved.
The MVP should be achievable by a team of 5 working for <6 months, ideally even a very basic MVP should be achievable in just a week or two of full-time work.
I want to be realistic: we won’t be able to solve everything or do everything. I want to aim for a fairly niche product, rather than solving a huge-scale problem like fixing medical care. That is, instead of a general medical chatbot, a better idea would be a first-aid tutor that can help people learn first-aid basics and refresh their knowledge later.
I want to be providing a service people are excited to receive. For instance, a sustainable living advisor isn’t a great idea because if it actually got people to make more sustainable decisions, that would be annoying — people don’t actually want to hear that they shouldn’t fly home to their family for the holidays, even though this is one of the more important sustainability decisions they could make.
I probably want to provide a service that is not currently provided by a simple google search. For instance, a cooking assistant is pretty much just glorified google search. I want to be providing more value than that. Services which can be provided by a simple google search are likely to be filled in by existing developers.
I do not want to be pushing the frontier of AI capabilities in dangerous domains such as: synthetic biology, cybersecurity, autonomous weapons, ML engineering and AI development, manipulation and persuasion. Generally pushing the frontier of scientific capabilities may also be in this group due to its effects on the other domains, but it is unclear.
Yep, I basically agree with those being the cruxes!
On how much of the goal reasoning is happening out loud: nowadays, I think about a lot of AI safety research as being aimed at an AI Control scenario where we are closely supervising what models are thinking about in CoT, and thus malign goal-seeking must either happen in individual forward passes (and be translated to look harmless most of the time) or in an obfuscated way in CoT (or come from rogue deployments outside the control scheme).
Studying the intrinsic goals of the models in naturalistic settings would also be interesting, but I expect this to be hard, and also less relevant to the threat models we’re worried about. I’m mainly concerned about models becoming more goal-directed with scale and someone figuring out long-horizon RL training on top of that, leading to models that are qualitatively different from current ones.
By naturalistic, I mean “from a realistic training process, even if that training is designed to create goals”, which sounds like the main threat model you said you’re worried about? If you have the time, I would push you harder on this: what is a specific story of AI catastrophe that you are trying to study/prevent?
Some quick thoughts (only skimmed the post, writing quickly), as you asked for feedback:
It looks like the main thing you’re testing is some variant of “when prompted to do goal-directed behavior, how effective is the model at satisfying the goal?” That’s a reasonable thing to investigate, but I’m not sure it would be near the top of the list of “empirical research on goal-directedness that I want to see”. I’m probably mainly interested in the deceptive alignment motivation; read the rest of this comment as focusing on that.
Aside: To state it directly, I think the main reason to study goal-directedness in this lower-validity setting (of giving models goals in prompts) is that CoT-based goal-directedness might act as a precursor for in-forward-pass goal directedness (which seems far more worrying re deceptive alignment) — so we can study it earlier. So again, reasonable to study, but if you agree with me that this is the main reason for such experiments being valid, it’s an important frame to have when thinking about this kind of work: artificially inducing goal directedness is a model-organism approach rather than a natural experiment.
Thinking out loud, a list of goal-directedness work I want to see might be the following (sub-bullets are more detailed ideas):
Are base models goal directed? Are RLHF finetuned models goal directed? (naturalistic setting)
Could look like this recent work on the consistency of model answers to values questions, but more adapted to goals you hypothesize the models to have (like how consistently do models follow a particular behavior outlined in the Model Spec)
How do RLHFed models deal with conflicting goals — do they engage in sophisticated reasoning about this or instead seem to follow simple heuristics?
To the extent these models are goal directed (including because you induce this via prompting), is anything interesting going on:
Do they goal-generalize the way we would expect? Similar to this recent work, but aimed at some risk other than reward hacking. I would be particularly interested in the time horizon over which the world is affected, as this is perhaps a proxy for a model having non-myopic goals.
Is there specification gaming or ‘in context reward hacking’ across many different settings?
As mentioned, maybe CoT and prompting provide an early warning sign for forward-pass goal-directed-ness. Is this true? How much of an early warning sign?
It looks like the settings in this post are sorta a general capability eval for a model accomplishing goals. I wonder if you think they add a ton of value over existing agent benchmarks like SWE-Bench? My intuition says you would be better off trying to focus on a narrower question that is particularly relevant to safety, like one of those I mentioned.
Sorry if this comment was rude or mean; it’s been a couple weeks and this post had no feedback even though you asked, so I figured something might be better than nothing. It looks to me like your overall approach and ways of thinking about this are good!
What’s the evidence that this document is real / written by Anthropic?
This sentence seems particularly concerning:
We believe the first two issues can be addressed by focusing on deterrence rather than pre-harm enforcement: instead of deciding what measures companies should take to prevent catastrophes (which are still hypothetical and where the ecosystem is still iterating to determine best practices), focus the bill on holding companies responsible for causing actual catastrophes.
Nice work, these seem like interesting and useful results!
High level question/comment which might be totally off: one benefit of having a single, large, SAE neuron space that each token gets projected into is that features don’t get in each other’s way, except insofar as you’re imposing sparsity. Like, your “I’m inside a parenthetical” and your “I’m attempting a coup” features will both activate in the SAE hidden layer, as long as they’re in the top k features (for some sparsity). But introducing switch SAEs breaks that: if these two features are in different experts, only one of them will activate in the SAE hidden layer (based on whatever your gating learned).
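To illustrate the worry with a toy sketch (shapes, routing, and expert layout are all made up here, not the post’s actual architecture):

```python
import torch

d_model, n_feats, k, n_experts = 16, 64, 4, 4
x = torch.randn(d_model)

# Dense top-k SAE: all features live in one pool, so the "parenthetical" and
# "coup" features can both survive, as long as each makes the global top-k.
W_dense = torch.randn(d_model, n_feats)
acts = torch.relu(x @ W_dense)
vals, idx = acts.topk(k)
dense_code = torch.zeros(n_feats)
dense_code[idx] = vals

# Switch SAE: the router commits to one expert, so a feature sitting in an
# unchosen expert is zeroed no matter how strongly it matches the input.
W_router = torch.randn(d_model, n_experts)
experts = [torch.randn(d_model, n_feats // n_experts) for _ in range(n_experts)]
chosen = int((x @ W_router).argmax())
switch_code = torch.relu(x @ experts[chosen])  # only n_feats // n_experts dims
```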
The obvious reply is “but look at the empirical results, you fool! The switch SAEs are pretty good!” And that’s fair. I weakly expect what is happening in your experiment is that similar but slightly specialized features are being learned by each expert (a testable hypothesis), and maybe you get enough of this redundancy that it’s fine, e.g., the expert with “I’m inside a parenthetical” also has a “words relevant to coups” feature, and this is enough signal for coup detection in that expert.
Again, maybe this worry is totally off or I’m misunderstanding something.
Thanks for the addition, that all sounds about right to me!
Leaving Dangling Questions in your Critique is Bad Faith
Note: I’m trying to explain an argumentative move that I find annoying and sometimes make myself; this explanation isn’t very good, unfortunately.
Example
Them: This effective altruism thing seems really fraught. How can you even compare two interventions that are so different from one another?
Explanation of Example
I think the way the speaker poses the above question is not as a stepping stone for actually answering the question, it’s simply as a way to cast doubt on effective altruists. My response is basically, “wait, you’re just going to ask that question and then move on?! The answer really fucking matters! Lives are at stake! You are clearly so deeply unserious about the project of doing lots of good, such that you can pose these massively important questions and then spend less than 30 seconds trying to figure out the answer.” I think I might take these critics more seriously if they took themselves more seriously.
Description of Dangling Questions
A common move I see people make when arguing or criticizing something is to pose a question that they think the original thing has answered incorrectly or is not trying sufficiently hard to answer. But then they kinda just stop there. The implicit argument is something like “The original thing didn’t answer this question sufficiently, and answering this question sufficiently is necessary for the original thing to be right.”
But importantly, the critics usually don’t actually argue that: they don’t argue for some alternative answer to the original question (and when they do, it usually isn’t compelling), and they don’t really try to argue that this question is so fundamental either.
One issue with Dangling Questions is that they focus the subsequent conversation on a subtopic that may not be a crux for either party, and this probably makes the subsequent conversation less useful.
Example
Me: I think LLMs might scale to AGI.
Friend: I don’t think LLMs are actually doing planning, and that seems like a major bottleneck to them scaling to AGI.
Me: What do you mean by planning? How would you know if LLMs were doing it?
Friend: Uh…idk
Explanation of Example
I think I’m basically shifting the argumentative burden onto my friend when it falls on both of us. I don’t have a good definition of planning or a way to falsify whether LLMs can do it — and that’s a hole in my beliefs just as it is a hole in theirs. And sure, I’m somewhat interested in what they say in response, but I don’t expect them to actually give a satisfying answer here. I’m posing a question I have no intention of answering myself and implying it’s important for the overall claim of LLMs scaling to AGI (my friend said it was important for their beliefs, but I’m not sure it’s actually important for mine). That seems like a pretty epistemically lame thing to do.
Traits of “Dangling Questions”
They are used in a way that implies the thing being criticized is wrong, but this argument is not made convincingly.
The author makes minimal effort to answer the question with an alternative. Usually they simply pose it. The author does not seem to care very much about having the correct answer to the question.
The author usually implies that this question is particularly important to the overall thing being criticized, but does not make this case.
These questions share a lot in common with the paradigm criticisms discussed in Criticism Of Criticism Of Criticism, but I think they are distinct in that they can be quite narrow.
One of the main things these questions seem to do is raise the reader’s uncertainty about the core thing being criticized, similar to the Just Asking Questions phenomenon. To me, Dangling Questions seem like a more intellectual version of Just Asking Questions — much more easily disguised as a good argument.
Here’s another example, though it’s imperfect.
Example
From an AI Snake Oil blog post:
Research on scaling laws shows that as we increase model size, training compute, and dataset size, language models get “better”. … But this is a complete misinterpretation of scaling laws. What exactly is a “better” model? Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence. Of course, perplexity is more or less irrelevant to end users — what matters is “emergent abilities”, that is, models’ tendency to acquire new capabilities as size increases.
Explanation of Example
The argument being implied is something like “scaling laws are only about perplexity, but perplexity is different from the metric we actually care about — how much? who knows? — so you should ignore everything related to perplexity, also consider going on a philosophical side-quest to figure out what ‘better’ really means. We think ‘better’ is about emergent abilities, and because they’re emergent we can’t predict them so who knows if they will continue to appear as we scale up”. In this case, the authors have ventured an answer to their Dangling Question, “what is a ‘better’ model?”: they’ve said it’s one with more emergent capabilities than a previous model. This answer seems flat out wrong to me; acceptable answers include: downstream performance, self-reported usefulness to users, how much labor-time it could save when integrated in various people’s work, ability to automate 2022 job tasks, being more accurate on factual questions, and much more. I basically expect nobody to answer the question “what does it mean for one AI system to be better than another?” with “the second has more capabilities that were difficult to predict based on the performance of smaller models and seem to increase suddenly on a linear-performance, log-compute plot”.
Even given the answer “emergent abilities”, the authors fail to actually argue that we don’t have a scaling precedent for these. Again, I think the focus on emergent abilities is misdirected, so I’ll instead discuss the relationship between perplexity and downstream benchmark performance. I think this is fair game both because downstream performance is a legitimate answer to the “what counts as ‘better’?” question and because of the original line “Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence”. That quoted claim is technically true but, in this context, highly misleading, because we can in turn draw clear relationships between perplexity and downstream benchmark performance; here are three recent papers which do so, and here are even more studies that relate compute directly to downstream performance on non-perplexity metrics. Note that some of these are cited in the blog post.

I will also note that this seems like one example of a failure I’ve seen a few times, where people conflate “scaling laws” with what I would call “scaling trends”. Scaling laws refer to specific equations that predict various metrics (e.g., perplexity) from model inputs such as number of parameters and amount of data; scaling trends are the more general phenomenon we observe, that scaling up just seems to work, and in somewhat predictable ways. The scaling laws are useful for the predicting, but whether we have those specific equations or not has no effect on the trend we are observing; the equations just yield a bit more precision. Yes, scaling laws relating parameters and data to perplexity or training loss do not directly give you info about downstream performance, but we seem to be making decent progress on the (imo still not totally solved) problem of relating perplexity to downstream performance, and together these mean we have somewhat predictable scaling trends for metrics that do matter.
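For concreteness, by “scaling law” I mean something like the Chinchilla-style fit (Hoffmann et al., 2022), which predicts loss from parameter count $N$ and training tokens $D$:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where $E, A, B, \alpha, \beta$ are fitted constants. The “scaling trend” is the broader observation that loss keeps falling as $N$ and $D$ grow, whether or not you bother fitting those constants.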
Example
Here’s another example from that blog post where the authors don’t literally pose a question, but they are still doing the Dangling Question thing in many ways. (context is referring to these posts):
Also, like many AI boosters, he conflates benchmark performance with real-world usefulness.
Explanation of Example
(Perhaps it would be better to respond to the linked AI Snake Oil piece, but that’s a year old and lacks lots of important evidence we have now.) I view the move being made here as posing the question “but are benchmarks actually useful to real world impact?”, assuming the answer is no (or poorly arguing so in the linked piece), and going on about your day. It’s obviously the case that benchmarks are not the exact same as real world usefulness, but the question of how closely they’re related isn’t some magic black box of un-solvability! If the authors of this critique want to complain about the conflation between benchmark performance and real-world usefulness, they should actually bring the receipts showing that these are not related constructs and that relying on benchmarks would lead us astray. I think when you actually try that, you get an answer like: benchmark scores seem worse than users’ reported experience and reported usefulness in real world applications, but there is certainly a positive correlation here; we can explain some of the gap via techniques like few-shot prompting that are often used for benchmarks, a small amount via dataset contamination, and probably much of the gap comes from a validity gap where benchmarks are easy to assess but unrealistic; thankfully we have user-based evaluations like LMSYS that show a solid correlation between benchmark scores and user experience, and so on. (If I actually wanted to make the argument the authors were making, I would be spending >5 paragraphs on it and elaborating on all of the evidence mentioned above, including talking more about real world impacts; this is actually a difficult question, and the above answer is illustrative rather than definitive.)
Caveats and Potential Solutions
There is room for questions in critiques. Perfect need not be the enemy of good when making a critique. Dangling Questions are not always made in bad faith.
Many of the people who pose Dangling Questions like this are not trying to act in bad faith. Sometimes they are just unserious about the overall question, and they don’t care much about getting to the right answer. Sometimes Dangling Questions are a response to being confused and not having tons of time to think through all the arguments, e.g., they’re a psychological response something like “a lot feels wrong about this, here are some questions that hint at what feels wrong to me, but I can’t clearly articulate it all because that’s hard and I’m not going to put in the effort”.
My guess at a mental move which could help here: when you find yourself posing a question in the context of an argument, ask whether you care about the answer, ask whether you should spend a few minutes trying to determine the answer, ask whether the answer to this question would shift your beliefs about the overall argument, ask whether the question puts undue burden on your interlocutor.
If you’re thinking quickly and aren’t hoping to construct a super solid argument, it’s fine to have Dangling Questions, but if your goal is to convince others of your position, you should try to answer your key questions, and you should justify why they matter to the overall argument.
Another example of me posing a Dangling Question in this:
What happens to OpenAI if GPT-5 or the ~5b training run isn’t much better than GPT-4? Who would be willing to invest the money to continue? It seems like OpenAI either dissolves or gets acquired.
Explanation of Example
(I’m not sure equating GPT-5 with a ~5b training run is right). In the above quote, I’m arguing against The Scaling Picture by asking whether anybody will keep investing money if we see only marginal gains after the next (public) compute jump. I think I spent very little time trying to answer this question, and that was lame (though acceptable given this was a Quick Take and not trying to be a strong argument). I think for an argument around this to actually go through, I should argue: without much larger dollar investments, The Scaling Picture won’t hold; those dollar investments are unlikely conditional on GPT-5 not being much better than GPT-4. I won’t try to argue these in depth, but I do think some compelling evidence is that OpenAI is rumored to be at ~$3.5 billion annualized revenue, and this plausibly justifies considerable investment even if the GPT-5 gain over this isn’t tremendous.
I agree that repeated training will change the picture somewhat. One thing I find quite nice about the linked Epoch paper is that the range of tokens is an order of magnitude, and even though many people have ideas for getting more data (common things I hear include “use private platform data like messaging apps”), most of these don’t change the picture because they don’t move things more than an order of magnitude, and the scaling trends want more orders of magnitude, not merely 2x.
Repeated data is the type of thing that plausibly adds an order of magnitude or maybe more.
I sometimes want to point at a concept that I’ve started calling The Scaling Picture. While it’s been discussed at length (e.g., here, here, here), I wanted to give a shot at writing a short version:
The picture:
We see improving AI capabilities as we scale up compute; projecting the last few years of progress in LLMs forward might give us AGI (transformative economic/political/etc. impact similar to the industrial revolution; AI that is roughly human-level or better on almost all intellectual tasks) later this decade. (Note: the picture is not about specific capabilities so much as the general trajectory.)
Relevant/important downstream capabilities improve as we scale up pre-training compute (size of model and amount of data), although for some metrics there are very sublinear returns — this is the current trend. Therefore, you can expect somewhat predictable capability gains in the next few years as we scale up spending (increase compute), and develop better algorithms / efficiencies.
AI capabilities in the deep learning era are the result of three inputs: data, compute, algorithms. Keeping algorithms the same, and scaling up the others, we get better performance — that’s what scaling means. We can lump progress in data and algorithms together under the banner “algorithmic progress” (i.e., how much intelligence can you get per compute) and then to some extent we can differentiate the source of progress: algorithmic progress is primarily driven by human researchers, while compute progress is primarily driven by spending more money to buy/rent GPUs. (this may change in the future). In the last few years of AI history, we have seen massive gains in both of these areas: it’s estimated that the efficiency of algorithms has improved about 3x/year, and the amount of compute used has increased 4.1x/year. These are ludicrous speeds relative to most things in the world.
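Back-of-the-envelope, if those two gains compound multiplicatively (a simplifying assumption on my part):

```python
algo_gain = 3.0     # estimated algorithmic efficiency gain per year
compute_gain = 4.1  # estimated growth in training compute per year

effective = algo_gain * compute_gain
print(f"~{effective:.0f}x/year effective compute")   # ~12x/year
print(f"~{effective**4:,.0f}x over four years")      # ~23,000x
```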
Edit to add: This paper seems like it might explain that breakdown better.
Edit to add: The below arguments are just supposed to be pointers toward longer argument one could make, but the one sentence version usually isn’t compelling on its own.
Arguments for:
Scaling laws (mathematically predictable relationship between pretraining compute and perplexity) have held for ~12 orders of magnitude already
We are moving through ‘orders of magnitude of compute’ quickly, so lots of probability mass should be soon (this argument is more involved, following from having uncertainty over the orders of magnitude of compute that might be necessary for AGI, like the approach taken here; see here for discussion)
Once you get AIs that can speed up AI progress meaningfully, progress on algorithms could go much faster, e.g., by AIs automating the role of researchers at OpenAI. You also get compounding economic returns that allow compute to grow — AIs that can be used to make a bunch of money, and that money can be put into compute. It seems plausible that you can get to that level of AI capabilities in the next few orders of magnitude, e.g., GPT-5 or GPT-6. Automated researchers are crazy.
Moore’s law has held for a long time. Edit to add: I think a reasonable breakdown for the “compute” category mentioned above is “money spent” and “FLOP purchasable per dollar”. While Moore’s Law is technically about the density of transistors, the thing we likely care more about is FLOP/$, which follows similar trends.
Many people at AGI companies think this picture is right, see e.g., this, this, this (can’t find an aggregation)
Arguments against:
Might run out of data. There are estimated to be 100T-1000T internet tokens, we will likely hit this level in a couple years.
Might run out of money — we’ve seen ~$100m training runs, we’re likely at $100m–1b this year, tech R&D budgets are ~$30b, and governments could fund $1T. One way to avoid this ‘running out of money’ problem is if you get AIs that speed up algorithmic progress sufficiently.
Scaling up is a non-trivial engineering problem, and it might cause slowdowns due to, e.g., GPU failures and the difficulty of parallelizing across thousands of GPUs
Revenue might just not be that big and investors might decide it’s not worth the high costs
OTOH, automating jobs is a big deal if you can get it working
Marginal improvements (maybe) for huge increased costs; bad ROI.
There are numerous other economics arguments against, mainly arguing that huge investments in AI will not be sustainable, see e.g., here
Maybe LLMs are missing some crucial thing
Not doing true generalisation to novel tasks in the ARC-AGI benchmark
Not able to learn on the fly — maybe long context windows or other improvements can help
Lack of embodiment might be an issue
This is much faster than many AI researchers are predicting
This runs counter to many methods of forecasting AI development
Will be energy intensive — might see political / social pressures to slow down.
We might see slowdowns due to safety concerns.
Neat idea. I notice that this looks similar to dealing with many-shot jailbreaking:
For jailbreaking you are trying to learn the policy “Always imitate/generate-from a harmless assistant”; here you are trying to learn “Always imitate a safe human”. In both, your model has some prior toward outputting harmful next tokens, and the context provides an update toward a higher probability of outputting harmful text (because of seeing previous examples of the assistant doing so, or because the previous generations came from an AI). And in both cases we would like some training technique that causes the model’s posterior on harmful next tokens to be low.
I’m not sure there’s too much else of note about this similarity, but it seemed worth noting because maybe progress on one can help with the other.
Cool! I’m not very familiar with the paper so I don’t have direct feedback on the content — seems good. But I do think I would have preferred a section at the end with your commentary / critiques of the paper, also that’s potentially a good place to try and connect the paper to ideas in AI safety.
Thanks for your continued engagement.
I appreciate your point about compelling experimental evidence, and I think it’s important that we’re currently at a point with very little of that evidence. I still feel a lot of uncertainty here, and I expect the evidence to basically always be super murky and for interpretations to be varied/controversial, but I do feel more optimistic than before reading your comment.
I don’t expect this to be a very large effect. It feels similar to an argument like “company A will be better on ESG dimensions and therefore more customers will switch to using it”. Doing a quick review of the literature on that, it seems like there’s a small but notable change in consumer behavior for ESG-labeled products. In the AI space, it doesn’t seem to me like any customers care about OpenAI’s safety team disappearing (except a few folks in the AI safety world). In this particular case, I expect the technical argument needed to demonstrate that some family of AI systems is aligned while others are not is a really complicated argument; I expect fewer than 500 people would be able to actually verify such an argument (or the initial “scalable alignment solution”), maybe zero people. I realize this is a bit of a nit because you were just gesturing toward one of many ways it could be good to have an alignment solution.
I endorse arguing for alternative perspectives and appreciate you doing it. And I disagree with your synthesis here.