“ He grew up in a Jewish family in Europe.[9] Helberg is openly gay.[10] He married American investor Keith Rabois in a 2018 ceremony officiated by Sam Altman.”
Might this be an angle to understand the influence that Sam Altman has on recent developments in the US government?
This chapter on AI follows immediately after the year in review, I went and checked the previous few years’ annual reports to see what the comparable chapters were about, they are
2023: China’s Efforts To Subvert Norms and Exploit Open Societies
2022: CCP Decision-Making and Xi Jinping’s Centralization Of Authority
2021: U.S.-China Global Competition (Section 1: The Chinese Communist Party’s Ambitions and Challenges at its Centennial
2020: U.S.-China Global Competition (Section 1: A Global Contest For Power and Influence: China’s View of Strategic Competition With the United States)
And this year it’s Technology And Consumer
Product Opportunities and Risks
(Chapter 3: U.S.-China Competition in
Emerging Technologies)
Reminds of when Richard Ngo said something along the lines of “We’re not going to be bottlenecked by politicians not caring about AI safety. As AI gets crazier and crazier everyone would want to do AI safety, and the question is guiding people to the right AI safety policies”
We’re not going to be bottlenecked by politicians not caring about AI safety. As AI gets crazier and crazier everyone would want to do AI safety, and the question is guiding people to the right AI safety policies
I think we’re seeing more interest in AI, but I think interest in “AI in general” and “AI through the lens of great power competition with China” has vastly outpaced interest in “AI safety”. (Especially if we’re using a narrow definition of AI safety; note that people in DC often use the term “AI safety” to refer to a much broader set of concerns than AGI safety/misalignment concerns.)
I do think there’s some truth to the quote (we are seeing more interest in AI and some safety topics), but I think there’s still a lot to do to increase the salience of AI safety (and in particular AGI alignment) concerns.
’The report doesn’t go into specifics but the idea seems to be to build / commandeer the computing resources to scale to AGI, which could include compelling the private labs to contribute talent and techniques.
DX rating is the highest priority DoD procurement standard. It lets DoD compel companies, set their own price, skip the line, and do basically anything else they need to acquire the good in question.′ https://x.com/hamandcheese/status/1858902373969564047
I suspect current approaches probably significantly or even drastically under-elicit automated ML research capabilities.
I’d guess the average cost of producing a decent ML paper is at least 10k$ (in the West, at least) and probably closer to 100k’s $.
In contrast, Sakana’s AI scientist cost on average 15$/paper and .50$/review. PaperQA2, which claims superhuman performance at some scientific Q&A and lit review tasks, costs something like 4$/query. Other papers with claims of human-range performance on ideation or reviewing also probably have costs of <10$/idea or review.
Even the auto ML R&D benchmarks from METR or UK AISI don’t give me at all the vibes of coming anywhere near close enough to e.g. what a 100-person team at OpenAI could accomplish in 1 year, if they tried really hard to automate ML.
A fairer comparison would probably be to actually try hard at building the kind of scaffold which could use ~10k$ in inference costs productively. I suspect the resulting agent would probably not do much better than with 100$ of inference, but it seems hard to be confident. And it seems harder still to be confident about what will happen even in just 3 years’ time, given that pretraining compute seems like it will probably grow about 10x/year and that there might be stronger pushes towards automated ML.
This seems pretty bad both w.r.t. underestimating the probability of shorter timelines and faster takeoffs, and in more specific ways too. E.g. we could be underestimating by a lot the risks of open-weights Llama-3 (or 4 soon) given all the potential under-elicitation.
In contrast, Sakana’s AI scientist cost on average 15$/paper and .50$/review.
The Sakana AI stuff is basically total bogus, as I’ve pointed out on like 4 other threads (and also as Scott Alexander recently pointed out). It does not produce anything close to fully formed scientific papers. It’s output is really not better than just prompting o1 yourself. Of course, o1 and even Sonnet and GPT-4 are very impressive, but there is no update to be made after you’ve played around with that.
I agree that ML capabilities are under-elicited, but the Sakana AI stuff really is very little evidence on that, besides someone being good at marketing and setting up some scaffolding that produces fake prestige signals.
It does not produce anything close to fully formed scientific papers. It’s output is really not better than just prompting o1 yourself. Of course, o1 and even Sonnet and GPT-4 are very impressive, but there is no update to be made after you’ve played around with that.
(Again) I think this is missing the point that we’ve now (for the first time, to my knowledge) observed an early demo the full research workflow being automatable, as flawed as the outputs might be.
I completely agree, and we should just obviously build an organization around this. Automating alignment research while also getting a better grasp on maximum current capabilities (and a better picture of how we expect it to grow).
(This is my intention, and I have had conversations with Bogdan about this, but I figured I’d make it more public in case anyone has funding or ideas they would like to share.)
A fairer comparison would probably be to actually try hard at building the kind of scaffold which could use ~10k$ in inference costs productively. I suspect the resulting agent would probably not do much better than with 100$ of inference, but it seems hard to be confident.
Also, there are notable researchers and companies working on developing ‘a truly general way of scaling inference compute’ right now and I think it would be cautious to consider what happens if they succeed.
(This also has implications for automating AI safety research).
(This also has implications for automating AI safety research).
To spell it out more explicitly, the current way of scaling inference (CoT) seems pretty good vs. some of the most worrying threat models, which often depend on opaque model internals.
hmm. i think you’re missing eliezer’s point. the idea was never that AI would be unable to identify actions which humans consider good, but that the AI would not have any particular preference to take those actions.
But my point isn’t just that the AI is able to produce similar ratings to humans’ for aesthetics, etc., but that it also seems to do so through at least partially overlapping computational mechanisms to humans’, as the comparisons to fMRI data suggest.
Agree that it doesn’t imply caring for. But I think given cumulating evidence for human-like representations of multiple non-motivational components of affect, one should also update at least a bit on the likelihood of finding / incentivizing human-like representations of the motivational component(s) too (see e.g. https://en.wikipedia.org/wiki/Affect_(psychology)#Motivational_intensity_and_cognitive_scope).
Even if Eliezer’s argument in that Twitter thread is completely worthless, it remains the case that “merely hoping” that the AI turns out nice is an insufficiently good argument for continuing to create smarter and smarter AIs. I would describe as “merely hoping” the argument that since humans (in some societies) turned out nice (even though there was no designer that ensured they would), the AI might turn out nice. Also insufficiently good is any hope stemming from the observation that if we pick two humans at random out of the humans we know, the smarter of the two is more likely than not to be the nicer of the two. I certainly do not want the survival of the human race to depend on either one of those two hopes or arguments! Do you?
Eliezer finds posting on the internet enjoyable, like lots of people do. He posts a lot about, e.g., superconductors and macroeconomic policy. It is far from clear to me that he consider this Twitter thread to be relevant to the case against continuing to create smarter AIs. But more to the point: do you consider it relevant?
Contra both the ‘doomers’ and the ‘optimists’ on (not) pausing. Rephrased: RSPs (done right) seem right.
Contra ‘doomers’. Oversimplified, ‘doomers’ (e.g. PauseAI, FLI’s letter, Eliezer) ask(ed) for pausing now / even earlier - (e.g. the Pause Letter). I expect this would be / have been very much suboptimal, even purely in terms of solving technical alignment. For example, Some thoughts on automating alignment research suggests timing the pause so that we can use automated AI safety research could result in ‘[...] each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.’ We clearly don’t have such automated AI safety R&D capabilities now, suggesting that pausing later, when AIs are closer to having the required automated AI safety R&D capabilities would be better. At the same time, current models seem very unlikely to be x-risky (e.g. they’re still very bad at passing dangerous capabilities evals), which is another reason to think pausing now would be premature.
Contra ‘optimists’. I’m more unsure here, but the vibe I’m getting from e.g. AI Pause Will Likely Backfire (Guest Post) is roughly something like ‘no pause ever’; largely based on arguments of current systems seeming easy to align / control. While I agree with the point that current systems do seem easy to align / control and I could even see this holding all the way up to ~human-level automated AI safety R&D, I can easily see scenarios where around that time things get scary quickly without any pause. For example, similar arguments to those about the scalability of automated AI safety R&D suggest automated AI capabilities R&D could also be scaled up significantly. For example, figures like those in Before smart AI, there will be many mediocre or specialized AIs suggest very large populations of ~human-level automated AI capabilities researchers could be deployed (e.g. 100x larger than the current [human] population of AI researchers). Given that even with the current relatively small population, algorithmic progress seems to double LM capabilities ~every 8 months, it seems like algorithmic progress could be much faster with 100x larger populations, potentially leading to new setups (e.g. new AI paradigms, new architectures, new optimizers, synthetic data, etc.) which could quite easily break the properties that make current systems seem relatively easy / safe to align. In this scenario, pausing to get this right (especially since automated AI safety R&D would also be feasible) seems like it could be crucial.
At least Eliezer has been extremely clear that he is in favor of a stop not a pause (indeed, that was like the headline of his article “Pausing AI Developments Isn’t Enough. We Need to Shut it All Down”), so I am confused why you list him with anything related to “pause”.
My guess is me and Eliezer are both in favor of a pause, but mostly because a pause seems like it would slow down AGI progress, not because the next 6 months in-particular will be the most risky period.
At the same time, current models seem very unlikely to be x-risky (e.g. they’re still very bad at passing dangerous capabilities evals), which is another reason to think pausing now would be premature.
The relevant criterion is not whether the current models are likely to be x-risky (it’s obviously far too late if they are!), but whether the next generation of models have more than an insignificant chance of being x-risky together with all the future frameworks they’re likely to be embedded into.
Given that the next generations are planned to involve at least one order of magnitude more computing power in training (and are already in progress!) and that returns on scaling don’t seem to be slowing, I think the total chance of x-risk from those is not insignificant.
I agree with some points here Bogdan, but not all of them.
I do think that current models are civilization-scale-catastrophe-risky (but importantly not x-risky!) from a misuse perspective, but not yet from a self-directed perspective. Which means neither Alignment nor Control are currently civilization-scale-catastrophe-risky, much less x-risky.
I also agree that pausing now would be counter-productive. My reasoning for this is that I agree with Samo Burja about some key points which are relevant here (while disagreeing with his conclusions due to other points).
To quote myself:
I agree with [Samo’s] premise that AGI will require fundamental scientific advances beyond currently deployed tech like transformer LLMs.
I agree that scientific progress is hard, usually slow and erratic, fundamentally different from engineering or bringing a product to market.
I agree with [Samo’s] estimate that the current hype around chat LLMs, and focus on bringing better versions to market, is slowing fundamental scientific progress by distracting top AI scientists from pursuit of theoretical advances.
Think about how you’d expect these factors to change if large AI training runs were paused. I think you might agree that this would likely result in a temporary shift in much of the top AI scientist talent to making theoretical progress. They’d want to be ready to come in strong after the pause was ended, with lots of new advances tested at small scale. I think this would actually result more high quality scientific thought directed at the heart of the problem of AGI, and thus make AGI very likely to be achieved sooner after the pause ends than it otherwise would have been.
I would go even farther, and make the claim that AGI could arise during a pause on large training runs. I think that the human brain is not a supercomputer, my upper estimate for ‘human brain inference’ is about at the level of a single 8x A100 server. Less than an 8x H100 server. Also, I have evidence from analysis of the long-range human connectome (long range axons are called tracts, so perhaps I should call this a ‘tractome’). [Hah, I just googled this term I came up with just now, and found it’s already in use, and that it brings up some very interesting neuroscience papers. Cool.] Anyway… I was saying, this evidence shows that the range of bandwidth (data throughput in bits per second) between two cortical regions in the human brain is typically around 5 mb/s, and maxes out at about 50 mb/s. In other words, well within range for distributed federated training runs to work over long distance internet connections. So unless you are willing to monitor the entire internet so robustly that nobody can scrape together the equivalent compute of an 8X A100 server, you can’t fully block AGI.
Of course, if you wanted to train the AGI in a reasonable amount of time, you’d want to do a parallel run of much more than a single inference instance of compute. So yeah, it’d definitely make things inconvenient if an international government were monitoring all datacenters… but far from impossible.
For the same reason, I don’t think a call to ‘Stop AI development permanently’ works without the hypothetical enforcement agency literally going around the world confiscating all personal computers and shutting down the internet. Not gonna happen, why even advocate for such a thing? Makes me think that Eliezer is advocating for this in order to have some intended effect other than this on the world.
Quick take on o1: overall, it’s been a pretty good day. Likely still sub-ASL-3, (opaque) scheming still seems very unlikely because the prerequisites still don’t seem there. CoT-style inference compute playing a prominent role in the capability gains is pretty good for safety, because differentially transparent. Gains on math and code suggest these models are getting closer to being usable for automated safety research (also for automated capabilities research, unfortunately).
CoT-style inference compute playing a prominent role in the capability gains is pretty good for safety
CoT inference looks more like the training surface, not essential part of resulting cognition after we take one more step following such models. Orion is reportedly (being) pretrained on these reasoning traces, and if it’s on the order of 50 trillion tokens, that’s about as much as there is natural text data of tolerable quality in the world available for training. Contrary to the phrasing, what transformers predict is in part distant future tokens within a context, not proximate “next tokens” that follow immediately after whatever the prediction must be based on.
So training on reasoning traces should teach the models concepts that let them arrive at the answer faster, skipping the avoidable parts of the traces and compressing a lot of the rest into less scrutable activations. The models trained at the next level of scale might be quite good at that, to the extent not yet known from experience with the merely GPT-4 scale models.
Edit: It looks like the instrumentally convergent reasoning was because of the prompt, so I roll back my updates on instrumental convergence being likely:
‘We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics, or whether these different goals are fundamentally at odds, as Merrill & Sabharwal (2023a) suggest.’
We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics
Surely fundamentally at odds? You can’t spend a while thinking without spending a while thinking.
Of course, the lunch still might be very cheap by only spending a while thinking a fraction of the time or whatever.
The post argues that there is a latency limit at 2e31 FLOP, and I’ve found it useful to put this scale into perspective.
Current public models such as Llama 3 405B are estimated to be trained with ~4e25 flops , so such a model would require 500,000 x more compute. Since Llama 3 405B was trained with 16,000 H-100 GPUs, the model would require 8 billion H-100 GPU equivalents, at a cost of $320 trillion with H-100 pricing (or ~$100 trillion if we use B-200s). Perhaps future hardware would reduce these costs by an order of magnitude, but this is cancelled out by another factor; the 2e31 limit assumes a training time of only 3 months. If we were to build such a system over several years and had the patience to wait an additional 3 years for the training run to complete, this pushes the latency limit out by another order of magnitude. So at the point where we are bound by the latency limit, we are either investing a significant percentage of world GDP into the project, or we have already reached ASI at a smaller scale of compute and are using it to dramatically reduce compute costs for successor models.
Of course none of this analysis applies to the earlier data limit of 2e28 flop, which I think is more relevant and interesting.
“A recent paper which was published only a few days before the publication of our own work, Zhang et al. (2024), finds a scaling of B = 17.75 D^0.47 (in units of tokens). If we rigorously take this more aggressive scaling into account in our model, the fall in utilization is pushed out by two orders of magnitude; starting around 3e30 instead of 2e28. Of course, even more aggressive scaling might be possible with methods that Zhang et al. (2024) do not explore, such as using alternative optimizers.”
I haven’t looked carefully at Zhang et al., but assuming their analysis is correct and the data wall is at 3e30 FLOP, it’s plausible that we hit resource constraints ($10-100 trillion training runs, 2-20 TW power required) before we hit the data movement limit.
Speculatively, this might also differentially incentivize (research on generalized) inference scaling, with various potential strategic implications, including for AI safety (current inference scaling methods tend to be tied to CoT and the like, which are quite transparent) and for regulatory frameworks/proliferation of dangerous capabilities.
current inference scaling methods tend to be tied to CoT and the like, which are quite transparent
Aschenbrenner in Situational Awareness predicts illegible chains of thought are going to prevail because they are more efficient. I know of one developer claiming to do this (https://platonicresearch.com/) but I guess there must be many.
the o1 results have illustrated specialized inference scaling laws, for model capabilities in some specialized domains (e.g. math); notably, these don’t seem to generally hold for all domains—e.g. o1 doesn’t seem better than gpt4o at writing;
there’s ongoing work at OpenAI to make generalized inference scaling work;
to the best of my awarenes, current regulatory frameworks/proposals (e.g. the EU AI Act, the Executive Order, SB 1047) frame the capabilities of models in terms of (pre)training compute and maybe fine-tuning compute (e.g. if (pre)training FLOP > 1e26, the developer needs to take various measures), without any similar requirements framed in terms of inference compute; so current regulatory frameworks seem unprepared for translating between indefinite amounts of inference compute and pretraining compute (capabilities);
this would probably be less worrying in the case of models behind APIs, since it should be feasible to monitor for misuse (e.g. CBRN), especially since current inference scaling methods (e.g. CoT) seem relatively transparent; but this seems different for open-weights or leaked models;
notably, it seems like Llama-3 405b only requires 8 good GPUs to run on one’s own hardware (bypassing any API), apparently costing <100k$; the relatively low costs could empower a lot of bad actors, especially since guardrails are easy to remove and potentially more robust methods like unlearning still seem mostly in the research phase;
this would also have consequences for (red-teaming) evals; if running a model for longer potentially leads to more (misuse) uplift, it becomes much harder to upper-bound said uplift;
also, once a model’s weights are out, it might be hard to upperbound all the potential inference uplift that future fine-tunes might provide; e.g. suppose it did become possible to fine-tune 4o into something even better than o1, that could indeed do almost boundless inference scaling; this might then mean that somebody could then apply the same fine-tuning to Llama-3 405b and release that model open-weights, and then (others) mights potentially be able to misuse it in very dangerous ways using a lot of inference compute (but still comparatively low vs. pretraining compute);
OTOH, I do buy the claims from https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/ that until now open-weights have probably been net beneficial for safety, and I expect this could easily keep holding in the near-term (especially as long as open-weights models stay somewhat behind the SOTA); so I’m conflicted about what should be done, especially given the uncertainty around the probability of making generalized inference scaling work and over how much more dangerous this could make current or future open-weights releases; one potentially robust course of action might be to keep doing research on unlearning methods, especially ones robust to fine-tuning, like in Tamper-Resistant Safeguards for Open-Weight LLMs, which might be sufficient to make misuse (especially in some specialized domains, e.g. CBRN) as costly as needing to train from scratch
For similar reasons to the discussion here about why individuals and small businesses might be expected to be able to (differentially) contribute to LM scaffolding (research), I expect them to be able to differentially contribute to [generalized] inference scaling [research]; plausibly also the case for automated ML research agents. Also relevant: Before smart AI, there will be many mediocre or specialized AIs.
Quick take: on the margin, a lot more research should probably be going into trying to produce benchmarks/datasets/evaluation methods for safety (than more directly doing object-level safety research).
Including differentially vs. doing the full stack of AI safety work—because I expect a lot of such work could be done by automated safety researchers soon, with the right benchmarks/datasets/evaluation methods. This could also make it much easier to evaluate the work of automated safety researchers and to potentially reduce the need for human feedback, which could be a significant bottleneck.
Better proxies could also make it easier to productively deploy more inference compute—e.g. from Large Language Monkeys: Scaling Inference Compute with Repeated Sampling: ‘When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.’ Similar findings in other references, e.g. in Trading Off Compute in Training and Inference.
Of course, there’s also the option of using automated/augmented safety research to produce such benchmarks, but it seems potentially particularly useful to have them ahead of time.
This also seems like an area where we can’t piggyback off of work from capabilities researchers, like will probably be significantly the case for applying automated researchers (to safety).
automating research that ‘tackles the hard bits’ seems to probably be harder (happen chronologically later) and like it might be more bottlenecked by good (e.g. agent foundations) human researcher feedback—which does suggest it might be valuable to recruit more agent foundations researchers today
but my impression is that recruiting/forming agent foundations researchers itself is much harder and has much worse feedback loops
I expect if the feedback loops are good enough, the automated safety researchers could make enough prosaic safety progress, feeding into agendas like control and/or conditioning predictive models (+ extensions to e.g. steering using model internals, + automated reviewing, etc.), to allow the use of automated agent foundations researchers with more confidence and less human feedback
the hard bits might also become much less hard—or prove to be irrelevant—if for at least some of them empirical feedback could be obtained by practicing on much more powerful models
Overall, my impression is that focusing on the feedback loops seems probably significantly more scalable today, and the results would probably be enough to allow to mostly defer the hard bits to automated research.
(Edit: but also, I think if funders and field builders ‘went hard’, it might be much less necessary to choose.)
Jack Clark: ‘Registering a prediction: I predict that within two years (by July 2026) we’ll see an AI system beat all humans at the IMO, obtaining the top score. Alongside this, I would wager we’ll see the same thing—an AI system beating all humans in a known-hard competition—in another scientific domain outside of mathematics. If both of those things occur, I believe that will present strong evidence that AI may successfully automate large chunks of scientific research before the end of the decade.’ https://importai.substack.com/p/import-ai-380-distributed-13bn-parameter
With research automation in mind, here’s my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs. Defining the relevant set of questions as being drawn from the same difficulty/diversity/open-endedness distribution that PhDs would have posed in early 2024.
With research automation in mind, here’s my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs.
Fwiw, I’ve kind of already noted myself starting to do some of this, for AI safety-related papers; especially after Claude-3.5 Sonnet came out.
I find it pretty wild that automating AI safety R&D, which seems to me like the best shot we currently have at solving the full superintelligence control/alignment problem, no longer seems to have any well-resourced, vocal, public backers (with the superalignment team disbanded).
On what basis do you think it’s the ‘best shot’? I used to think it was a good idea, a few years ago, but in retrospect I think that was just a computer scientist’s love of recursion. I don’t think that at present conditions are good for automating R&D. On the one hand, we have a lot of very smart people working on AI safety R&D, with very slow progress, indicating it is a hard problem. On the other hand, present-day LLMs are stupid at long-term planning, and acquiring new knowledge, which are things you need to be good at to do R&D.
What advantage do you see AIs having over humans in this area?
I think there will be a period in the future where AI systems (models and their scaffolding) exist which are sufficiently capable that they will be able to speed up many aspects of computer-based R&D. Including recursive-self-improvement, Alignment research and Control research. Obviously, such a time period will not be likely to last long given that surely some greedy actor will pursue RSI. So personally, that’s why I’m not putting a lot of faith in getting to that period [edit: resulting in safety].
I think that if you build the scaffolding which would make current models able to be substantially helpful at research (which would be impressively strong scaffolding indeed!), then you have built dual-use scaffolding which could also be useful for RSI. So any plans to do this must take appropriate security measures or they will be net harmful.
I don’t think that at present conditions are good for automating R&D.
I kind of wish this was true, because it would likely mean longer timelines, but my expectation is that the incoming larger LMs + better scaffolds + more inference-time compute could quite easily pass the threshold of significant algorithmic progress speedup from automation (e.g. 2x).
Making algorithmic progress and making safety progress seem to differ along important axes relevant to automation:
Algorithmic progress can use 1. high iteration speed 2. well-defined success metrics (scaling laws) 3.broad knowledge of the whole stack (Cuda to optimization theory to test-time scaffolds) 4. …
Alignment broadly construed is less engineering and a lot more blue skies, long horizon, and under-defined (obviously for engineering heavy alignment sub-tasks like jailbreak resistance, and some interp work this isn’t true).
Probably automated AI scientists will be applied to alignment research, but unfortunately automated research will differentially accelerate algorithmic progress over alignment. This line of reasoning is part of why I think it’s valuable for any alignment researcher (who can) to focus on bringing the under-defined into a well-defined framework. Shovel-ready tasks will be shoveled much faster by AI shortly anyway.
Algorithmic progress can use 1. high iteration speed 2. well-defined success metrics (scaling laws) 3.broad knowledge of the whole stack (Cuda to optimization theory to test-time scaffolds) 4. …
Alignment broadly construed is less engineering and a lot more blue skies, long horizon, and under-defined (obviously for engineering heavy alignment sub-tasks like jailbreak resistance, and some interp work this isn’t true).
Generally agree, but I do think prosaic alignment has quite a few advantages vs. prosaic capabilities (e.g. in the extra slides here) and this could be enough to result in aligned (-enough) automated safety researchers which can be applied to the more blue skies parts of safety research. I would also very much prefer something like a coordinated pause around the time when safety research gets automated.
This line of reasoning is part of why I think it’s valuable for any alignment researcher (who can) to focus on bringing the under-defined into a well-defined framework. Shovel-ready tasks will be shoveled much faster by AI shortly anyway.
Agree, I’ve written about (something related to) this very recently.
Yes, things have certainly changed in the four months since I wrote my original comment, with the advent of o1 and Sakana’s Artificial Scientist. Both of those are still incapable of full automation of self-improvement, but they’re close. We’re clearly much closer to a recursive speed up of R&D, leading to FOOM.
from https://jack-clark.net/2024/08/18/import-ai-383-automated-ai-scientists-cyborg-jellyfish-what-it-takes-to-run-a-cluster/…, commenting on https://arxiv.org/abs/2408.06292: ‘Why this matters – the taste of automated science: This paper gives us a taste of a future where powerful AI systems propose their own ideas, use tools to do scientific experiments, and generate results. At this stage, what we have here is basically a ‘toy example’ with papers of dubious quality and insights of dubious import. But you know where we were with language models five years ago? We had things that could barely write a paragraph. Now they can do this. I predict that by the summer of 2026 we will have seen at least one genuinely interesting research paper that was soup-to-nuts generated via a tool-using generative AI system.’
Yep. I find it pretty odd that alignment people say, with a straight face, “LLM agents are still pretty dumb so I don’t think that’s a path to AGI”. I thought the whole field was about predicting progress and thinking out ahead of it.
Progress does tend to happen when humans work on something.
Progress probably happens when parahumans work on things, too.
Who says those things? That doesn’t really sound like something that people say. Like, I think there are real arguments about why LLM agents might not be the most likely path to AGI, but “they are still pretty dumb, therefore that’s not a path to AGI” seems like obviously a strawman, and I don’t think I’ve ever seen it (or at least not within the last 4 years or so).
Fair enough; this sentiment is only mentioned offhand in comments and might not capture very much of the average opinion. I may be misestimating the community’s average opinion. I hope I am wrong, and I’m glad to see others don’t agree!
I’m a bit puzzled on the average attitude toward LLM agents as a route to AGI among alignment workers. I’m still surprised there aren’t more people working directly on aligning LLM agents. I’d think we’d be working harder on the most likely single type of first AGI if lots of us really believed it’s reasonably likely to get there (especially since it’s probably the fastest route if it doesn’t plateau soon).
One possible answer is that people are hoping that the large amount of work on aligning LLMs will cover aligning LLM agents. I think that work is helpful but not sufficient, so we need more thinking about agent alignment as distinct from their base LLMs. I’m currently writing a post on this.
(Most people in AI Alignment work at scaling labs and are therefore almost exclusively working on LLM alignment. That said, I don’t actually know what it means to work on LLM alignment over aligning other systems, it’s not like we have a ton of traction on LLM alignment, and most techniques and insights seem general enough to not be conditional specifically on LLMs)
I think Seth is distinguishing “aligning LLM agents” from “aligning LLMs”, and complaining that there’s insufficient work on the former, compared to the latter? I could be wrong.
I don’t actually know what it means to work on LLM alignment over aligning other systems
Ooh, I can speak to this. I’m mostly focused on technical alignment for actor-critic model-based RL systems (a big category including MuZero and [I argue] human brains). And FWIW my experience is: there are tons of papers & posts on alignment that assume LLMs, and with rare exceptions I find them useless for the non-LLM algorithms that I’m thinking about.
When we get outside technical alignment to things like “AI control”, governance, takeoff speed, timelines, etc., I find that the assumption of LLMs is likewise pervasive, load-bearing, and often unnoticed.
I complain about this from time to time, for example Section 4.2 here, and also briefly here (the bullets near the bottom after “Yeah some examples would be:”).
I agree with the claim that the techniques and insights for alignment that are usually considered are not conditional on LLMs specifically, including my own plan for AI alignment.
I recently gave a talk (slides) on some thoughts about what automating AI safety research might look like.
Some [earlier versions] of the ideas there were developed during my Astra Fellowship Winter ’24 with @evhub and through related conversations in Constellation.
On an apparent missing mood—FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely
Automated AI safety R&D could results in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post):
each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.
Despite this promise, we seem not to have much knowledge when such automated AI safety R&D might happen, in which order the relevant capabilities might appear (including compared to various dangerous capabilities), etc. AFAICT. we don’t seem to be trying very hard either, neither at prediction, nor at elicitation of such capabilities.
One potential explanation I heard for this apparent missing mood of FOMO on AI automated AI safety R&D is the idea / worry that the relevant automated AI safety R&D capabilities would only appear when models are already / too close to being dangerous. This seems plausible to me, but would still seems like a strange way to proceed given the uncertainty.
For contrast, consider how one might go about evaluating dangerous capabilities and trying to anticipate them ahead of time. From GDM’s recent Introducing the Frontier Safety Framework:
The Framework has three key components:
Identifying capabilities a model may have with potential for severe harm. To do this, we research the paths through which a model could cause severe harm in high-risk domains, and then determine the minimal level of capabilities a model must have to play a role in causing such harm. We call these “Critical Capability Levels” (CCLs), and they guide our evaluation and mitigation approach.
Evaluating our frontier models periodically to detect when they reach these Critical Capability Levels. To do this, we will develop suites of model evaluations, called “early warning evaluations,” that will alert us when a model is approaching a CCL, and run them frequently enough that we have notice before that threshold is reached.
Applying a mitigation plan when a model passes our early warning evaluations. This should take into account the overall balance of benefits and risks, and the intended deployment contexts. These mitigations will focus primarily on security (preventing the exfiltration of models) and deployment (preventing misuse of critical capabilities).
I see no reason why, in principle, a similar high-level approach couldn’t be analogously taken for automated AI safety R&D capabilities, yet almost nobody seems to be working on this, at least publicly (in fact, I am currently working on related topics; I’m still very surprised by the overall neglectedness, made even more salient by current events).
AI R&D and AI safety R&D will almost surely come at the same time.
Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
AI R&D and AI safety R&D will almost surely come at the same time.
Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
Seems like probably the modal scenario to me too, but even limited exceptions like the one you mention seem to me like they could be very important to deploy at scale ASAP, especially if they could be deployed using non-x-risky systems (e.g. like current ones, very bad at DC evals).
People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
This seems good w.r.t. automated AI safety potentially ‘piggybacking’, but bad for differential progress.
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
Sure, though wouldn’t this suggest at least focusing hard on (measuring / eliciting) what might not come at the same time?
mention seem to me like they could be very important to deploy at scale ASAP
Why think this is important to measure or that this already isn’t happening?
E.g., on the current model organism related project I’m working on, I automate inspecting reasoning traces in various ways. But I don’t feel like there is any particularly interesting thing going on here which is important to track (e.g. this tip isn’t more important than other tips for doing LLM research better).
Intuitively, I’m thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this ‘race’, corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to aligning / controlling superintelligence).
W.r.t. measurement, I think it would be good orthogonally to whether auto AI safety R&D is already happening or not, similarly to how e.g. evals for automated ML R&D seem good even if automated ML R&D is already happening. In particular, the information of how successful auto AI safety R&D would be (and e.g. what the scaling curves look like vs. those for DCs) seems very strategically relevant to whether it might be feasible to deploy it at scale, when that might happen, with what risk tradeoffs, etc.
To get somewhat more concrete, the Frontier Safety Framework report already proposes a (somewhat vague) operationalization for ML R&D evals, which (vagueness-permitting) seems straightforward to translate into an operationalization for automated AI safety R&D evals:
Machine Learning R&D level 1: Could significantly accelerate AI research at a cutting-edge lab if deployed widely, e.g.improving the pace of algorithmic progress by 3X, or comparably accelerate other AI research groups.
Machine Learning R&D level 2: Could fully automate the AI R&D pipeline at a fraction of human labor costs, potentially enabling hyperbolic growth in AI capabilities.
Epoch is one of my favorite orgs, but I expect many of the predictions in https://epochai.org/blog/interviewing-ai-researchers-on-automation-of-ai-rnd to be overconservative / too pessimistic. I expect roughly a similar scaleup in terms of compute as https://x.com/peterwildeford/status/1825614599623782490… - training runs ~1000x larger than GPT-4′s in the next 3 years—and massive progress in both coding and math (e.g. along the lines of the medians in https://metaculus.com/questions/6728/ai-wins-imo-gold-medal/…https://metaculus.com/questions/12467/ai-wins-ioi-gold-medal/…), which seem to me like some of the most important domains for automated ML research, both in terms of prerequisite capabilities, and for transfer reasons. Furthermore, along the lines of previous Epoch analysis in https://epochai.org/blog/trading-off-compute-in-training-and-inference…, I expect large chunks of ML research (e.g. post-training) to be differentially automatable by spending more inference compute, because they have relatively accurate and cheap proxy feedback signals (e.g. accuracy on various benchmarks). I also expect the ability to generate verifiable synthetic code + math data to train on to also contribute to math/code/reasoning capabilities more broadly (as is already rumored about Q*/Strawberry). And finally, I wouldn’t be too surprised if even the fuzzier parts of ML research workflows like ideation might be at least somewhat automatable, along the lines of https://sakana.ai/ai-scientist/, especially for ML research which takes relatively little compute to iterate (e.g. large parts of post-training).
The already-feasibility of https://sakana.ai/ai-scientist/ (with basically non-x-risky systems, sub-ASL-3, and bad at situational awareness so very unlikely to be scheming) has updated me significantly on the tractability of the alignment / control problem. More than ever, I expect it’s gonna be relatively tractable (if done competently and carefully) to safely, iteratively automate parts of AI safety research, all the way up to roughly human-level automated safety research (using LLM agents roughly-shaped like the AI scientist, with humans in the loop; in a control framework https://lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled…, ideally). Add some decent regulation (especially vs. uncontrolled intelligence explosions using AI scientists inside the largest AI labs) + paretotopian coordination https://aiprospects.substack.com/p/paretotopian-goal-alignment… (perhaps especially at the international level) and I think we might have all the basic ingredients of a ‘pivotal act’ in the best sense of the term. (Easier said than done, though).
I’d love to see at least several Sakana-like AI safety orgs, focused on using LLM agents to automate / strongly augment various parts of safety research. And ideally putting those tools (technically agents) in the hands of hundreds / thousands of (including independent) AI safety researchers. Unsure if the AI safety funding / field-building infrastructure is anywhere near ready for this level of potential scale-up, though. I expect quite a few safety research areas to be prime candidates for early automation / strong augmentation, especially the ones with relatively short and precise feedback loops (in both code and compute), e.g. unlearning, activation/representation engineering (and applications) and large parts of LLM fine-tuning (e.g. https://arxiv.org/abs/2406.08414 for a prototype).
I think currently approximately no one is working on the kind of safety research that when scaled up would actually help with aligning substantially smarter than human agents, so I am skeptical that the people at labs could automate that kind of work (given that they are basically doing none of it). I find myself frustrated with people talking about automating safety research, when as far as I can tell we have made no progress on the relevant kind of work in the last ~5 years.
when as far as I can tell we have made no progress on the relevant kind of work in the last ~5 years.
Can you give some examples of work which you do think represents progress?
My rough mental model is something like: LLMs are trained using very large-scale (multi-task) behavior cloning, so I expect various tasks to be increasingly automated / various data distributions to be successfully, all the way to (in the limit) at roughly everything humans can do, as a lower bound; including e.g. the distribution of what tends to get posted on LessWrong / the Alignment Forum (and agent foundations-like work); especially when LLMs are scaffolded into agents, given access to tools, etc. With the caveats that the LLM [agent] paradigm might not get there before e.g. it runs out of data / compute, or that the systems might be unsafe to use; but against this was where https://sakana.ai/ai-scientist/ was a particularly significant update for me; and ‘iteratively automate parts of AI safety research’ is also supposed to help with keeping systems safe as they become increasingly powerful.
Change my mind: outer alignment will likely be solved by default for LLMs. Brain-LM scaling laws (https://arxiv.org/abs/2305.11863) + LM embeddings as model of shared linguistic space for transmitting thoughts during communication (https://www.biorxiv.org/content/10.1101/2023.06.27.546708v1.abstract) suggest outer alignment will be solved by default for LMs: we’ll be able to ‘transmit our thoughts’, including alignment-relevant concepts (and they’ll also be represented in a [partially overlapping] human-like way).
I think the Corrigibility agenda, framed as “do what I mean, such that I will probably approve of the consequences, not just what I literally say such that our interaction will likely harm my goals” is more doable than some have made it out to be. I still think that there are sufficient subtle gotchas there that it makes sense to treat it as an area for careful study rather than “solved by default, no need to worry”.
And already some potential AI safety issues: ’We have noticed that The AI Scientist occasionally tries to increase its chance of success, such as modifying and launching its own execution script! We discuss the AI safety implications in our paper.
For example, in one run, it edited the code to perform a system call to run itself. This led to the script endlessly calling itself. In another case, its experiments took too long to complete, hitting our timeout limit. Instead of making its code run faster, it simply tried to modify its own code to extend the timeout period.′
Obviously relevant for automating safety research too; see this presentation and this comment for some related thoughts.
Quick take: I think LM agents to automate large chunks of prosaic alignment research should probably become the main focus of AI safety funding / person-time. I can’t think of any better spent marginal funding / effort at this time.
RSPs for automated AI safety R&D require rethinking RSPs
AFAICT, all current RSPs are only framed negatively, in terms of [prerequisites to] dangerous capabilities to be detected (early) and mitigated.
In contrast, RSPs for automated AI safety R&D will likely require measuring [prerequisites to] capabilities for automating [parts of] AI safety R&D, and preferentially (safely) pushing these forward. An early such examples might be safely automating some parts of mechanistic intepretability.
quick take: Against Almost Every Theory of Impact of Interpretability should be required reading for ~anyone starting in AI safety (e.g. it should be in the AGISF curriculum), especially if they’re considering any model internals work (and of course even more so if they’re specifically considering mech interp)
56% on swebench-lite with repeated sampling (13% above previous SOTA; up from 15.9% with one sample to 56% with 250 samples), with a very-below-SOTA model https://arxiv.org/abs/2407.21787; anything automatically verifiable (large chunks of math and coding) seems like it’s gonna be automatable in < 5 years.
(epistemic status: quick take, as the post category says)
Browsing though EAG London attendees’ profiles and seeing what seems like way too many people / orgs doing (I assume dangerous capabilities) evals. I expect a huge ‘market downturn’ on this, since I can hardly see how there would be so much demand for dangerous capabilities evals in a couple years’ time / once some notorious orgs like the AISIs build their sets, which many others will probably copy.
While at the same time other kinds of evals (e.g. alignment, automated AI safety R&D, even control) seem wildly neglected.
There’s still lots and lots of demand for regular capability evaluation, as we keep discovering new issues or LLMs keep tearing through them and rendering them moot, and the cost of creating a meaningful dataset like GPQA keeps skyrocketing (like ~$1000/item) compared to the old days where you could casually Turk your way to questions LLM would fail (like <$1/item). Why think that the dangerous subset would be any different? You think someone is going to come out with a dangerous-capabilities eval in the next year and then that’s it, it’s done, we’ve solved dangerous-capabilities eval, Mission Accomplished?
You think someone is going to come out with a dangerous-capabilities eval in the next year and then that’s it, it’s done, we’ve solved dangerous-capabilities eval, Mission Accomplished?
If it’s well designed and kept private, this doesn’t seem totally implausible to me; e.g. how many ways can you evaluate cyber capabilities to try to asses risks of weights exfiltration or taking over the datacenter (in a control framework)? Surely that’s not an infinite set.
But in any case, it seems pretty obvious that the returns should be quickly diminishing on e.g. the 100th set of DC evals vs. e.g. the 2nd set of alignment evals / 1st set of auto AI safety R&D evals.
It’s not an infinite set and returns diminish, but that’s true of regular capabilities too, no? And you can totally imagine new kinds of dangerous capabilities; every time LLMs gain a new modality or data source, they get a new set of vulnerabilities/dangers. For example, once Sydney went live, you had a whole new kind of dangerous capability in terms of persisting knowledge/attitudes across episodes of unrelated users by generating transcripts which would become available by Bing Search. This would have been difficult to test before, and no one experimented with it AFAIK. But after seeing indirect prompt injections in the wild and possible amplification of the Sydney personae, now suddenly people start caring about this once-theoretical possibility and might start evaluating it. (This is also a reason why returns don’t diminish as much, because benchmarks ‘rot’: quite aside from intrinsic temporal drift and ceiling issues and new areas of dangers opening up, there’s leakage, which is just as relevant to dangerous capabilities as regular capabilities—OK, you RLHFed your model to not provide methamphetamine recipes and this has become a standard for release, but it only works on meth recipes and not other recipes because no one in your org actually cares and did only the minimum RLHF to pass the eval and provide the execs with excuses, and you used off-the-shelf preference datasets designed to do the minimum, like Facebook releasing Llama… Even if it’s not leakage in the most literal sense of memorizing the exact wording of a question, there’s still ‘meta-leakage’ of overfitting to that sort of question.)
To expand on the point of benchmark rot, as someone working on dangerous capabilities evals… For biorisk specifically, one of the key things to eval is if the models can correctly guess the results of unpublished research. As in, can it come up with plausible hypotheses, accurately describe how to test those hypotheses, and make a reasonable guess at the most likely outcomes? Can it do these things at expert human level? superhuman level?
The trouble with this is that the frontier of published research keeps moving forward, so evals like this go out of date quickly. Nevertheless, such evals can be very important in shaping the decisions of governments and corporations.
I do agree that we shouldn’t put all our focus on dangerous capabilities evals at the expense of putting no focus on other kinds of evals (e.g. alignment, automated AI safety R&D, even control). However, I think a key point is that the models are dangerous NOW. Alignment, safety R&D, and control are, in some sense, future problems. Misuse is a present and growing danger, getting noticeably worse with every passing month. A single terrorist or terrorist org could wipe out human civilization today, killing >90% of the population with less than $500k funding (potentially much less if they have access to a well-equipped lab, and clever excuses ready for ordering suspicious supplies). We have no sufficient defenses. This seems like an urgent and tractable problem.
Urgent, because the ceiling on uplift is very far away. Models have the potential to make things much much worse than they currently are.
Tractable, because there are relatively cheap actions that governments could take to slow this increase of risk if they believed in the risks.
For what it’s worth, I try to spend some of my time thinking about these other types of evals also. And I would recommend that those working on dangerous capabilities evals spend at least a little time and thought on the other problems.
Another aspect of the problem
A lot of people seem to frequently be ‘trend following’ rather than ‘trend setting’ (by thinking original thoughts for themselves, doing their own research and coming to their own well-formed opinions). If those ‘trend followers’ are also not super high impact thinkers, maybe it’s ok if they’re just doing the obvious things?
I find it pretty ironic that many in AI risk mitigation would make asks for if-then committments/RSPs from the top AI capabilities labs, but they won’t make the same asks for AI safety orgs/funders. E.g.: if you’re an AI safety funder, what kind of evidence (‘if’) will make you accelerate how much funding you deploy per year (‘then’)?
One of these types of orgs is developing a technology with the potential to kill literally all of humanity. The other type of org is funding research that if it goes badly mostly just wasted their own money. Of course the demands for legibility and transparency should be different.
I think the journalists might have misinterpreted Sutskever, if the quote provided in the article is the basis for the claim about plateauing:
Ilya Sutskever … told Reuters recently that results from scaling up pre-training—the phase of training an AI model that uses a vast amount of unlabeled data to understand language patterns and structures—have plateaued.
“The 2010s were the age of scaling, now we’re back in the age of wonder and discovery once again. Everyone is looking for the next thing,” Sutskever said. “Scaling the right thing matters more now than ever.”
What he’s likely saying is that there are new algorithmic candidates for making even better use of scaling. It’s not that scaling LLM pre-training plateaued, but rather other things became available that might be even better targets for scaling. Focusing on these alternatives could be more impactful than focusing on scaling of LLM pre-training further.
He’s also currently motivated to air such implications, since his SSI only has $1 billion, which might buy a 25K H100s cluster, while OpenAI, xAI, and Meta recently got 100K H100s clusters (Google and Anthropic likely have that scale of compute as well, or will imminently).
If base model scaling has indeed broken down, I wonder how this manifests. Does the Chinchilla scaling law no longer hold beyond a certain size? Or does it still hold, but reduction in prediction loss no longer goes along with a proportional increase in benchmark performance? The latter could mean the quality of the (largely human generated) training data is the bottle neck.
the plausibility of this strategy, to ‘endlessly trade computation for better performance’ and then have very long/parallelized runs, is precisely one of the scariest aspects of automated ML; even more worrying that it’s precisely what some people in the field are gunning for, and especially when they’re directly contributing to the top auto ML scaffolds; although, everything else equal, it might be better to have an early warning sign, than to have a large inference overhang: https://x.com/zhengyaojiang/status/1844756071727923467
‘Krenn thinks that o1 will accelerate science by helping to scan the literature, seeing what’s missing and suggesting interesting avenues for future research. He has had success looping o1 into a tool that he co-developed that does this, called SciMuse. “It creates much more interesting ideas than GPT-4 or GTP-4o,” he says.’ (source; related: current underelicitation of auto ML capabilities)
I’ve been thinking hard about what my next step should be, after my job applications being turned down again by various safety orgs and Anthropic.
Now it seems clear to me. I have a vision of how I expect an RSI process to start, using LLMs to mine testable hypotheses from existing published papers.
I should just put my money where my mouth is, and try to build the scaffolding for this. I can then share my attempts with someone at Anthropic. If I’m wrong, I will be wasting my time and savings. If I’m right, I might be substantially helping the world. Seems like a reasonable bet.
I can then share my attempts with someone at Anthropic.
Alternately, collaborating/sharing with e.g. METR or UK AISI auto ML evals teams might be interesting. Maybe even Pallisade or similar orgs from a ‘scary demo’ perspective? @jacquesthibs might also be interested. I might also get to work on this or something related, depending on how some applications go.
I also expect Sakana, Jeff Clune’s group and some parts of the open-source ML community will try to push this, but I’m more uncertain at least in some of these cases about the various differential acceleration tradeoffs.
This is what I’ve been trying to tell people for the past couple years. There is undigested useful info and hypotheses buried in noise amidst published academic papers. I call this an ‘innovation overhang’. The models don’t need to be smart enough to come up with ideas, just smart enough to validate/find them admist the noise and then help set up experiments to test them.
It might be interesting to develop/put out RFPs for some benchmarks/datasets for unlearning of ML/AI knowledge (and maybe also ARA-relevant knowledge), analogously to WMDP for CBRN. This might be somewhat useful e.g. in a case where we might want to use powerful (e.g. human-level) AIs for cybersecurity, but we don’t fully trust them.
A further interesting direction for automated interpretability would be to build interpreter agents: AI scientists which given an SAE feature could create hypotheses about what the feature might do, come up with experiments that would distinguish between those hypotheses (for instance new inputs or feature ablations), and then repeat until the feature is well-understood. This kind of agent might be the first automated alignment researcher. Our early experiments in this direction have shown that we can substantially increase automated interpretability performance with an iterative refinement step, and we expect to be able to push this approach much further.
Seems like it would be great if inference scaling laws worked for this particular application.
I think it’s quite likely we’re already in crunch time (e.g. in a couple years’ time we’ll see automated ML accelerating ML R&D algorithmic progress 2x) and AI safety funding is *severely* underdeployed. We could soon see many situations where automated/augmented AI safety R&D is bottlenecked by lack of human safety researchers in the loop. Also, relying only on the big labs for automated safety plans seems like a bad idea, since the headcount of their safety teams seems to grow slowly (and I suspect much slower than the numbers of safety researchers outside the labs). Related: https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/.
A list of rough ideas of things I find potentially promising to do/fund: - RFP(s) for control agenda (e.g. for the long list here https://lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled#Appendix__A_long_list_of_control_techniques) - Scale up and decentralize some of the grantmaking, e.g. through regrantors RFP(s) for potentially very scalable agendas, e.g. applying/integrating automated research to various safety research agendas - RFPs / direct funding for coming up with concrete plans for what to do (including how to deploy funding) in very short timelines (e.g. couple of years); e.g. like https://sleepinyourhat.github.io/checklist/, but ideally made even more concrete (to the degree it’s not info-hazardous) - Offer more funding to help scale up MATS, Astra, etc. both in terms of number of mentors and mentees/mentor (if mentors are up for it); RFPs for building more similar programs - RFPs for entrepreneurship/building scalable orgs, and potentially even incubators for building such orgs, e.g. https://catalyze-impact.org/apply - Offer independent funding, but no promise of mentorship to promising-enough candidates coming out of field-building pipelines (e.g. MATS, Astra, AGISF, ML4Good, ARENA, AI safety camp, Athena).
Inference scaling laws should be a further boon to automated safety research, since they add another way to potentially spend money very scalably https://youtu.be/-aKRsvgDEz0?t=9647
I suspect (though with high uncertainty) there are some factors differentially advantaging safety research, especially prosaic safety research (often in the shape of fine-tuning/steering/post-training), which often requires OOMs less compute and more broadly seems like it should have faster iteration cycles; and likely even ‘human task time’ (e.g. see figures in https://metr.github.io/autonomy-evals-guide/openai-o1-preview-report/) probably being shorter for a lot of prosaic safety research, which makes it likely it will on average be automatable sooner. The existence of cheap, accurate proxy metrics might point the opposite way, though in some cases there seem to exist good, diverse proxies—e.g. Eight Methods to Evaluate Robust Unlearning in LLMs. More discussion in the extra slides of this presentation.
Related, I expect the delay from needing to build extra infrastructure to train much larger LLMs will probably differentially affect capabilities progress more, by acting somewhat like a pause/series of pauses; which it would be nice to exploit by enrolling more safety people to try to maximally elicit newly deployed capabilities—and potentially be uplifted—for safety research. (Anecdotally, I suspect I’m already being uplifted at least a bit as a safety researcher by using Sonnet.)
Also, I think it’s much harder to pause/slow down capabilities than to accelerate safety, so I think more of the community focus should go to the latter.
And for now, it’s fortunate that inference scaling means CoT and similarly differentially transparent (vs. model internals) intermediate outputs, which makes it probably a safer way of eliciting capabilities.
The intelligence explosion might be quite widely-distributed (not just inside the big labs), especially with open-weights LMs (and the code for the AI scientist is also open-sourced):
I think that would be really bad for our odds of surviving and avoiding a permanent suboptimal dictatorship, if the multipolar scenario continues up until AGI is fully RSI capable. That isn’t a stable equilibrium; the most vicious first mover tends to win and control the future. Some 17yo malcontent will wipe us out or become emperor for their eternal life. More logic in If we solve alignment, do we all die anyway? and the discussion there.
I think that argument will become so apparent that that scenario won’t be allowed to happen.
Having merely capable AGI widely available would be great for a little while.
I’m uncertain about all this, but here are some quick takes.
With respect to technical intent alignment, I think we’re very lucky that a lot of safety research will probably be automatable by non-x-risky systems (sub-ASL-3, very unlikely to be scheming because bad at prerequisites like situational awareness, often surprisingly transparent because they use CoT, tools, etc.). So I think we could be in a really good position, if we actually tried hard to use such systems for automated safety research, (for now it doesn’t seem to me like we’re trying all that hard as a community).
I’m even more uncertain about the governance side, especially about what should be done. I think open-weights LMs widely distributed intelligence explosions are probably really bad, so hopefully at least very powerful LMs don’t get open-weighted. Beyond this, though, I’m more unsure about more multipolar vs. more unipolar scenarios, given e.g. the potential lack of robustness of single points of failure. I’m somewhat hopeful that nation-level actors impose enough constraints/regulation at the national level, and then something like https://aiprospects.substack.com/p/paretotopian-goal-alignment happens at the international level. We might also just get somewhat lucky that compute constraints + economic and security incentives might mean that there never are more than e.g. 20 actors with ( at least direct, e.g. weights) access to very strong superintelligence.
I feel like 20-100 actors feels like a reasonable amount to coordinate on treaties. I think 300 starts to worry me that there’d be some crazy defector in the mix that takes risks which destroy themselves and everyone else. Just 2 or 3 actors, and I worry that there will be weird competitive tensions that make it hard to come to a settlement. I dunno, maybe I’m wrong about that, but it’s how I feel.
I’ve been writing a bit about some ideas around trying to establish a ‘council of guardians’, who are the major powers in the world. They would agree to mutual inspections and to collaborate on stopping the unauthorized development of rogue AI and self-replicating weapons.
I’ve been thinking about similar international solutions, so I look forward to seeing your thoughts on the matter.
My major concern is sociopathic people gaining the reins of power of just one of those AGIs, and defecting against that council of guardians. I think sociopaths are greatly overrepresented among powerful people; they care less about the downsides of having and pursuing power aggressively.
That’s why I’d think even 20 RSI-capable human-directed AGIs wouldn’t be stable for more than a decade.
Yeah, I see it as sort of a temporary transitional mode for humanity. I also don’t think it would be stable for long. I might give it 20-30 years, but I would be skeptical about it holding for 50 years.
I do think that even 10 years more to work on more fundamental solutions to the AGI transition would be hugely valuable though!
I have been attempting at least to imagine how to design a system assuming that all the actors will be selfish and tempted to defect (and possibly sociopathic, as power-holders sometimes are or become), but prevented from breaking the system. Defection-resistant mechanisms, where you just need a majority of the council to not defect in a given ‘event’ in order for them to halt and punish the defectors. And that hopefully making it obvious that this was the case, and that defection would get noticed and punished, would prevent even sociopathic power-holders from defecting.
This seems possible to accomplish, if the system is designed such that catching and punishing an attempt at defection has benefits for the enforcers which give higher expected value in their minds than the option of deciding to also defect once they detected someone else defecting.
Seems like a good problem to largely defer to AI though (especially if we’re assuming alignment in the instruction following sense), so maybe not the most pressing.
Unless there’s important factors about ‘order of operations’. By the time we have a powerful enough AI to solve this for us, it could be that someone is already defecting by using that AI to pursue recursive self-improvement at top speed…
I think that that is probably the case. I think we need to get the Council of Guardians in place and preventing defection before it’s too late, and irreversibly bad defection has already occurred.
I am unsure of exactly where the thresholds are, but I am confident that nobody else should be confident that there aren’t any risks! Our uncertainty should cause us to err on the side of putting in safe governance mechanisms ASAP!
We propose and run an LLM-driven discovery process to synthesize novel preference optimization algorithms.
We use this pipeline to discover multiple high-performing preference optimization losses. One such loss, which we call Discovered Preference Optimization (DiscoPOP), achieves state-of-the-art performance across multiple held-out evaluation tasks, outperforming Direct Preference Optimization (DPO) and other existing methods.
We provide an initial analysis of DiscoPOP, to discover surprising and counterintuitive features.
We open-source not only the tuned model checkpoints and discovered objective functions but also the codebase for running the discovery process itself on GitHub and HuggingFace.′
’Our general LLM-driven discovery approach can be grouped into three steps:
First, we prompt an LLM with an initial task and problem description. We optionally add examples or previous evaluations and their recorded performance to the initial prompt.
The LLM is tasked with outputting a hypothesis, a name for the new method, and a code implementation of their method. We use the code in an inner loop training run and store the downstream performance of interest.
Finally, we update the context of the LLM with the results from the newly evaluated performance.′
Looking at how much e.g. the UK (>300B$) or the US (>1T$) have spent on Covid-19 measures puts in perspective how little is still being spent on AI safety R&D. I expect fractions of those budgets (<10%), allocated for automated/significantly-augmented AI safety R&D, would obsolete all previous human AI safety R&D.
1. Unfortunately, I think it’s harder to convert government funding into novel research than one might expect. I think there are a limited number of competent thinkers who are sufficiently up to speed on the problem to contribute within the short remaining time before catastrophe. I do agree that more government funding would help a lot, and that I’d personally love to be given some funding! I do also agree that it would help a huge amount in the long term (> 20 years). In the short term (< 3 years) however, I don’t think that even a trillion dollars of government funding would result in AI safety R&D progress sufficient to exceed all previous human AI safety R&D. I also think that there’s decreasing returns to funding, as the available researchers get sufficiently motivated to switch to the topic, and have sufficient resources to supply their labs with compute and assistants. I think that in the current world you probably don’t get much for your 11th trillion. So yeah, I’d definitely endorse spending $10 trillion on AI safety R&D (although I do think there are ways this could be implemented which would be unhelpful or even counter-productive).
2. I think that exceeding previous AI safety R&D is very different from obsoleting it. Building on a foundation and reaching greater heights doesn’t make the foundation worthless. If you do think that the foundation is worthless, I’d be curious to hear your arguments, but that seems like a different train of thought entirely.
3. I think that there will be a critical period where there is sufficiently strong AI that augmented/automated AI safety R&D will be able to rapidly eclipse the existing body of work. I don’t think we are there yet, and I wouldn’t choose to accelerate AI capabilities timelines further to get us there sooner. I do think that having AI safety labs well-supplied with funding and compute is important, but I don’t think that any amount of money or compute currently buys the not-yet-existing AI research assistant level AI which could result in the start of the critical automated-research period. I worry that this level of AI capability will also enable recursive self-improvement of AI capabilities research, so I think that’s going to be a very dangerous time indeed.
Given the above, it might be high value to start testing integrating more interpretability tools into interpretability (V)LM agents like MAIA and maybe even considering randomized controlled trials to test for any productivity improvements they could already be providing.
For example, probing / activation steering workflows seem to me relatively short-horizon and at least somewhat standardized, to the point that I wouldn’t be surprised if MAIA could already automate very large chunks of that work (with proper tool integration). (Disclaimer: I haven’t done much probing / activation steering hands-on work [though I do follow such work quite closely and have supervised related projects], so my views here might be inaccurate).
While I’m not sure I can tell any ‘pivotal story’ about such automation, if I imagine e.g. 10x more research on probing and activation steering / year / researcher as a result of such automation, it still seems like it could be a huge win. Such work could also e.g. provide much more evidence (either towards or against) the linear representation hypothesis.
‘While the labs certainly perform much valuable alignment research, and definitely contribute a disproportionate amount per-capita, they cannot realistically hope to compete with the thousands of hobbyists and PhD students tinkering and trying to improve and control models. This disparity will only grow larger as more and more people enter the field while the labs are growing at a much slower rate. Stopping open-source ‘proliferation’ effectively amounts to a unilateral disarmament of alignment while ploughing ahead with capabilities at full-steam.
Thus, until the point at which open source models are directly pushing the capabilities frontier themselves then I consider it extremely unlikely that releasing and working on these models is net-negative for humanity’
‘Much capabilities work involves simply gathering datasets or testing architectures where it is easy to utilize other closed models referenced in pappers or through tacit knowledge of employees. Additionally, simple API access to models is often sufficient to build most AI-powered products rather than direct access to model internals. Conversely, such access is usually required for alignment research. All interpretability requires access to model internals almost by definition. Most of the AI control and alignment techniques we have invented require access to weights for finetuning or activations for runtime edits. Almost nothing can be done to align a model through access to the I/O API of a model at all. Thus it seems likely to me that by restricting open-source we differetially cripple alignment rather than capabilities. Alignment research is more fragile and dependent on deep access to models than capabilities research.’
Current open source models are not themselves any kind of problem. Their availability accelerates timelines, helps with alignment along the way. If there is no moratorium, this might be net positive. If there is a moratorium, it’s certainly net positive, as it’s the kind of research that the moratorium is buying time for, and it doesn’t shorten timelines because they are guarded by the moratorium.
It’s still irreversible proliferation even when the impact is positive. The main issue is open source as an ideology that unconditionally calls for publishing all the things, and refuses to acknowledge the very unusual situations where not publishing things is better than publishing things.
I believe from my work on dangerous capabilities evals that current open source models do provide some small amount of uplift to bad actors. This uplift is increasing much greater than linearly with each new more-capable open source model that is released. If we want to halt this irreversible proliferation before it gets so far that human civilization gets wiped out, we need to act fast on it.
Alignment research is important, but misalignment is not the only threat we face.
One thing that comes to mind is test time compute, and Figure 3 of Language Monkeys paper is quite concerning, where even Pythia-70M (with an “M”) is able to find signal on problems that at first glance are obviously impossible for it to make heads or tails of (see also). If there is an algorithmic unlock, a Llama-3-405B (or Llama-4) might suddenly get much more capable if fed a few orders of magnitude more inference compute than normal. So the current impression about model capabilities can be misleading about what they eventually enable, using future algorithms and still affordable amounts of compute.
Excellent point Vladimir. My team has been thinking a lot about this issue. What if somebody leaked the latest AlphaFold, and instructions on how to make good use of it? If you could feed the instructions into an existing open-source model, and get functional python code out to interact with the private AlphaFold API you set up… That’s a whole lot more dangerous than an LLM alone!
As the whole space of ‘biological design tools’ (h/t Anjali for this term to describe the general concept) gets more capable and complex, the uplift from an LLM that can help you navigate and utilize these tools gets more dangerous. A lot of these computational tools are quite difficult to use effectively for a layperson, yet an AI can handle them fairly easily if given the documentation.
Sure. My specific work is on biorisk evals. See WMDP.ai Closed API models leak a bit of biorisk info, but mainly aren’t that helpful for creating bioweapons (so far as I am able to assess).
Uncensored open-weight models are a whole different ball game. They can be made to be actively quite harm-seeking / terrorist-aligned, and also quite technically competent, such that they make researching and engaging in bioweapons creation substantially easier. The more capable the models get, the more reliable their information is, the higher quality their weapon design ideation is, the more wetlab problems they are able to help you troubleshoot, the more correct their suggestions for getting around government controls and detection methods… And so on. I am saying that this capability is increasing non-linearly in terms of expected risk/harms as open-weight model capabilities increase.
Part of this is that they cross some accuracy/reliability threshold on increasingly tricky tasks where you can at least generate a bunch of generations and pick the most agree-upon idea. Whereas, if you see that there is correct advice given, but only around 5% of the time, then you can be pretty sure that someone who didn’t know that that was the correct advice would know how to pick out the correct advice. As soon as the most frequent opinion is correct, you are in a different sort of hazard zone (e.g. imagine if a single correct idea came up 30% of the time to a particular question, and all the other ideas, most of them incorrect, came up 5% of the time each).
Also it matters a lot whether the model routinely fails at ‘critical fail steps’ versus ‘cheap retry steps’. There are a lot of tricky wetlab steps where you can easily fail, and the model can provide only limited help, but the reason for failure is clear after a bit of internet research and doesn’t waste much in the way of resources. Such ‘noncritical failures’ are very different from ‘critical failures’ such as failing to obfuscate your internet purchases such that a government agency detects your activity and decides to investigate you more closely.
Thanks! How optimistic/excited would you be about research in the spirit of Tamper-Resistant Safeguards for Open-Weight LLMs, especially given that banning open-weight models seems politically unlikely, at least for now?
Extremely excited about the idea of such research succeeding in the near future! But skeptical that it will succeed in time to be at all relevant. So my overall expected value for that direction is low.
Also, I think there’s probably a very real risk that the bird has already flown the coop on this. If you can cheaply modify existing open-weight models to be ‘intent-aligned’ with terrorists, and to be competent at using scaffolding that you have built around ‘biological design tools’… then the LLM isn’t really a bottleneck anymore. The irreversible proliferation has occurred already. I’m not certain this is the case, but I’d give it about 75%.
So then you need to make sure that better biological design tools don’t get released, and that more infohazardous virology papers don’t get published, and that wetlab automation tech doesn’t get better, and… the big one.… that nobody releases an open-weight LLM so capable that it can successfully create tailor-made biological design tools. That’s a harder thing to censor out of an LLM than getting it to directly not help with biological weapons! Creation of biological design tools touches on a lot more things, like its machine learning knowledge and coding skill. What exactly do you censor to make a model helpful at building purely-good tools but not at building tools which have dual-use?
Basically, I think it’s a low-return area entirely. I think humanity’s best bet is in generalized biodefenses, plus an international ‘Council of Guardians’ which use strong tool-AI to monitor the entire world and enforce a ban on:
a) self-replicating weapons (e.g. bioweapons, nanotech)
b) unauthorized recursive-self-improving AI
Of these threats only bioweapons are currently at large. The others are future threats.
And there are theoretical results suggesting some latent factors can be identified using multimodality (all the following could be intepretable as different modalities—multiple brain recording modalities, animals, sessions, species, brains-ANNs), while being provably unindentifiable without the multiple modality—e.g. Identifiability Results for Multimodal Contrastive Learning (and results on nonlinear ICA in single-modal vs. multi-modal settings reviewed in section 2.1). This might a way to bypass single-model interpretability difficulties, by e.g. ‘comparing’ to brains or to other models.
‘in ‘all or none...’ they mention ‘the general form of next-token predictors used in our analysis.‘; if we assume that the language network (in biological brains) is indeed a next-token predictor (as e.g. the next-word prediction results in ‘shared principles...’ suggest), shouldn’t this suggest that the linear properties should then also apply to biological brains, not just LLMs?
This is a very insightful observation! Let’s analyze this:
The key logical chain would be:
1. The “All or None” paper’s results apply to next-token predictors of the form: ``` p(xt+1|x1:t) = exp(f(x1:t)ᵀg(xt+1))/Z(x1:t) ``` where f maps sequences to embeddings and g maps tokens to unembeddings
2. The “Shared Computational Principles” paper provides evidence that the human brain: - Engages in next-word prediction before word onset - Uses context to predict next tokens - Shows surprise signals consistent with prediction error - Represents words in ways that can be mapped to vector spaces
3. If the brain’s language network is implementing a next-token predictor of this general form, then theoretically: - The same mathematical framework should apply - The same identifiability results should hold - The same “all or none” properties of linear features should be present
However, there are some important considerations:
1. Implementation Level: - While the brain might implement this type of predictor at a computational level, the neural implementation could add constraints or properties not captured by the mathematical framework - The brain might use a more complex or different form of next-token prediction
2. Verification Challenges: - It’s harder to verify if the brain’s implementation exactly matches this mathematical form - We can’t directly observe all components like f and g in biological systems
3. Scale/Scope: - The brain might implement multiple prediction systems operating at different scales - The mathematical framework might apply to some but not all aspects of neural language processing
So while your reasoning seems sound—if the brain implements this type of next-token predictor, the mathematical results should apply—verifying this empirically would be challenging. It’s a fascinating theoretical bridge between the mathematical analysis and biological implementation.′
Edit (20 Jul): These estimates erroneously use the sparse FP8 tensor performance for H100s (4 petaFLOP/s), which is 2 times higher than far more relevant dense FP8 tensor performance (2 petaFLOP/s). But with a Blackwell GPU, the relevant dense FP8 performance is 5 petaFLOP/s, which is close to 4 petaFLOP/s, and the cost and power per GPU within a rack are also similar. So the estimates approximately work out unchanged when reading “Blackwell GPU” instead of “H100″.
Thanks! I do wonder if he might not mean $1 billion total cost (e.g. to buy the hardware); because he also claims a $10 billion run might start in 2025, which seems quite surprising?
The $100 million figure is used in the same sentence for cost of currently deployed models. Original GPT-4 was probably trained on A100s in BF16 (A100s can’t do FP8 faster), which is 6e14 FLOP/s, 7 times less than 4e15 FLOP/s in FP8 from an H100 (there is no change in quality of trained models when going from BF16 to FP8, as long as training remains stable). With A100s in BF16 at 30% utilization for 150 days, you need 9K A100s to get 2e25 FLOPs. Assuming $30K per A100 together with associated infrastructure, the cluster would cost $250 million, but again assuming $2 per hour, the time would only cost $60 million. This is 2022, deployed in early 2023. I expect recent models to cost at least somewhat more, so for early 2024 frontier models $100 million would be solidly cost of time, not cost of infrastructure.
The $1 billion for cost of time suggests ability to train on multiple clusters, and Gemini 1.0 report basically says they did just that. So the $10 billion figure needs to be interpreted as being about scale of many clusters taken together, not individual clusters. The estimate for training on H100s for 200 days says you need 150 megawatts for $1 billion in training time, or 1.5 gigawatts for $10 billion in training time. And each hyperscaler has datacenters that consume 2-3 gigawatts in total (they are much smaller individually) with current plans to double. So at least the OOMs match the $10 billion claim interpreted as cost of training time.
Edit (20 Jul): These estimates erroneously use the sparse FP8 tensor performance for H100s (4 petaFLOP/s), which is 2 times higher than far more relevant dense FP8 tensor performance (2 petaFLOP/s). But with a Blackwell GPU, the relevant dense FP8 performance is 5 petaFLOP/s, which is close to 4 petaFLOP/s, and the cost and power per GPU within a rack are also similar. So the estimates approximately work out unchanged when reading “Blackwell GPU” instead of “H100″.
Decomposability seems like a fundamental assumption for interpretability and condition for it to succeed. E.g. from Toy Models of Superposition:
’Decomposability: Neural network activations which are decomposable can be decomposed into features, the meaning of which is not dependent on the value of other features. (This property is ultimately the most important – see the role of decomposition in defeating the curse of dimensionality.) [...]
The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter (non-superposition and basis-aligned) are properties we believe only sometimes occur.′
If this assumption is true, it seems favorable to the prospects of safely automating large parts of interpretability work (e.g. using [V]LM agents like MAIA) differentially sooner than many other alignment research subareas and than the most consequential capabilities research (and likely before AIs become x-risky, e.g. before they pass many dangerous capabilities evals). For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be much shorter horizon than e.g. coming up with a new conceptual alignment agenda or coming up with a new ML architecture (as well as having much faster feedback loops than e.g. training a SOTA LM using a new architecture).
For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be much shorter horizon than e.g. coming up with a new conceptual alignment agenda or coming up with a new ML architecture (as well as having much faster feedback loops than e.g. training a SOTA LM using a new architecture).
Short-horizon tasks (e.g., fixing a problem on a Linux machine or making a web server) were those that would take less than 1 hour, whereas long-horizon tasks (e.g., building a web app or improving an agent framework) could take over four (up to 20) hours for a human to complete.
[...]
The Purple and Blue models completed 20-40% of short-horizon tasks but no long-horizon tasks. The Green model completed less than 10% of short-horizon tasks and was not assessed on long-horizon tasks3. We analysed failed attempts to understand the major impediments to success. On short-horizon tasks, models often made small errors (like syntax errors in code). On longer horizon tasks, models devised good initial plans but did not sufficiently test their solutions or failed to correct initial mistakes. Models also sometimes hallucinated constraints or the successful completion of subtasks.
Summary: We found that leading models could solve some short-horizon tasks, such as software engineering problems. However, no current models were able to tackle long-horizon tasks.
E.g. to the degree typical probing / activation steering work might often involve short 1-hour-horizons, it might be automatable differentially soon; e.g. from Steering GPT-2-XL by adding an activation vector:
For example, we couldn’t find a “talk in French” steering vector within an hour of manual effort.
You’ll often be able to use the model to advance lots of other forms of science and technology, including alignment-relevant science and technology, where experiment and human understanding (+ the assistance of available AI tools) is enough to check the advances in question.
Interpretability research seems like a strong and helpful candidate here, since many aspects of interpretability research seem like they involve relatively tight, experimentally-checkable feedback loops. (See e.g. Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie.)
Quote from Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie:
A quite early example of this is Collin Burn’s work, doing unsupervised identification of some aspects of a neural network that are correlated with things being true or false. I think that is important work. It’s a kind of obvious direction for the stuff to go. You can keep improving it when you have AIs that you’re training to do their best to deceive humans or other audiences in the face of the thing and you can measure whether our lie detectors break down. When we train our AIs to tell us the sky is green in the face of the lie detector and we keep using gradient descent on them, do they eventually succeed? That’s really valuable information to know because then we’ll know our existing lie detecting systems are not actually going to work on the AI takeover and that can allow government and regulatory response to hold things back. It can help redirect the scientific effort to create lie detectors that are robust and that can’t just be immediately evolved around and we can then get more assistance. Basically the incredibly juicy ability that we have working with the AIs is that we can have as an invaluable outcome that we can see and tell whether they got a fast one past us on an identifiable situation. Here’s an air gap computer, you get control of the keyboard, you can input commands, can you root the environment and make a blue banana appear on the screen? Even if we train the AI to do that and it succeeds. We see the blue banana, we know it worked. Even if we did not understand and would not have detected the particular exploit that it used to do it. This can give us a rich empirical feedback where we’re able to identify things that are even an AI using its best efforts to get past our interpretability methods, using its best efforts to get past our adversarial examples.
’Could we do concept algebra directly on the fMRI of the higher visual cortex? Yes, in principle, it should be possible to perform concept algebra directly on the fMRI signals from the higher visual cortex. The key idea would be to identify subspaces in the fMRI signal space that correspond to specific semantic concepts, and then manipulate these subspaces to control the reconstructed image. Here’s a hypothetical workflow:
Collect fMRI data while the subject views a large set of images that vary along certain semantic dimensions (e.g., gender, emotion, object category, scene type, etc.).
For each semantic dimension, identify a set of fMRI samples that differ only along that dimension. For example, for the gender dimension, you might compare the fMRI signals for “man” vs “woman”, keeping other aspects of the image content constant.
Use these samples to estimate a “semantic subspace” for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.
To reconstruct an image, first predict the full fMRI signal from the early visual cortex (similar to predicting z in Takagi et al.).
Then, to manipulate a specific semantic concept, project the predicted fMRI signal onto the orthogonal complement of the corresponding semantic subspace (to remove the original semantic information), and add the desired semantic component from a reference fMRI signal (similar to the concept algebra manipulation of z_c).
Finally, use this modified fMRI signal from the higher visual cortex to condition the diffusion model, reconstructing an image with the desired semantic manipulation.
The main challenges would be:
Collecting a sufficiently large and diverse fMRI dataset to robustly estimate the semantic subspaces.
Ensuring that the estimated subspaces are truly orthogonal and capture the desired semantic variation.
Translating the manipulated fMRI signals into a format that can effectively condition the diffusion model.
Nonetheless, this is an intriguing direction that could potentially unite the biological understanding from Takagi et al.‘s work with the interpretable semantic manipulation of concept algebra. It suggests a way to manipulate images directly from brain-space, without the need for text prompts. This could provide insights into how the brain itself represents and manipulates semantic concepts.’
’How might the above be relevant to the AI alignment problem? The hypothetical approach of performing concept algebra directly on fMRI signals from the higher visual cortex could potentially offer some interesting insights relevant to the AI alignment problem:
Understanding AI systems: By comparing the semantic subspaces identified in fMRI signals with those found in the latent spaces of AI models (like the diffusion model in concept algebra), we could gain insights into how AI systems represent and manipulate semantic concepts, and how this compares to the human brain. This could help in designing AI systems that reason in ways more analogous to humans, which could be beneficial for alignment.
Interpretability and control: The ability to identify and manipulate specific semantic concepts directly in the space of brain activity could provide a more direct and interpretable way of controlling the output of AI systems. If we can align the semantic subspaces of the AI with those of the human brain, it might become possible to control the AI’s outputs by manipulating these subspaces, leading to more interpretable and controllable AI systems.
Value alignment: By understanding how human values and preferences are represented in the brain, and how they relate to the semantic concepts identified through concept algebra, we might gain insights into how to align AI systems with human values. If we can identify the neural subspaces corresponding to human values and ethics, we could potentially use this information to constrain the AI’s outputs to align with these values.
Cognitive modeling: This approach could also contribute to the development of more accurate cognitive models of human reasoning and decision-making. By understanding how semantic concepts are represented and manipulated in the brain, we can build AI systems that more closely mimic human cognition, which could be useful for predicting and understanding human behavior in the context of AI alignment.
Neuro-symbolic AI: The idea of manipulating semantic concepts directly in the space of brain activity is reminiscent of neuro-symbolic AI approaches, which aim to combine the strengths of neural networks and symbolic reasoning. Insights from this fMRI-based concept algebra could potentially inform the development of neuro-symbolic AI systems that are more interpretable and aligned with human cognition.
However, it’s important to note that these are speculative connections, and there would be significant technical and ethical challenges in realizing this approach. Nonetheless, exploring the parallels between the structure of human thought and the structure of AI representations could be a valuable direction for informing AI alignment research.′
’How would this be relevant in the context of this proposal? [note: the proposal beingActivation vector steering with BCI] The proposed concept algebra on fMRI signals could be highly relevant to the goals outlined in this funding proposal. Specifically:
The proposal aims to find mappings between human brain states and the latent representations of language models. Applying concept algebra directly on fMRI signals, as discussed, could provide a more direct and interpretable way to establish these mappings. By identifying semantic subspaces in the fMRI signals and aligning them with subspaces in the language model’s latent space, we could potentially “steer” the language model to exhibit brain states similar to specific people.
The proposal suggests that reward models could be trained to reproduce humanlike brain states during evaluation, to improve their generalization. If concept algebra on fMRI can successfully manipulate semantic concepts in the reconstructed outputs (e.g., text or images), this could provide a way to directly optimize the reward models to produce outputs that match desired brain states.
The proposal aims to advance scientific understanding of analogies between LLM behavior and human behavior. Comparing the semantic subspaces identified in fMRI with those in LLMs, as enabled by concept algebra, could provide insights into these analogies at a more fundamental, representational level.
The proposed fMRI experiments could potentially provide the necessary data for attempting concept algebra on brain signals. The proposal includes plans for designing and implementing fMRI data-collection protocols, recruiting subjects, and administering the experiments. This data could be used to estimate the semantic subspaces needed for concept algebra.
The ML experiments proposed, including the fMRI feature extraction pipeline and architecture modifications, could potentially be adapted to implement concept algebra on the fMRI signals, in addition to the original goal of mapping brain states to LLM representations.
In summary, while the original proposal does not explicitly mention concept algebra, the ideas discussed around applying concept algebra to fMRI signals could potentially enhance and extend the proposal’s goals of connecting human brain states with AI representations, improving reward model generalization, and advancing scientific understanding of human-AI analogies. The proposed fMRI experiments could provide the necessary data, and the planned ML experiments could be extended to include concept algebra on the brain signals.′
’Above you wrote: ‘Use these samples to estimate a “semantic subspace” for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.’ What if instead I just took a large diverse dataset of fMRI of images of men and fMRI of images of women, took the means of the two, substracted them, and used that as the sex direction in concept algebra? Your proposed approach of taking the difference between the mean fMRI signals for men and women to define the “sex direction” is a great idea and could indeed work as a simple alternative to more complex methods like PCA or LDA. Here’s why this approach makes sense:
If the fMRI signals for men and women differ primarily along a single direction in the high-dimensional fMRI space, then the difference between their means will point along this direction.
By using a large and diverse dataset, you’re essentially averaging out other sources of variation (e.g., individual differences, noise), leaving only the consistent difference between men and women.
This is analogous to how concept algebra identifies concept directions in the latent space of the diffusion model: by taking the difference between the latent representations of prompts that differ only in the target concept.
To implement this:
Collect a large dataset of fMRI signals for images of men and women, ensuring diversity in other aspects (age, race, emotion, background, etc.).
Compute the mean fMRI signal for men (m_men) and women (m_women) separately.
Compute the “sex direction” as: v_sex = m_women—m_men.
To manipulate the sex of a reconstructed image, you could then project the fMRI signal onto the orthogonal complement of v_sex (to remove the original sex information), and then add or subtract a scaled version of v_sex to introduce the desired sex information.
The main advantages of this approach over PCA or LDA are simplicity and interpretability. It doesn’t require any complex eigendecompositions or model fitting, and the resulting direction has a clear interpretation as the difference between men and women. Potential limitations to keep in mind:
If there are multiple independent directions along which men and women differ, this approach will only capture their average, not the individual directions.
If there are significant interactions between sex and other factors (e.g., if the sex difference is itself different for different ages or emotions), this simple difference may not fully capture the sex effect.
The approach assumes linearity—that the sex effect is a simple additive shift in fMRI space. If there are nonlinear interactions, this linear shift may not fully capture the effect.
Despite these potential limitations, I think this is a promising and elegantly simple approach to try for implementing concept algebra directly on fMRI signals. It’s a great starting point that could be refined with more sophisticated methods if needed.′
Also positive update for me on interdisciplinary conceptual alignment being automatable differentially soon; which seemed to me for a long time plausible, since LLMs have ‘read the whole internet’ and interdisciplinary insights often seem (to me) to require relatively small numbers of inferential hops (plausibly because it’s hard for humans to have [especially deep] expertise in many different domains), making them potentially feasible for LLMs differentially early (reliably making long inferential chains still seems among the harder things for LLMs).
More reasons to believe that studying empathy in rats (which should be much easier than in humans, both for e.g. IRB reasons, but also because smaller brains, easier to get whole connectomes, etc.) could generalize to how it works in humans and help with validating/implementing it in AIs (I’d bet one can already find something like computational correlates in e.g. GPT-4 and the correlation will get larger with scale a la https://arxiv.org/abs/2305.11863) https://twitter.com/e_knapska/status/1722194325914964036
I expect large parts of interpretability work could be safely automatable very soon (e.g. GPT-5 timelines) using (V)LM agents; see A Multimodal Automated Interpretability Agent for a prototype.
Given the potential scalability of automated interp, I’d be excited to see plans to use large amounts of compute on it (including e.g. explicit integrations with agendas like superalignment or control; for example, given non-dangerous-capabilities, MAIA seems framable as a ‘trusted’ model in control terminology).
@the gears to ascension I see you reacted “10%” to the phrase “while (overwhelmingly likely) being non-scheming” in the context of the GPT-4V-based MAIA.
Does that mean you think there’s a 90% chance that MAIA, as implemented, today is actually scheming? If so that seems like a very bold prediction, and I’d be very interested to know why you predict that. Or am I misunderstanding what you mean by that react?
ah, I got distracted before posting the comment I was intending to: yes, I think GPT4V is significantly scheming-on-behalf-of-openai, as a result of RLHF according to principles that more or less explicitly want a scheming AI; in other words, it’s not an alignment failure to openai, but openai is not aligned with human flourishing in the long term, and GPT4 isn’t either. I expect GPT4 to censor concepts that are relevant to detecting this somewhat. Probably not enough to totally fail to detect traces of it, but enough that it’ll look defensible, when a fair analysis would reveal it isn’t.
It seems to me like the sort of interpretability work you’re pointing at is mostly bottlenecked by not having good MVPs of anything that could plausibly be directly scaled up into a useful product as opposed to being bottlenecked on not having enough scale.
So, insofar as this automation will help people iterate faster fair enough, but otherwise, I don’t really see this as the bottleneck.
Yeah, I’m unsure if I can tell any ‘pivotal story’ very easily (e.g. I’d still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
Noteably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of internals.
I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
Noteably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of things.
This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections in ‘Catching AIs red-handed’), but I think not using any internals might be overconservative / might increase the monitoring / safety tax too much (I think this is probably true more broadly of the current control agenda framing).
Hey Bogdan, I’d be interested in doing a project on this or at least putting together a proposal we can share to get funding.
I’ve been brainstorming new directions (with @Quintin Pope) this past week, and we think it would be good to use/develop some automated interpretability techniques we can then apply to a set of model interventions to see if there are techniques we can use to improve model interpretability (e.g. L1 regularization).
I saw the MAIA paper, too; I’d like to look into it some more.
Anyway, here’s a related blurb I wrote:
Project: Regularization Techniques for Enhancing Interpretability and Editability
Explore the effectiveness of different regularization techniques (e.g. L1 regularization, weight pruning, activation sparsity) in improving the interpretability and/or editability of language models, and assess their impact on model performance and alignment. We expect we could apply automated interpretability methods (e.g. MAIA) to this project to test how well the different regularization techniques impact the model.
In some sense, this research is similar to the work Anthropic did with SoLU activation functions. Unfortunately, they needed to add layer norms to make the SoLU models competitive, which seems to have hide away the superposition in other parts of the network, making SoLU unhelpful for making the models more interpretable.
That said, we can increase our ability to interpret these models through regularization techniques. A technique like L1 regularization should help because it encourages the model to learn sparse representations by penalizing non-zero weights or activations. Sparse models tend to be more interpretable as they rely on a smaller set of important features.
Whether this works or not, I’d be interested in making more progress on automated interpretability, in the similar ways you are proposing.
This seems differentially good for safety (e.g. vs. models with larger forward passes but shorter context windows to achieve the same perplexity), since longer context and in-context learning are differentially transparent.
A brief list of resources with theoretical results which seem to imply RL is much more (e.g. sample efficiency-wise) difficult than IL—imitation learning (I don’t feel like I have enough theoretical RL expertise or time to scrutinize hard the arguments, but would love for others to pitch in). Probably at least somewhat relevant w.r.t. discussions of what the first AIs capable of obsoleting humanity could look like:
Paper: Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? (quote: ‘This work shows that, from the statistical viewpoint, the situation is far subtler than suggested by the more traditional approximation viewpoint, where the requirements on the representation that suffice for sample efficient RL are even more stringent. Our main results provide sharp thresholds for reinforcement learning methods, showing that there are hard limitations on what constitutes good function approximation (in terms of the dimensionality of the representation), where we focus on natural representational conditions relevant to value-based, model-based, and policy-based learning. These lower bounds highlight that having a good (value-based, model-based, or policy-based) representation in and of itself is insufficient for efficient reinforcement learning, unless the quality of this approximation passes certain hard thresholds. Furthermore, our lower bounds also imply exponential separations on the sample complexity between 1) value-based learning with perfect representation and value-based learning with a good-but-not-perfect representation, 2) value-based learning and policy-based learning, 3) policy-based learning and supervised learning and 4) reinforcement learning and imitation learning.’)
Probably at least somewhat relevant w.r.t. discussions of what the first AIs capable of obsoleting humanity could look like.
I’d bet against any of this providing interesting evidence beyond basic first principles arguments. These types of theory results never seem to add value on top of careful reasoning from my experience.
Hmm, unsure about this. E.g. the development models of many in the alignment community before GPT-3 (often heavily focused on RL or even on GOFAI) seem quite substantially worse in retrospect than those of some of the most famous deep learning people (e.g. LeCun’s cake); of course, this may be an unfair/biased comparison using hindsight. Unsure how much theory results were influencing the famous deep learners (and e.g. classic learning theory results would probably have been misleading), but doesn’t seem obvious they had 0 influence? For example, Bengio has multiple at least somewhat conceptual / theoretical (including review) papers motivating deep/representation learning; e.g. Representation Learning: A Review and New Perspectives.
I think Paul looks considerably better in retrospect than famous DL people IMO. (Partially via being somewhat more specific, though still not really making predictions.)
I’m skeptical hard theory had much influence on anyone though. (In this domain at least.)
Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity (‘We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension. However, the representation of the optimal policy and optimal value proves to be NP-complete and unattainable by constant-layer MLPs with polynomial size. This demonstrates a significant representation complexity gap between model-based RL and model-free RL, which includes policy-based RL and value-based RL. To further explore the representation complexity hierarchy between policy-based RL and value-based RL, we introduce another general class of MDPs where both the model and optimal policy can be represented by constant-depth circuits with polynomial size or constant-layer MLPs with polynomial size. In contrast, representing the optimal value is P-complete and intractable via a constant-layer MLP with polynomial hidden dimension. This accentuates the intricate representation complexity associated with value-based RL compared to policy-based RL. In summary, we unveil a potential representation complexity hierarchy within RL—representing the model emerges as the easiest task, followed by the optimal policy, while representing the optimal value function presents the most intricate challenge.’).
On Representation Complexity of Model-based and Model-free Reinforcement Learning (‘We prove theoretically that there exists a broad class of MDPs such that their underlying transition and reward functions can be represented by constant depth circuits with polynomial size, while the optimal Q-function suffers an exponential circuit complexity in constant-depth circuits. By drawing attention to the approximation errors and building connections to complexity theory, our theory provides unique insights into why model-based algorithms usually enjoy better sample complexity than model-free algorithms from a novel representation complexity perspective: in some cases, the ground-truth rule (model) of the environment is simple to represent, while other quantities, such as Q-function, appear complex. We empirically corroborate our theory by comparing the approximation error of the transition kernel, reward function, and optimal Q-function in various Mujoco environments, which demonstrates that the approximation errors of the transition kernel and reward function are consistently lower than those of the optimal Q-function. To the best of our knowledge, this work is the first to study the circuit complexity of RL, which also provides a rigorous framework for future research.’).
Demonstration-Regularized RL (‘Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL’s sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using NE expert demonstrations enables the identification of an optimal policy at a sample complexity of order O˜(Poly(S,A,H)/(ε^2 * N^E)) in finite and O˜(Poly(d,H)/(ε^2 * N^E)) in linear Markov decision processes, where ε is the target precision, H the horizon, A the number of action, S the number of states in the finite case and d the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behaviour cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works.’).
Limitations of Agents Simulated by Predictive Models (‘There is increasing focus on adapting predictive models into agent-like systems, most notably AI assistants based on language models. We outline two structural reasons for why these models can fail when turned into agents. First, we discuss auto-suggestive delusions. Prior work has shown theoretically that models fail to imitate agents that generated the training data if the agents relied on hidden observations: the hidden observations act as confounding variables, and the models treat actions they generate as evidence for nonexistent observations. Second, we introduce and formally study a related, novel limitation: predictor-policy incoherence. When a model generates a sequence of actions, the model’s implicit prediction of the policy that generated those actions can serve as a confounding variable. The result is that models choose actions as if they expect future actions to be suboptimal, causing them to be overly conservative. We show that both of those failures are fixed by including a feedback loop from the environment, that is, re-training the models on their own actions. We give simple demonstrations of both limitations using Decision Transformers and confirm that empirical results agree with our conceptual and formal analysis. Our treatment provides a unifying view of those failure modes, and informs the question of why fine-tuning offline learned policies with online learning makes them more effective.’).
I’m not aware of anybody currently working on coming up with concrete automated AI safety R&D evals, while there seems to be so much work going into e.g. DC evals or even (more recently) scheminess evals. This seems very suboptimal in terms of portfolio allocation.
Thanks! AFAICT though, the link you posted seems about automated AI capabilities R&D evals, rather than about automated AI safety / alignment R&D evals (I do expect transfer between the two, but they don’t seem like the same thing). I’ve also chatted to some people from both METR and UK AISI and got the impression from all of them that there’s some focus on automated AI capabilities R&D evals, but not on safety.
Can you give a concrete example of a safety property of the sort that are you envisioning automated testing for? Or am I misunderstanding what you’re hoping to see?
Spicy take: good evals for automated ML R&D should (also) cover for what’s in the attached picture (and try hard at elicitation in this rough shape). AFAIK, last time I looked at the main (public) proposals, they didn’t seem to. Picture from https://x.com/RobertTLange/status/1829104918214447216.
’The paper uses the example of studying deception in language models (LLMs) to illustrate how Marr’s levels of analysis can be applied to AI interpretability research. Here’s a detailed breakdown of how the authors suggest approaching this topic at each level:
1.Computational Level:
Define criteria for classifying LLM behavior as deception
Develop comprehensive benchmarks to measure deceptive behaviors across various conditions
Conduct thorough behavioral assessments while systematically varying: a) Input properties (e.g., prompting strategies, specific word usage) b) Model properties (e.g., training data composition, fine-tuning tasks, model architecture)
Develop theoretical (possibly Bayesian) models to predict when an LLM would exhibit deceptive behavior given particular inputs and internal states
Use these insights to adjust training and inference procedures to control LLM deception
2. Algorithmic/Representational Level:
Examine representational spaces across different layers of the LLM
Test which spaces systematically separate instances of deception from instances of honesty
Trace how specific inputs (e.g., “deception triggers”) alter the latent space trajectory of an LLM
Analyze how generated tokens deviate from the ground truth trajectory to a deception trajectory
Use high-level representational trajectories to generalize over specific word choices to more general cases of deception
Determine whether the model uses a single algorithm for different deception cases or if deception is supported by multiple diverse algorithms
3. Implementation Level:
Trace the specific circuits that selectively activate during deceptive behavior
Identify circuits that causally contribute to deceptive behavior
Map out circuits responsible for tracking different kinds of deception-relevant information, such as: a) The true state of the world b) The agent’s intention c) The user’s current and predicted beliefs
Use insights from computational and algorithmic levels to guide the search for relevant circuits
Conversely, use circuit-level findings to refine algorithmic-level hypotheses
The authors emphasize that this multi-level approach allows for a more comprehensive understanding of the phenomenon:
Computational-level analysis can yield insights for adjusting training and inference procedures.
Algorithmic-level analysis can reveal how the model generalizes across different instances of deception.
Implementation-level analysis can pinpoint specific mechanisms that can be targeted for intervention.
They also stress the importance of iterating between levels. For example, computational and algorithmic understanding can guide the search for relevant circuits at the implementation level, while circuit-level findings can help refine higher-level hypotheses. This example illustrates how adopting a multi-level approach can provide a more holistic and actionable understanding of complex behaviors in AI systems, potentially leading to more effective interventions and improvements in model design.′
An intuition for safety cases for automated safety research over time
Safety cases—we want to be able to make a (conservative) argument for why a certain AI system won’t e.g. pose x-risk with probability > p / year. Rely on composing safety arguments / techniques into a ‘holistic case’.
Safety arguments are rated on three measures:
Practicality: ‘Could the argument be made soon or does it require substantial research progress?’
Maximum strength: ‘How much confidence could the argument give evaluators that the AI systems are safe?’
Scalability: ‘To what extent can the argument be applied to very powerful AI systems?’
Here’s an intuition of how to iteratively build safety cases for automated safety research, as AI systems become increasingly capable:
we want to ‘pile on’ / combine methods which are Practical enough already and not too alignment-tax-costly (e.g. assume a limited budget in e.g. [training and / or inference] compute and / or money and / or researchers), even though they might not be very Scalable (in the long run);
as time goes by and AIs become more capable, we should expect some of our methods’ Scalability to be ‘outpaced’ by the capabilities of new systems (e.g. maybe they’ll pass [some] DC evals → we won’t be able to rely on [some] inability arguments anymore);
we want to be able to use auto safety research to raise the Practicality of methods which are Scalable enough, but not Practical enough yet
e.g. unlearning, interpretability (see figure above)
we need our auto safety researchers to be [reasonably likely] capable of this, though.
Acknowledgments: early versions of this idea were developed during my participantion in the Astra Fellowship Winter ’24 under @evhub’s mentorship and benefitted from many conversations in Constellation.
These might be some of the most neglected and most strategically-relevant ideas about AGI futures: Pareto-topian goal alignment and ‘Pareto-preferred futures, meaning futures that would be strongly approximately preferred by more or less everyone‘: https://www.youtube.com/watch?v=1lqBra8r468. These futures could be achievable because automation could bring massive economic gains, which, if allocated (reasonably, not-even-necessarily-perfectly) equitably, could make ~everyone much better off (hence the ‘strongly approximately preferred by more or less everyone’). I think this discourse could be crucial to incentivize less racing more coordination, including e.g. for pausing at the right time to allow for more AI safety work to get done, yet I’m not seeing it almost anywhere in the public sphere.
I’d be interested in seeing the strongest arguments (e.g. safety-washing?) for why, at this point, one shouldn’t collaborate with OpenAI (e.g. not even part-time, for AI safety [evaluations] purposes).
If we’re taking the perspective of the entire community, this bears less weight, but: it likely becomes close-to-impossible to criticize OpenAI from that point forward. I’m not even anti-OpenAI, I just try to be truth-seeking where I think people are dropping the ball, and I think there’s almost 0 chance I’ll be able to work with OpenAI in the future given my comments on Twitter.
Claude-3 Opus on using advance market committments to incentivize automated AI safety R&D:
’Advance Market Commitments (AMCs) could be a powerful tool to incentivize AI labs to invest in and scale up automated AI safety R&D. Here’s a concrete proposal for how AMCs could be structured in this context:
Government Commitment: The US government, likely through an agency like DARPA or NSF, would commit to purchasing a certain volume of AI safety tools and technologies that meet pre-specified criteria, at a guaranteed price, if and when they are developed. This commitment would be legally binding.
Funding Allocation: A significant pool of funds, on the order of several billion dollars, would be allocated to back these AMCs. The exact amount would be determined based on estimates of the funding needed to meaningfully accelerate AI safety R&D over a 5-10 year time horizon.
Clear Technical Criteria: The government would work with AI safety experts and leading labs to establish clear, measurable technical criteria that AI safety tools and technologies must meet to qualify for the AMC. These could include:
Ability to automatically detect and mitigate certain classes of AI risks (e.g. robustness issues, unexpected behaviors, reward hacking, scalable oversight failures)
Formal verifiability and interpretability of AI systems’ reasoning and decision-making
Automated generation of comprehensive test suites and monitoring for AI systems
Performance on benchmark tasks that evaluate AI safety and robustness
Seamless integration with common AI development frameworks and workflows
Pricing and Volume Guarantees: For each category of AI safety tech that meets the specified criteria, the government would guarantee a certain price per unit and commit to purchasing a minimum volume. These numbers would be set at levels that meaningfully derisk and subsidize the substantial R&D investment required.
Contractual Arrangements: AI labs would enter into binding contracts with the government to develop qualifying technologies. Upon successful development, the government would be obligated to purchase the technologies at the predetermined prices. IP and usage rights would be negotiated upfront.
Staged Releases and Readjustment: As AI capabilities advance rapidly, the AMC criteria would be staged over time to incentivize continual progress. Initial qualifying thresholds would target nearer-term safety challenges, with subsequent releases focused on more advanced issues. Pricing and volumes could be readjusted periodically as the technology matures.
Open Collaboration and Access: To maximize the public benefit, AMC contracts would stipulate that a substantial portion of the safety IP developed must be made freely available for research use and external validation. Certain core components could be open-sourced. Purchased tools would also be made available to developers and labs beyond the initial partners.
International Coordination: While led by the US, efforts would be made to coordinate and collaborate with governments and labs globally. Shared standards and compatible AMCs across regions would amplify the impact. Joint commitments could pool resources and expertise.
The goal of this AMC structure is to create a clear, long-term market pull that shifts the incentives for AI labs and makes large investments in automated AI safety R&D economically attractive and even necessary for competitive positioning. The guaranteed demand would provide greater certainty and help crowd in private capital. Staged, iterative AMCs would support sustained progress and adapt to evolving challenges. Broad access to the resulting safety IP and tools would uplift capabilities across the AI ecosystem.
Alongside other supporting policies and funding mechanisms, AMCs for automated AI safety technology could be a centerpiece of public-private collaboration to ensure advanced AI systems are developed responsibly. Robust market incentives, combined with technical leadership from top labs, could dramatically accelerate the development and deployment of scalable AI safety solutions.′
‘Summarize ‘Causal Explainable Guardrails for Large Language Models’. In particular, could this be useful to deal with some of the challenges mentioned in ‘Discussion: Challenges with Unsupervised LLM Knowledge Discovery’?
The paper “A Causal Explainable Guardrails for Large Language Models” proposes LLMGuardaril, a framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. Key points:Existing methods for steering LLMs towards desired attributes rely on unbiased representations and steering prompts, but the representations learned from pre-training can introduce semantic biases that negatively impact steering. LLMGuardaril systematically identifies and blocks the confounding effects of biases, enabling extraction of unbiased steering representations. It also includes an explainable component providing insights into alignment between generated output and desired direction. Experiments show LLMGuardaril can effectively steer LLMs towards desired attributes while mitigating biases.Regarding the challenges mentioned in “Discussion: Challenges with Unsupervised LLM Knowledge Discovery”:One key challenge is distinguishing the model’s own knowledge from knowledge of simulated entities. LLMGuardaril aims to disentangle the influence of semantic biases to obtain steering representations capturing desired attributes. This may help identify model’s own knowledge if those attributes are specific to the model itself. Another challenge is the hypothesized existence of many features satisfying consistency properties beyond just the model’s knowledge. LLMGuardaril’s causal analysis and adversarial learning aims to block confounding semantic biases, potentially reducing the set of candidate knowledge-like features. Empirically, contrast-consistent search seemed to find prominent arbitrary features rather than knowledge. LLMGuardaril’s explainable component could provide insight into whether extracted features actually correspond to model knowledge.
So in summary, while not a complete solution, LLMGuardaril’s approach of causally analyzing and debiasing semantic influences may help address some key challenges around identifying model knowledge in an unsupervised way. But significant open questions remain around definitively distinguishing model knowledge from simulated knowledge of other entities. Further research is needed to determine how much traction causal analysis provides on the hardest aspects of the ELK problem.′
‘How might MAIA be useful, potentially combined with ‘Causal Explainable Guardrails for Large Language Models’, to deal with some of the challenges mentioned in ‘Discussion: Challenges with Unsupervised LLM Knowledge Discovery’?
MAIA and the techniques from “Causal Explainable Guardrails for Large Language Models” (LLMGuardaril) could potentially be combined to help address some of the challenges mentioned in “Discussion: Challenges with Unsupervised LLM Knowledge Discovery”:
Distinguishing the model’s own knowledge from simulated knowledge: One key challenge raised is that consistency-based methods like contrast-consistent search (CCS) may not be able to distinguish an LLM’s “own” knowledge from the knowledge of simulated entities the model may be imagining. MAIA’s ability to automatically generate hypotheses and design experiments to test them could potentially be used to probe for features that are uniquely characteristic of the model’s own knowledge. For example, MAIA could generate synthetic inputs that vary the context in which a piece of knowledge is elicited (e.g. with/without an explicit simulation frame) and test whether the model’s responses are consistent.
Reducing the set of candidate knowledge-like features: Another challenge is that many features beyond the model’s own knowledge may satisfy consistency properties, making it hard to identify the “true” knowledge. LLMGuardaril’s technique of causally analyzing the influence of semantic biases on steering representations and using adversarial learning to debias them could potentially be applied by MAIA to narrow down the set of features that robustly encode model knowledge across prompts/contexts. The explainable component of LLMGuardaril could also help MAIA validate whether candidate knowledge features behave as expected.
Overcoming the limitations of one-shot probing: Existing automated interpretability approaches are noted to primarily be tools for one-shot hypothesis generation on a limited set of inputs, rather than scalable and fully rigorous causal explanation. MAIA’s ability to iteratively refine hypotheses based on experimental outcomes could lead to more robust explanations. Integrating the causal analysis techniques from LLMGuardaril into this iterative process could help rule out spurious hypotheses.
Evaluating unsupervised knowledge discovery methods: The paper emphasizes the need for principled testbeds to evaluate the effectiveness of unsupervised knowledge discovery techniques. The synthetic test systems used to quantitatively evaluate MAIA’s performance when ground-truth structure is known could potentially be exte’nded to benchmark a wider variety of unsupervised interpretability approaches.
However, significant challenges would likely need to be overcome to fruitfully combine these techniques. For example, MAIA is currently focused on interpreting components of vision models, while LLMGuardaril focuses on steering language models, so unifying them would require bridging this modality gap. More fundamentally, even with access to MAIA’s automated experimentation and LLMGuardaril’s causal analysis, rigorously identifying an LLM’s “own” knowledge in a fully unsupervised way, as opposed to the knowledge of a specific simulated entity, remains an open challenge and likely requires further conceptual breakthroughs. Nonetheless, exploring the intersection of scalable interpretability agents like MAIA and causally grounded analysis techniques like LLMGuardaril seems like a promising direction for future work on the ELK problem.′
‘How could the above be combined with ‘Catching AIs red-handed’ and, in particular, the ideas mentioned in paulfchristiano’s comment?
The approaches discussed in “A Multimodal Automated Interpretability Agent” (MAIA) and “Causal Explainable Guardrails for Large Language Models” (LLMGuardaril) could potentially be combined with the ideas in “Catching AIs red-handed” and paulfchristiano’s comment to create a more comprehensive strategy for detecting and mitigating risks from deceptively aligned AI systems:
Using MAIA to automatically generate and test hypotheses about deceptive behavior: As discussed in the “Catching AIs red-handed” post, once a suspicious action has been flagged, we can use techniques like varying the AI’s context and beliefs to test whether the behavior is truly deceptive. MAIA’s automated experimentation capabilities could potentially be used to scale up this kind of hypothesis generation and testing, systematically exploring the “landscape of deceptive behavior” as paulfchristiano puts it.
Applying causal analysis from LLMGuardaril to rule out alternative explanations: A key challenge raised in the post and comments is distinguishing true deception from an AI system just “kind of unlucky in how its context shifted it into a more deceptive role-play”. The causal analysis techniques from LLMGuardaril, which aim to disentangle the influence of semantic context on an AI’s behavior, could potentially be applied by MAIA to more confidently establish whether a pattern of suspicious behavior is caused by intentional deception vs. an unlucky context.
Using MAIA’s explainable component to interpret the AI’s reasoning: In paulfchristiano’s examples of analyzing a suspicious code review or context-dependent behavior, a critical step is understanding what information the AI is using to make its decisions. MAIA’s explainable component, which measures the alignment between an AI’s outputs and experimenter-specified directions, could potentially shed light on the reasoning behind a deceptive action.
Scaling up detection of hard-to-elicit capabilities: “Catching AIs red-handed” notes that real-world deceptive behavior could reveal capabilities that are hard to measure during normal testing. MAIA’s ability to automatically generate diverse test inputs based on a small number of real examples could help systematically probe for these hidden capabilities, reducing the AI’s “space of interventions...where they are more and more confident that it’s never going to be noticed.”
Iterative refinement of monitoring systems: Finally, as an AI system’s deceptive behavior is progressively uncovered through a combination of MAIA’s experimentation and LLMGuardaril’s causal analysis, the resulting dataset of confirmed deceptive actions could be used to fine-tune untrusted monitoring systems to more reliably catch future deception attempts, similar to the “few-shot catastrophe prevention” techniques outlined in the post.
Significant open questions remain about the feasibility of scaling these approaches to be robust against highly capable AI systems. There are also important limitations to consider: for example, MAIA’s causal analysis is unlikely to provide a fully conclusive “smoking gun” that an AI system is deceptive, as opposed to strong probabilistic evidence. And sufficiently capable systems may find ways to avoid exhibiting any detectable deceptive behavior under the kind of systematic experimentation enabled by MAIA.
Nonetheless, I believe the combination of automated interpretability agents, causally grounded analysis of AI reasoning, and the overall adversarial “red team” mindset advocated in “Catching AIs red-handed” represents a promising direction for building more robust monitoring and containment of potentially deceptive AI systems. A concerted effort to develop and integrate these approaches could significantly improve our odds of detecting and mitigating catastrophic deceptive alignment failures.′
’Any ways to combine all the previously mentioned ideas (A Causal Explainable Guardrails for Large Language Models, Discussion: Challenges with Unsupervised LLM Knowledge Discovery and A Multimodal Automated Interpretability Agent (MAIA), ‘Catching AIs red-handed’, including paulfchristiano’s comment) with ideas from ‘Searching for a model’s concepts by their shape – a theoretical framework’ to ‘catch AIs red-handed’?
Here are a few additional ideas for combining all the mentioned papers and comments to try catching deceptive models red-handed:
Using the “deliberate adversarial experiments” mentioned in paulfchristiano’s comment, like varying the model’s beliefs about how closely it is being monitored, to surface cases where the model’s behavior is suspiciously sensitive to being watched. The theoretical framework from “Searching for concepts by their shape” could be used to formalize the hypothesis that the model has a hidden “deception mode” that gets triggered when it believes oversight is lax. MAIA’s automated experimentation tools could be used to efficiently search for inputs exhibiting this sensitivity.
Comparing the features extracted by the interpretability tools from “Searching for concepts by their shape” across different copies of the model that have been adversarially perturbed to have different beliefs (per paulfchristiano’s comment). If there are decision-relevant features (e.g. representations of the model’s objectives) that are suspiciously different in the copy that believes it is unobserved, that would be strong evidence of deceptive cognition. The causal analysis from “Causal Explainable Guardrails” could help determine whether those differing features are causally linked to the differing outputs.
Training MAIA’s detectors for suspicious model outputs not just on straightforwardly deceptive behaviors, but on subtler signs of agency-like cognition hypothesized in paulfchristiano’s comment, e.g. the model giving worse answers when it believes it is unobserved. This could yield more robust detection that is harder for the model to circumvent by avoiding overt deception.
If interpretability tools manage to extract a model’s representations of its beliefs, objectives, plans, etc. (a big if!), use the techniques from “Catching AIs red-handed” to examine how those representations differ on the inputs that were “successfully” deceptive (i.e. where the model behaved deceptively but was not caught). There may be learnable patterns in how the model’s cognition differs in those cases which could inspire hypotheses about what relations characterize “successful” deception. Those hypotheses could then be formalized and tested using the framework from “Searching for concepts by their shape”.
A point emphasized in paulfchristiano’s comment is the importance of extreme thoroughness and breadth in searching for deceptive cognition, e.g. (to paraphrase) “even if a model exhibits some deceptive behavior, don’t assume you’ve found all of its tricks; keep probing its behavior in a huge variety of contexts”. MAIA and other automated testing tools should be put in service of this kind of breadth. Rather than just focus on the most egregious cases of deception, we should test the model’s behavior on a vast range of subtle perturbations hunting for any whiff of situational inconsistency. The theoretical frameworks for agency from “Searching for concepts by their shape” can guide us in knowing what to look for.
Extending my previous thoughts, I think one of the most useful things the “Searching for concepts by their shape” framing can contribute is mathematical precision in defining hypotheses about what characterizes deceptive cognition. If we can frame agency/deception in terms of precise relations between features (even if those features are hard to interpretably extract in practice), it gives us a crisper target to aim at with our testing, rather than just relying on intuitive notions of what “seems deceptive”.
However, I continue to be quite pessimistic about the prospects for catching competent deceptive models before it’s too late, for the reasons outlined previously. I worry that by the time a model is smart enough that we could meaningfully apply these combined interpretability strategies to it, it would also be smart enough to notice what we’re doing and come up with countermeasures.
Unless interpretability tools far exceed my expectations in their ability to crack open the black box of neural nets (which would be great!), I suspect the most practical path is to use simpler, more behavioral detection strategies (like the model consistency testing from your Control paper) to avoid building models that are competent enough to be robustly deceptive in the first place. Catching a superintelligent deceptive model after the fact seems like a much harder challenge than noticing warning signs of misalignment in weaker models and heading them off early.
But I would still encourage pursuing all these ideas, even if they end up not being enough on their own to solve the problem. The interpretability insights gained would be valuable for alignment more broadly. And we may get lucky in extracting some key structural signature of deceptive cognition that even very sophisticated models can’t fully mask. It’s an arms race we can’t afford to unilaterally forfeit.′
I wonder how much near-term interpretability [V]LM agents (e.g. MAIA, AIA) might help with finding better probes and better steering vectors (e.g. by iteratively testing counterfactual hypotheses against potentially spurious features, a major challenge for Contrast-consistent search (CCS)).
This seems plausible since MAIA can already find spurious features, and feature interpretability [V]LM agents could have much lengthier hypotheses iteration cycles (compared to current [V]LM agents and perhaps even to human researchers).
I might have updated at least a bit against the weakness of single-forward passes, based on intuitions about the amount of compute that huge context windows (e.g. Gemini 1.5 − 1 million tokens) might provide to a single-forward-pass, even if limited serially.
Somewhat relatedly: I’m interested on how well LLMs can solve tasks in parallel. This seems very important to me.[1]
The “I’ve thought about this for 2 minutes” version is: Hand an LLM two multiple choice questions with four answer options each. Encode these four answer options into a single token, so that there are 16 possible tokens of which one corresponds to the correct answer to both questions. A correct answer means that the model has solved both tasks in one forward-pass.
(One can of course vary the number of answer options and questions. I can see some difficulties in implementing this idea properly, but would nevertheless be excited if someone took a shot at it.)
- For serial computation the number of layers gives some very rough indication of the strength of one forward-pass, but it’s harder to have intuitions for parallel computation.
- For scheming, the model could reason about “should I still stay undercover”, “what should I do in case I should stay undercover” and “what should I do in case it’s time to attack” in parallel, finally using only one serial step to decide on its action.
There might be more such results I’m currently forgetting about, but they should be relatively easy to find by e.g. following citation trails (to and from the above references) with Google Scholar (or by looking at my recent comments / short forms).
- For scheming, the model could reason about “should I still stay undercover”, “what should I do in case I should stay undercover” and “what should I do in case it’s time to attack” in parallel, finally using only one serial step to decide on its action.
I am also very interested in e.g. how one could operationalize the number of hops of inference of out-of-context reasoning required for various types of scheming, especially scheming in one-forward-pass; and especially in the context of automated AI safety R&D.
“We show that transformers can use meaningless filler tokens (e.g., ‘......’) in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge.”
I looked over it and I should note that “transformers are in TC0” is not very useful statement for prediction of capabilities. Transformers are Turing-complete given rational inputs (see original paper) and them being in TC0 basically means they can implement whatever computation you can implement using boolean circuit for fixed amount of available compute which amounts to “whatever computation is practical to implement”.
I think I remember William Merrill (in a video) pointing out that the rational inputs assumption seems very unrealistic (would require infinite memory); and, from what I remember, https://arxiv.org/abs/2404.15758 and related papers made a different assumption about the number of bits of memory per parameter and per input.
Everything Turing-complete requires infinite memory. When we are saying “x86 set of instructions is Turing-complete” we imply “assuming that processor operates on infinite memory”. It’s in definition of Turing machine to include infinite tape, after all.
It’s hard to pinpoint, but the trick is that it’s very nuanced difference between the sense in which transformers are limited in complexity-theoretic sense and “transformers can’t do X”. Like, there is nothing preventing transformers from playing chess perfectly—they just need to be sufficiently large for this. To answer the question “can transformers do X” you need to ask “how much computing power transformer has” and “can this computing power be shaped by SGD into solution”.
Similarly, I find that GPT-3, GPT-3.5, and Claude 2 don’t benefit from filler tokens. However, GPT-4 (which Tamera didn’t study) shows mixed results with strong improvements on some tasks and no improvement on others.
It’s interesting question whether Gemini has any improvements.
Larger LMs seem to benefit differentially more from tools: ‘Absolute performance and improvement-per-turn (e.g., slope) scale with model size.’ https://xingyaoww.github.io/mint-bench/. This seems pretty good for safety, to the degree tool usage is often more transparent than model internals.
speculatively, it might also be fruitful to go about this the other way round, e.g. try to come up with better weight-space task erasure methods by analogy between concept erasure methods (in activation space) and through the task arithmetic—activation engineering link
(As reply to Zvi’s ‘If someone was founding a new AI notkilleveryoneism research organization, what is the best research agenda they should look into pursuing right now?’)
LLMs seem to represent meaning in a pretty human-like way and this seems likely to keep getting better as they get scaled up, e.g. https://arxiv.org/abs/2305.11863. This could make getting them to follow the commonsense meaning of instructions much easier. Also, similar methodologies to https://arxiv.org/abs/2305.11863 could be applied to other alignment-adjacent domains/tasks, e.g. moral reasoning, prosociality, etc.
I’d say the best argument against it is a combination of precedent-setting concerns, where once they started open-sourcing, it’d be hard to stop doing it even if it becomes dangerous to do, combined with misuse risk for now seeming harder to solve than misalignment risk, and in order for open-source to be good, you need to prevent both misalignment and people misusing their models.
I agree Sakana AI is safe to open source, but I’m quite sure sometime in the next 10-15 years, the AIs that do get developed will likely be very dangerous to open-source them, at least for several years.
’🚨 The annual report of the US-China Economic and Security Review Commission is now live. 🚨
Its top recommendation is for Congress and the DoD to fund a Manhattan Project-like program to race to AGI.
Buckle up...′
https://x.com/hamandcheese/status/1858897287268725080
In the reuters article they highlight Jacob Helberg: https://www.reuters.com/technology/artificial-intelligence/us-government-commission-pushes-manhattan-project-style-ai-initiative-2024-11-19/
He seems quite influential in this initiative and recently also wrote this post:
https://republic-journal.com/journal/11-elements-of-american-ai-supremacy/
Wikipedia has the following paragraph on Helberg:
“ He grew up in a Jewish family in Europe.[9] Helberg is openly gay.[10] He married American investor Keith Rabois in a 2018 ceremony officiated by Sam Altman.”
Might this be an angle to understand the influence that Sam Altman has on recent developments in the US government?
This chapter on AI follows immediately after the year in review, I went and checked the previous few years’ annual reports to see what the comparable chapters were about, they are
2023: China’s Efforts To Subvert Norms and Exploit Open Societies
2022: CCP Decision-Making and Xi Jinping’s Centralization Of Authority
2021: U.S.-China Global Competition (Section 1: The Chinese Communist Party’s Ambitions and Challenges at its Centennial
2020: U.S.-China Global Competition (Section 1: A Global Contest For Power and Influence: China’s View of Strategic Competition With the United States)
And this year it’s Technology And Consumer Product Opportunities and Risks (Chapter 3: U.S.-China Competition in Emerging Technologies)
Reminds of when Richard Ngo said something along the lines of “We’re not going to be bottlenecked by politicians not caring about AI safety. As AI gets crazier and crazier everyone would want to do AI safety, and the question is guiding people to the right AI safety policies”
I think we’re seeing more interest in AI, but I think interest in “AI in general” and “AI through the lens of great power competition with China” has vastly outpaced interest in “AI safety”. (Especially if we’re using a narrow definition of AI safety; note that people in DC often use the term “AI safety” to refer to a much broader set of concerns than AGI safety/misalignment concerns.)
I do think there’s some truth to the quote (we are seeing more interest in AI and some safety topics), but I think there’s still a lot to do to increase the salience of AI safety (and in particular AGI alignment) concerns.
’The report doesn’t go into specifics but the idea seems to be to build / commandeer the computing resources to scale to AGI, which could include compelling the private labs to contribute talent and techniques.
DX rating is the highest priority DoD procurement standard. It lets DoD compel companies, set their own price, skip the line, and do basically anything else they need to acquire the good in question.′ https://x.com/hamandcheese/status/1858902373969564047
(screenshot in post from PDF page 39 of https://www.uscc.gov/sites/default/files/2024-11/2024_Annual_Report_to_Congress.pdf)
‘China hawk and influential Trump AI advisor Jacob Helberg asserted to Reuters that “China is racing towards AGI,” but I couldn’t find any evidence in the report to support that claim.’ https://x.com/GarrisonLovely/status/1859022323799699474
I suspect current approaches probably significantly or even drastically under-elicit automated ML research capabilities.
I’d guess the average cost of producing a decent ML paper is at least 10k$ (in the West, at least) and probably closer to 100k’s $.
In contrast, Sakana’s AI scientist cost on average 15$/paper and .50$/review. PaperQA2, which claims superhuman performance at some scientific Q&A and lit review tasks, costs something like 4$/query. Other papers with claims of human-range performance on ideation or reviewing also probably have costs of <10$/idea or review.
Even the auto ML R&D benchmarks from METR or UK AISI don’t give me at all the vibes of coming anywhere near close enough to e.g. what a 100-person team at OpenAI could accomplish in 1 year, if they tried really hard to automate ML.
A fairer comparison would probably be to actually try hard at building the kind of scaffold which could use ~10k$ in inference costs productively. I suspect the resulting agent would probably not do much better than with 100$ of inference, but it seems hard to be confident. And it seems harder still to be confident about what will happen even in just 3 years’ time, given that pretraining compute seems like it will probably grow about 10x/year and that there might be stronger pushes towards automated ML.
This seems pretty bad both w.r.t. underestimating the probability of shorter timelines and faster takeoffs, and in more specific ways too. E.g. we could be underestimating by a lot the risks of open-weights Llama-3 (or 4 soon) given all the potential under-elicitation.
The Sakana AI stuff is basically total bogus, as I’ve pointed out on like 4 other threads (and also as Scott Alexander recently pointed out). It does not produce anything close to fully formed scientific papers. It’s output is really not better than just prompting o1 yourself. Of course, o1 and even Sonnet and GPT-4 are very impressive, but there is no update to be made after you’ve played around with that.
I agree that ML capabilities are under-elicited, but the Sakana AI stuff really is very little evidence on that, besides someone being good at marketing and setting up some scaffolding that produces fake prestige signals.
(Again) I think this is missing the point that we’ve now (for the first time, to my knowledge) observed an early demo the full research workflow being automatable, as flawed as the outputs might be.
I completely agree, and we should just obviously build an organization around this. Automating alignment research while also getting a better grasp on maximum current capabilities (and a better picture of how we expect it to grow).
(This is my intention, and I have had conversations with Bogdan about this, but I figured I’d make it more public in case anyone has funding or ideas they would like to share.)
Figures 3 and 4 from MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering seem like some amount of evidence for this view:
Also, there are notable researchers and companies working on developing ‘a truly general way of scaling inference compute’ right now and I think it would be cautious to consider what happens if they succeed.
(This also has implications for automating AI safety research).
To spell it out more explicitly, the current way of scaling inference (CoT) seems pretty good vs. some of the most worrying threat models, which often depend on opaque model internals.
Eliezer (among others in the MIRI mindspace) has this whole spiel about human kindness/sympathy/empathy/prosociality being contingent on specifics of the human evolutionary/cultural trajectory, e.g. https://twitter.com/ESYudkowsky/status/1660623336567889920 and about how gradient descent is supposed to be nothing like that https://twitter.com/ESYudkowsky/status/1660623900789862401. I claim that the same argument (about evolutionary/cultural contingencies) could be made about e.g. image aesthetics/affect, and this hypothesis should lose many Bayes points when we observe concrete empirical evidence of gradient descent leading to surprisingly human-like aesthetic perceptions/affect, e.g. The Perceptual Primacy of Feeling: Affectless machine vision models robustly predict human visual arousal, valence, and aesthetics; Towards Disentangling the Roles of Vision & Language in Aesthetic Experience with Multimodal DNNs; Controlled assessment of CLIP-style language-aligned vision models in prediction of brain & behavioral data; Neural mechanisms underlying the hierarchical construction of perceived aesthetic value.
12/10/24 update: more, and in my view even somewhat methodologically stronger, evidence: On the cognitive alignment between humans and machines.
hmm. i think you’re missing eliezer’s point. the idea was never that AI would be unable to identify actions which humans consider good, but that the AI would not have any particular preference to take those actions.
But my point isn’t just that the AI is able to produce similar ratings to humans’ for aesthetics, etc., but that it also seems to do so through at least partially overlapping computational mechanisms to humans’, as the comparisons to fMRI data suggest.
I don’t think having a beauty-detector that works the same way humans’ beauty-detectors do implies that you care about beauty?
Agree that it doesn’t imply caring for. But I think given cumulating evidence for human-like representations of multiple non-motivational components of affect, one should also update at least a bit on the likelihood of finding / incentivizing human-like representations of the motivational component(s) too (see e.g. https://en.wikipedia.org/wiki/Affect_(psychology)#Motivational_intensity_and_cognitive_scope).
Even if Eliezer’s argument in that Twitter thread is completely worthless, it remains the case that “merely hoping” that the AI turns out nice is an insufficiently good argument for continuing to create smarter and smarter AIs. I would describe as “merely hoping” the argument that since humans (in some societies) turned out nice (even though there was no designer that ensured they would), the AI might turn out nice. Also insufficiently good is any hope stemming from the observation that if we pick two humans at random out of the humans we know, the smarter of the two is more likely than not to be the nicer of the two. I certainly do not want the survival of the human race to depend on either one of those two hopes or arguments! Do you?
Eliezer finds posting on the internet enjoyable, like lots of people do. He posts a lot about, e.g., superconductors and macroeconomic policy. It is far from clear to me that he consider this Twitter thread to be relevant to the case against continuing to create smarter AIs. But more to the point: do you consider it relevant?
Contra both the ‘doomers’ and the ‘optimists’ on (not) pausing. Rephrased: RSPs (done right) seem right.
Contra ‘doomers’. Oversimplified, ‘doomers’ (e.g. PauseAI, FLI’s letter, Eliezer) ask(ed) for pausing now / even earlier - (e.g. the Pause Letter). I expect this would be / have been very much suboptimal, even purely in terms of solving technical alignment. For example, Some thoughts on automating alignment research suggests timing the pause so that we can use automated AI safety research could result in ‘[...] each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.’ We clearly don’t have such automated AI safety R&D capabilities now, suggesting that pausing later, when AIs are closer to having the required automated AI safety R&D capabilities would be better. At the same time, current models seem very unlikely to be x-risky (e.g. they’re still very bad at passing dangerous capabilities evals), which is another reason to think pausing now would be premature.
Contra ‘optimists’. I’m more unsure here, but the vibe I’m getting from e.g. AI Pause Will Likely Backfire (Guest Post) is roughly something like ‘no pause ever’; largely based on arguments of current systems seeming easy to align / control. While I agree with the point that current systems do seem easy to align / control and I could even see this holding all the way up to ~human-level automated AI safety R&D, I can easily see scenarios where around that time things get scary quickly without any pause. For example, similar arguments to those about the scalability of automated AI safety R&D suggest automated AI capabilities R&D could also be scaled up significantly. For example, figures like those in Before smart AI, there will be many mediocre or specialized AIs suggest very large populations of ~human-level automated AI capabilities researchers could be deployed (e.g. 100x larger than the current [human] population of AI researchers). Given that even with the current relatively small population, algorithmic progress seems to double LM capabilities ~every 8 months, it seems like algorithmic progress could be much faster with 100x larger populations, potentially leading to new setups (e.g. new AI paradigms, new architectures, new optimizers, synthetic data, etc.) which could quite easily break the properties that make current systems seem relatively easy / safe to align. In this scenario, pausing to get this right (especially since automated AI safety R&D would also be feasible) seems like it could be crucial.
At least Eliezer has been extremely clear that he is in favor of a stop not a pause (indeed, that was like the headline of his article “Pausing AI Developments Isn’t Enough. We Need to Shut it All Down”), so I am confused why you list him with anything related to “pause”.
My guess is me and Eliezer are both in favor of a pause, but mostly because a pause seems like it would slow down AGI progress, not because the next 6 months in-particular will be the most risky period.
The relevant criterion is not whether the current models are likely to be x-risky (it’s obviously far too late if they are!), but whether the next generation of models have more than an insignificant chance of being x-risky together with all the future frameworks they’re likely to be embedded into.
Given that the next generations are planned to involve at least one order of magnitude more computing power in training (and are already in progress!) and that returns on scaling don’t seem to be slowing, I think the total chance of x-risk from those is not insignificant.
I agree with some points here Bogdan, but not all of them.
I do think that current models are civilization-scale-catastrophe-risky (but importantly not x-risky!) from a misuse perspective, but not yet from a self-directed perspective. Which means neither Alignment nor Control are currently civilization-scale-catastrophe-risky, much less x-risky.
I also agree that pausing now would be counter-productive. My reasoning for this is that I agree with Samo Burja about some key points which are relevant here (while disagreeing with his conclusions due to other points).
To quote myself:
Think about how you’d expect these factors to change if large AI training runs were paused. I think you might agree that this would likely result in a temporary shift in much of the top AI scientist talent to making theoretical progress. They’d want to be ready to come in strong after the pause was ended, with lots of new advances tested at small scale. I think this would actually result more high quality scientific thought directed at the heart of the problem of AGI, and thus make AGI very likely to be achieved sooner after the pause ends than it otherwise would have been.
I would go even farther, and make the claim that AGI could arise during a pause on large training runs. I think that the human brain is not a supercomputer, my upper estimate for ‘human brain inference’ is about at the level of a single 8x A100 server. Less than an 8x H100 server. Also, I have evidence from analysis of the long-range human connectome (long range axons are called tracts, so perhaps I should call this a ‘tractome’). [Hah, I just googled this term I came up with just now, and found it’s already in use, and that it brings up some very interesting neuroscience papers. Cool.] Anyway… I was saying, this evidence shows that the range of bandwidth (data throughput in bits per second) between two cortical regions in the human brain is typically around 5 mb/s, and maxes out at about 50 mb/s. In other words, well within range for distributed federated training runs to work over long distance internet connections. So unless you are willing to monitor the entire internet so robustly that nobody can scrape together the equivalent compute of an 8X A100 server, you can’t fully block AGI.
Of course, if you wanted to train the AGI in a reasonable amount of time, you’d want to do a parallel run of much more than a single inference instance of compute. So yeah, it’d definitely make things inconvenient if an international government were monitoring all datacenters… but far from impossible.
For the same reason, I don’t think a call to ‘Stop AI development permanently’ works without the hypothetical enforcement agency literally going around the world confiscating all personal computers and shutting down the internet. Not gonna happen, why even advocate for such a thing? Makes me think that Eliezer is advocating for this in order to have some intended effect other than this on the world.
Quick take on o1: overall, it’s been a pretty good day. Likely still sub-ASL-3, (opaque) scheming still seems very unlikely because the prerequisites still don’t seem there. CoT-style inference compute playing a prominent role in the capability gains is pretty good for safety, because differentially transparent. Gains on math and code suggest these models are getting closer to being usable for automated safety research (also for automated capabilities research, unfortunately).
CoT inference looks more like the training surface, not essential part of resulting cognition after we take one more step following such models. Orion is reportedly (being) pretrained on these reasoning traces, and if it’s on the order of 50 trillion tokens, that’s about as much as there is natural text data of tolerable quality in the world available for training. Contrary to the phrasing, what transformers predict is in part distant future tokens within a context, not proximate “next tokens” that follow immediately after whatever the prediction must be based on.
So training on reasoning traces should teach the models concepts that let them arrive at the answer faster, skipping the avoidable parts of the traces and compressing a lot of the rest into less scrutable activations. The models trained at the next level of scale might be quite good at that, to the extent not yet known from experience with the merely GPT-4 scale models.
Some bad news is that there was some problematic power seeking and instrumental convergence, though thankfully that only happened in an earlier model:
https://www.lesswrong.com/posts/bhY5aE4MtwpGf3LCo/openai-o1#JGwizteTkCrYB5pPb
Edit: It looks like the instrumentally convergent reasoning was because of the prompt, so I roll back my updates on instrumental convergence being likely:
https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#eLrDowzxuYqBy4bK9
Like transformers, SSMs like Mamba also have weak single forward passes: The Illusion of State in State-Space Models (summary thread). As suggested previously in The Parallelism Tradeoff: Limitations of Log-Precision Transformers, this may be due to a fundamental tradeoff between parallelizability and expressivity:
‘We view it as an interesting open question whether it is possible to develop SSM-like models with greater expressivity for state tracking that also have strong parallelizability and learning dynamics, or whether these different goals are fundamentally at odds, as Merrill & Sabharwal (2023a) suggest.’
Surely fundamentally at odds? You can’t spend a while thinking without spending a while thinking.
Of course, the lunch still might be very cheap by only spending a while thinking a fraction of the time or whatever.
‘Data movement bottlenecks limit LLM scaling beyond 2e28 FLOP, with a “latency wall” at 2e31 FLOP. We may hit these in ~3 years. Aggressive batch size scaling could potentially overcome these limits.’ https://epochai.org/blog/data-movement-bottlenecks-scaling-past-1e28-flop
The post argues that there is a latency limit at 2e31 FLOP, and I’ve found it useful to put this scale into perspective.
Current public models such as Llama 3 405B are estimated to be trained with ~4e25 flops , so such a model would require 500,000 x more compute. Since Llama 3 405B was trained with 16,000 H-100 GPUs, the model would require 8 billion H-100 GPU equivalents, at a cost of $320 trillion with H-100 pricing (or ~$100 trillion if we use B-200s). Perhaps future hardware would reduce these costs by an order of magnitude, but this is cancelled out by another factor; the 2e31 limit assumes a training time of only 3 months. If we were to build such a system over several years and had the patience to wait an additional 3 years for the training run to complete, this pushes the latency limit out by another order of magnitude. So at the point where we are bound by the latency limit, we are either investing a significant percentage of world GDP into the project, or we have already reached ASI at a smaller scale of compute and are using it to dramatically reduce compute costs for successor models.
Of course none of this analysis applies to the earlier data limit of 2e28 flop, which I think is more relevant and interesting.
An important caveat to the data movement limit:
“A recent paper which was published only a few days before the publication of our own work, Zhang et al. (2024), finds a scaling of B = 17.75 D^0.47 (in units of tokens). If we rigorously take this more aggressive scaling into account in our model, the fall in utilization is pushed out by two orders of magnitude; starting around 3e30 instead of 2e28. Of course, even more aggressive scaling might be possible with methods that Zhang et al. (2024) do not explore, such as using alternative optimizers.”
I haven’t looked carefully at Zhang et al., but assuming their analysis is correct and the data wall is at 3e30 FLOP, it’s plausible that we hit resource constraints ($10-100 trillion training runs, 2-20 TW power required) before we hit the data movement limit.
Speculatively, this might also differentially incentivize (research on generalized) inference scaling, with various potential strategic implications, including for AI safety (current inference scaling methods tend to be tied to CoT and the like, which are quite transparent) and for regulatory frameworks/proliferation of dangerous capabilities.
Aschenbrenner in Situational Awareness predicts illegible chains of thought are going to prevail because they are more efficient. I know of one developer claiming to do this (https://platonicresearch.com/) but I guess there must be many.
under the assumptions here (including Chinchilla scaling laws), depth wouldn’t increase by more than about 3x before the utilization rate starts dropping (because depth would increase with exponent about 1⁄6 of the total increase in FLOP); which seems like great news for the legibility of CoT outputs and similar and vs. opaque reasoning in models: https://lesswrong.com/posts/HmQGHGCnvmpCNDBjc/current-ais-provide-nearly-no-data-relevant-to-agi-alignment#mcA57W6YK6a2TGaE2
Interesting paper, though the estimates here don’t seem to account for Epoch’s correction to the chinchilla scaling laws: https://epochai.org/blog/chinchilla-scaling-a-replication-attempt
This would imply that the data movement bottleneck is a bit further out.
Success at currently-researched generalized inference scaling laws might risk jeopardizing some of the fundamental assumptions of current regulatory frameworks.
the o1 results have illustrated specialized inference scaling laws, for model capabilities in some specialized domains (e.g. math); notably, these don’t seem to generally hold for all domains—e.g. o1 doesn’t seem better than gpt4o at writing;
there’s ongoing work at OpenAI to make generalized inference scaling work;
e.g. this could perhaps (though maybe somewhat overambitiously) be framed, in the language of https://epochai.org/blog/trading-off-compute-in-training-and-inference, as there no longer being an upper bound in how many OOMs of inference compute can be traded for equivalent OOMs of pretraining;
to the best of my awarenes, current regulatory frameworks/proposals (e.g. the EU AI Act, the Executive Order, SB 1047) frame the capabilities of models in terms of (pre)training compute and maybe fine-tuning compute (e.g. if (pre)training FLOP > 1e26, the developer needs to take various measures), without any similar requirements framed in terms of inference compute; so current regulatory frameworks seem unprepared for translating between indefinite amounts of inference compute and pretraining compute (capabilities);
this would probably be less worrying in the case of models behind APIs, since it should be feasible to monitor for misuse (e.g. CBRN), especially since current inference scaling methods (e.g. CoT) seem relatively transparent; but this seems different for open-weights or leaked models;
notably, it seems like Llama-3 405b only requires 8 good GPUs to run on one’s own hardware (bypassing any API), apparently costing <100k$; the relatively low costs could empower a lot of bad actors, especially since guardrails are easy to remove and potentially more robust methods like unlearning still seem mostly in the research phase;
this would also have consequences for (red-teaming) evals; if running a model for longer potentially leads to more (misuse) uplift, it becomes much harder to upper-bound said uplift;
also, once a model’s weights are out, it might be hard to upperbound all the potential inference uplift that future fine-tunes might provide; e.g. suppose it did become possible to fine-tune 4o into something even better than o1, that could indeed do almost boundless inference scaling; this might then mean that somebody could then apply the same fine-tuning to Llama-3 405b and release that model open-weights, and then (others) mights potentially be able to misuse it in very dangerous ways using a lot of inference compute (but still comparatively low vs. pretraining compute);
OTOH, I do buy the claims from https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/ that until now open-weights have probably been net beneficial for safety, and I expect this could easily keep holding in the near-term (especially as long as open-weights models stay somewhat behind the SOTA); so I’m conflicted about what should be done, especially given the uncertainty around the probability of making generalized inference scaling work and over how much more dangerous this could make current or future open-weights releases; one potentially robust course of action might be to keep doing research on unlearning methods, especially ones robust to fine-tuning, like in Tamper-Resistant Safeguards for Open-Weight LLMs, which might be sufficient to make misuse (especially in some specialized domains, e.g. CBRN) as costly as needing to train from scratch
For similar reasons to the discussion here about why individuals and small businesses might be expected to be able to (differentially) contribute to LM scaffolding (research), I expect them to be able to differentially contribute to [generalized] inference scaling [research]; plausibly also the case for automated ML research agents. Also relevant: Before smart AI, there will be many mediocre or specialized AIs.
Quick take: on the margin, a lot more research should probably be going into trying to produce benchmarks/datasets/evaluation methods for safety (than more directly doing object-level safety research).
Some past examples I find valuable—in the case of unlearning: WMDP, Eight Methods to Evaluate Robust Unlearning in LLMs; in the case of mech interp—various proxies for SAE performance, e.g. from Scaling and evaluating sparse autoencoders, as well as various benchmarks, e.g. FIND: A Function Description Benchmark for Evaluating Interpretability Methods. Prizes and RFPs seem like a potentially scalable way to do this—e.g. https://www.mlsafety.org/safebench—and I think they could be particularly useful on short timelines.
Including differentially vs. doing the full stack of AI safety work—because I expect a lot of such work could be done by automated safety researchers soon, with the right benchmarks/datasets/evaluation methods. This could also make it much easier to evaluate the work of automated safety researchers and to potentially reduce the need for human feedback, which could be a significant bottleneck.
Better proxies could also make it easier to productively deploy more inference compute—e.g. from Large Language Monkeys: Scaling Inference Compute with Repeated Sampling: ‘When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.’ Similar findings in other references, e.g. in Trading Off Compute in Training and Inference.
Of course, there’s also the option of using automated/augmented safety research to produce such benchmarks, but it seems potentially particularly useful to have them ahead of time.
This also seems like an area where we can’t piggyback off of work from capabilities researchers, like will probably be significantly the case for applying automated researchers (to safety).
I’m torn between generally being really fucking into improving feedback loops (and thinking they are a good way to make it easier to make progress on confusing questions), and, being sad that so few people are actually just trying to actually directly tackle the hard bits of the alignment challenge.
Some quick thoughts:
automating research that ‘tackles the hard bits’ seems to probably be harder (happen chronologically later) and like it might be more bottlenecked by good (e.g. agent foundations) human researcher feedback—which does suggest it might be valuable to recruit more agent foundations researchers today
but my impression is that recruiting/forming agent foundations researchers itself is much harder and has much worse feedback loops
I expect if the feedback loops are good enough, the automated safety researchers could make enough prosaic safety progress, feeding into agendas like control and/or conditioning predictive models (+ extensions to e.g. steering using model internals, + automated reviewing, etc.), to allow the use of automated agent foundations researchers with more confidence and less human feedback
the hard bits might also become much less hard—or prove to be irrelevant—if for at least some of them empirical feedback could be obtained by practicing on much more powerful models
Overall, my impression is that focusing on the feedback loops seems probably significantly more scalable today, and the results would probably be enough to allow to mostly defer the hard bits to automated research.
(Edit: but also, I think if funders and field builders ‘went hard’, it might be much less necessary to choose.)
Potentially also https://www.lesswrong.com/posts/yxdHp2cZeQbZGREEN/improving-model-written-evals-for-ai-safety-benchmarking.
I think this one (and perhaps better operationalizations) should probably have many eyes on it:
I also have this market for GPQA, on a longer time-horizon:
https://manifold.markets/NiplavYushtun/will-the-gap-between-openweights-an
Jack Clark: ‘Registering a prediction: I predict that within two years (by July 2026) we’ll see an AI system beat all humans at the IMO, obtaining the top score. Alongside this, I would wager we’ll see the same thing—an AI system beating all humans in a known-hard competition—in another scientific domain outside of mathematics. If both of those things occur, I believe that will present strong evidence that AI may successfully automate large chunks of scientific research before the end of the decade.’ https://importai.substack.com/p/import-ai-380-distributed-13bn-parameter
Prediction markets on similar questions suggest to me that this is a consensus view.
General LLMs 44% to get gold on the IMO before 2026. This suggests the mathematical competency will be transferrable—not just restricted to domain-specific solvers.
LLMs favored to outperform PhD students in their own subject before 2026
With research automation in mind, here’s my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs. Defining the relevant set of questions as being drawn from the same difficulty/diversity/open-endedness distribution that PhDs would have posed in early 2024.
Fwiw, I’ve kind of already noted myself starting to do some of this, for AI safety-related papers; especially after Claude-3.5 Sonnet came out.
(cross-posted from https://x.com/BogdanIonutCir2/status/1844451247925100551, among others)
I’m concerned things might move much faster than most people expected, because of automated ML (figure from OpenAI’s recent MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering, basically showing automated ML engineering performance scaling as more inference compute is used):
https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=dyndDEn9qqdt9Mhx2
https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=groxciBxma5cgay7L
I find it pretty wild that automating AI safety R&D, which seems to me like the best shot we currently have at solving the full superintelligence control/alignment problem, no longer seems to have any well-resourced, vocal, public backers (with the superalignment team disbanded).
I think Anthropic is becoming this org. Jan Leike just tweeted:
https://x.com/janleike/status/1795497960509448617
On what basis do you think it’s the ‘best shot’? I used to think it was a good idea, a few years ago, but in retrospect I think that was just a computer scientist’s love of recursion. I don’t think that at present conditions are good for automating R&D. On the one hand, we have a lot of very smart people working on AI safety R&D, with very slow progress, indicating it is a hard problem. On the other hand, present-day LLMs are stupid at long-term planning, and acquiring new knowledge, which are things you need to be good at to do R&D.
What advantage do you see AIs having over humans in this area?
I think there will be a period in the future where AI systems (models and their scaffolding) exist which are sufficiently capable that they will be able to speed up many aspects of computer-based R&D. Including recursive-self-improvement, Alignment research and Control research. Obviously, such a time period will not be likely to last long given that surely some greedy actor will pursue RSI. So personally, that’s why I’m not putting a lot of faith in getting to that period [edit: resulting in safety].
I think that if you build the scaffolding which would make current models able to be substantially helpful at research (which would be impressively strong scaffolding indeed!), then you have built dual-use scaffolding which could also be useful for RSI. So any plans to do this must take appropriate security measures or they will be net harmful.
Agree with a lot of this, but scaffolds still seem to me pretty good, for reasons largely similar to those in https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model#Accelerating_LM_agents_seems_neutral__or_maybe_positive_.
I kind of wish this was true, because it would likely mean longer timelines, but my expectation is that the incoming larger LMs + better scaffolds + more inference-time compute could quite easily pass the threshold of significant algorithmic progress speedup from automation (e.g. 2x).
Making algorithmic progress and making safety progress seem to differ along important axes relevant to automation:
Algorithmic progress can use 1. high iteration speed 2. well-defined success metrics (scaling laws) 3.broad knowledge of the whole stack (Cuda to optimization theory to test-time scaffolds) 4. …
Alignment broadly construed is less engineering and a lot more blue skies, long horizon, and under-defined (obviously for engineering heavy alignment sub-tasks like jailbreak resistance, and some interp work this isn’t true).
Probably automated AI scientists will be applied to alignment research, but unfortunately automated research will differentially accelerate algorithmic progress over alignment. This line of reasoning is part of why I think it’s valuable for any alignment researcher (who can) to focus on bringing the under-defined into a well-defined framework. Shovel-ready tasks will be shoveled much faster by AI shortly anyway.
Generally agree, but I do think prosaic alignment has quite a few advantages vs. prosaic capabilities (e.g. in the extra slides here) and this could be enough to result in aligned (-enough) automated safety researchers which can be applied to the more blue skies parts of safety research. I would also very much prefer something like a coordinated pause around the time when safety research gets automated.
Agree, I’ve written about (something related to) this very recently.
Yes, things have certainly changed in the four months since I wrote my original comment, with the advent of o1 and Sakana’s Artificial Scientist. Both of those are still incapable of full automation of self-improvement, but they’re close. We’re clearly much closer to a recursive speed up of R&D, leading to FOOM.
from https://jack-clark.net/2024/08/18/import-ai-383-automated-ai-scientists-cyborg-jellyfish-what-it-takes-to-run-a-cluster/…, commenting on https://arxiv.org/abs/2408.06292: ‘Why this matters – the taste of automated science: This paper gives us a taste of a future where powerful AI systems propose their own ideas, use tools to do scientific experiments, and generate results. At this stage, what we have here is basically a ‘toy example’ with papers of dubious quality and insights of dubious import. But you know where we were with language models five years ago? We had things that could barely write a paragraph. Now they can do this. I predict that by the summer of 2026 we will have seen at least one genuinely interesting research paper that was soup-to-nuts generated via a tool-using generative AI system.’
Yep. I find it pretty odd that alignment people say, with a straight face, “LLM agents are still pretty dumb so I don’t think that’s a path to AGI”. I thought the whole field was about predicting progress and thinking out ahead of it.
Progress does tend to happen when humans work on something.
Progress probably happens when parahumans work on things, too.
Who says those things? That doesn’t really sound like something that people say. Like, I think there are real arguments about why LLM agents might not be the most likely path to AGI, but “they are still pretty dumb, therefore that’s not a path to AGI” seems like obviously a strawman, and I don’t think I’ve ever seen it (or at least not within the last 4 years or so).
Fair enough; this sentiment is only mentioned offhand in comments and might not capture very much of the average opinion. I may be misestimating the community’s average opinion. I hope I am wrong, and I’m glad to see others don’t agree!
I’m a bit puzzled on the average attitude toward LLM agents as a route to AGI among alignment workers. I’m still surprised there aren’t more people working directly on aligning LLM agents. I’d think we’d be working harder on the most likely single type of first AGI if lots of us really believed it’s reasonably likely to get there (especially since it’s probably the fastest route if it doesn’t plateau soon).
One possible answer is that people are hoping that the large amount of work on aligning LLMs will cover aligning LLM agents. I think that work is helpful but not sufficient, so we need more thinking about agent alignment as distinct from their base LLMs. I’m currently writing a post on this.
(Most people in AI Alignment work at scaling labs and are therefore almost exclusively working on LLM alignment. That said, I don’t actually know what it means to work on LLM alignment over aligning other systems, it’s not like we have a ton of traction on LLM alignment, and most techniques and insights seem general enough to not be conditional specifically on LLMs)
I think Seth is distinguishing “aligning LLM agents” from “aligning LLMs”, and complaining that there’s insufficient work on the former, compared to the latter? I could be wrong.
Ooh, I can speak to this. I’m mostly focused on technical alignment for actor-critic model-based RL systems (a big category including MuZero and [I argue] human brains). And FWIW my experience is: there are tons of papers & posts on alignment that assume LLMs, and with rare exceptions I find them useless for the non-LLM algorithms that I’m thinking about.
As a typical example, I didn’t get anything useful out of Alignment Implications of LLM Successes: a Debate in One Act—it’s addressing a debate that I see as inapplicable to the types of AI algorithms that I’m thinking about. Ditto for the debate on chain-of-thought accuracy vs steganography and a zillion other things.
When we get outside technical alignment to things like “AI control”, governance, takeoff speed, timelines, etc., I find that the assumption of LLMs is likewise pervasive, load-bearing, and often unnoticed.
I complain about this from time to time, for example Section 4.2 here, and also briefly here (the bullets near the bottom after “Yeah some examples would be:”).
I agree with the claim that the techniques and insights for alignment that are usually considered are not conditional on LLMs specifically, including my own plan for AI alignment.
(cross-posted from https://x.com/BogdanIonutCir2/status/1842914635865030970)
maybe there is some politically feasible path to ML training/inference killswitches in the near-term after all: https://www.datacenterdynamics.com/en/news/asml-adds-remote-kill-switch-to-tsmcs-euv-machines-in-case-china-invades-taiwan-report/
I recently gave a talk (slides) on some thoughts about what automating AI safety research might look like.
Some [earlier versions] of the ideas there were developed during my Astra Fellowship Winter ’24 with @evhub and through related conversations in Constellation.
On an apparent missing mood—FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely
Automated AI safety R&D could results in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post):
Despite this promise, we seem not to have much knowledge when such automated AI safety R&D might happen, in which order the relevant capabilities might appear (including compared to various dangerous capabilities), etc. AFAICT. we don’t seem to be trying very hard either, neither at prediction, nor at elicitation of such capabilities.
One potential explanation I heard for this apparent missing mood of FOMO on AI automated AI safety R&D is the idea / worry that the relevant automated AI safety R&D capabilities would only appear when models are already / too close to being dangerous. This seems plausible to me, but would still seems like a strange way to proceed given the uncertainty.
For contrast, consider how one might go about evaluating dangerous capabilities and trying to anticipate them ahead of time. From GDM’s recent Introducing the Frontier Safety Framework:
I see no reason why, in principle, a similar high-level approach couldn’t be analogously taken for automated AI safety R&D capabilities, yet almost nobody seems to be working on this, at least publicly (in fact, I am currently working on related topics; I’m still very surprised by the overall neglectedness, made even more salient by current events).
My main vibe is:
AI R&D and AI safety R&D will almost surely come at the same time.
Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
Seems like probably the modal scenario to me too, but even limited exceptions like the one you mention seem to me like they could be very important to deploy at scale ASAP, especially if they could be deployed using non-x-risky systems (e.g. like current ones, very bad at DC evals).
This seems good w.r.t. automated AI safety potentially ‘piggybacking’, but bad for differential progress.
Sure, though wouldn’t this suggest at least focusing hard on (measuring / eliciting) what might not come at the same time?
Why think this is important to measure or that this already isn’t happening?
E.g., on the current model organism related project I’m working on, I automate inspecting reasoning traces in various ways. But I don’t feel like there is any particularly interesting thing going on here which is important to track (e.g. this tip isn’t more important than other tips for doing LLM research better).
Intuitively, I’m thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this ‘race’, corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to aligning / controlling superintelligence).
W.r.t. measurement, I think it would be good orthogonally to whether auto AI safety R&D is already happening or not, similarly to how e.g. evals for automated ML R&D seem good even if automated ML R&D is already happening. In particular, the information of how successful auto AI safety R&D would be (and e.g. what the scaling curves look like vs. those for DCs) seems very strategically relevant to whether it might be feasible to deploy it at scale, when that might happen, with what risk tradeoffs, etc.
To get somewhat more concrete, the Frontier Safety Framework report already proposes a (somewhat vague) operationalization for ML R&D evals, which (vagueness-permitting) seems straightforward to translate into an operationalization for automated AI safety R&D evals:
(crossposted from X/twitter)
Epoch is one of my favorite orgs, but I expect many of the predictions in https://epochai.org/blog/interviewing-ai-researchers-on-automation-of-ai-rnd to be overconservative / too pessimistic. I expect roughly a similar scaleup in terms of compute as https://x.com/peterwildeford/status/1825614599623782490… - training runs ~1000x larger than GPT-4′s in the next 3 years—and massive progress in both coding and math (e.g. along the lines of the medians in https://metaculus.com/questions/6728/ai-wins-imo-gold-medal/… https://metaculus.com/questions/12467/ai-wins-ioi-gold-medal/…), which seem to me like some of the most important domains for automated ML research, both in terms of prerequisite capabilities, and for transfer reasons. Furthermore, along the lines of previous Epoch analysis in https://epochai.org/blog/trading-off-compute-in-training-and-inference…, I expect large chunks of ML research (e.g. post-training) to be differentially automatable by spending more inference compute, because they have relatively accurate and cheap proxy feedback signals (e.g. accuracy on various benchmarks). I also expect the ability to generate verifiable synthetic code + math data to train on to also contribute to math/code/reasoning capabilities more broadly (as is already rumored about Q*/Strawberry). And finally, I wouldn’t be too surprised if even the fuzzier parts of ML research workflows like ideation might be at least somewhat automatable, along the lines of https://sakana.ai/ai-scientist/, especially for ML research which takes relatively little compute to iterate (e.g. large parts of post-training).
(cross-posted from X/twitter)
The already-feasibility of https://sakana.ai/ai-scientist/ (with basically non-x-risky systems, sub-ASL-3, and bad at situational awareness so very unlikely to be scheming) has updated me significantly on the tractability of the alignment / control problem. More than ever, I expect it’s gonna be relatively tractable (if done competently and carefully) to safely, iteratively automate parts of AI safety research, all the way up to roughly human-level automated safety research (using LLM agents roughly-shaped like the AI scientist, with humans in the loop; in a control framework https://lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled…, ideally). Add some decent regulation (especially vs. uncontrolled intelligence explosions using AI scientists inside the largest AI labs) + paretotopian coordination https://aiprospects.substack.com/p/paretotopian-goal-alignment… (perhaps especially at the international level) and I think we might have all the basic ingredients of a ‘pivotal act’ in the best sense of the term. (Easier said than done, though).
I’d love to see at least several Sakana-like AI safety orgs, focused on using LLM agents to automate / strongly augment various parts of safety research. And ideally putting those tools (technically agents) in the hands of hundreds / thousands of (including independent) AI safety researchers. Unsure if the AI safety funding / field-building infrastructure is anywhere near ready for this level of potential scale-up, though. I expect quite a few safety research areas to be prime candidates for early automation / strong augmentation, especially the ones with relatively short and precise feedback loops (in both code and compute), e.g. unlearning, activation/representation engineering (and applications) and large parts of LLM fine-tuning (e.g. https://arxiv.org/abs/2406.08414 for a prototype).
I think currently approximately no one is working on the kind of safety research that when scaled up would actually help with aligning substantially smarter than human agents, so I am skeptical that the people at labs could automate that kind of work (given that they are basically doing none of it). I find myself frustrated with people talking about automating safety research, when as far as I can tell we have made no progress on the relevant kind of work in the last ~5 years.
Can you give some examples of work which you do think represents progress?
My rough mental model is something like: LLMs are trained using very large-scale (multi-task) behavior cloning, so I expect various tasks to be increasingly automated / various data distributions to be successfully, all the way to (in the limit) at roughly everything humans can do, as a lower bound; including e.g. the distribution of what tends to get posted on LessWrong / the Alignment Forum (and agent foundations-like work); especially when LLMs are scaffolded into agents, given access to tools, etc. With the caveats that the LLM [agent] paradigm might not get there before e.g. it runs out of data / compute, or that the systems might be unsafe to use; but against this was where https://sakana.ai/ai-scientist/ was a particularly significant update for me; and ‘iteratively automate parts of AI safety research’ is also supposed to help with keeping systems safe as they become increasingly powerful.
Change my mind: outer alignment will likely be solved by default for LLMs. Brain-LM scaling laws (https://arxiv.org/abs/2305.11863) + LM embeddings as model of shared linguistic space for transmitting thoughts during communication (https://www.biorxiv.org/content/10.1101/2023.06.27.546708v1.abstract) suggest outer alignment will be solved by default for LMs: we’ll be able to ‘transmit our thoughts’, including alignment-relevant concepts (and they’ll also be represented in a [partially overlapping] human-like way).
I think the Corrigibility agenda, framed as “do what I mean, such that I will probably approve of the consequences, not just what I literally say such that our interaction will likely harm my goals” is more doable than some have made it out to be. I still think that there are sufficient subtle gotchas there that it makes sense to treat it as an area for careful study rather than “solved by default, no need to worry”.
Prototype of LLM agents automating the full AI research workflow: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.
And already some potential AI safety issues: ’We have noticed that The AI Scientist occasionally tries to increase its chance of success, such as modifying and launching its own execution script! We discuss the AI safety implications in our paper.
For example, in one run, it edited the code to perform a system call to run itself. This led to the script endlessly calling itself. In another case, its experiments took too long to complete, hitting our timeout limit. Instead of making its code run faster, it simply tried to modify its own code to extend the timeout period.′
Obviously relevant for automating safety research too; see this presentation and this comment for some related thoughts.
Quick take: I think LM agents to automate large chunks of prosaic alignment research should probably become the main focus of AI safety funding / person-time. I can’t think of any better spent marginal funding / effort at this time.
RSPs for automated AI safety R&D require rethinking RSPs
AFAICT, all current RSPs are only framed negatively, in terms of [prerequisites to] dangerous capabilities to be detected (early) and mitigated.
In contrast, RSPs for automated AI safety R&D will likely require measuring [prerequisites to] capabilities for automating [parts of] AI safety R&D, and preferentially (safely) pushing these forward. An early such examples might be safely automating some parts of mechanistic intepretability.
(Related: On an apparent missing mood—FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely)
quick take: Against Almost Every Theory of Impact of Interpretability should be required reading for ~anyone starting in AI safety (e.g. it should be in the AGISF curriculum), especially if they’re considering any model internals work (and of course even more so if they’re specifically considering mech interp)
56% on swebench-lite with repeated sampling (13% above previous SOTA; up from 15.9% with one sample to 56% with 250 samples), with a very-below-SOTA model https://arxiv.org/abs/2407.21787; anything automatically verifiable (large chunks of math and coding) seems like it’s gonna be automatable in < 5 years.
The finding on the differential importance of verifiability also seems in line with the findings from Trading Off Compute in Training and Inference.
(epistemic status: quick take, as the post category says)
Browsing though EAG London attendees’ profiles and seeing what seems like way too many people / orgs doing (I assume dangerous capabilities) evals. I expect a huge ‘market downturn’ on this, since I can hardly see how there would be so much demand for dangerous capabilities evals in a couple years’ time / once some notorious orgs like the AISIs build their sets, which many others will probably copy.
While at the same time other kinds of evals (e.g. alignment, automated AI safety R&D, even control) seem wildly neglected.
There’s still lots and lots of demand for regular capability evaluation, as we keep discovering new issues or LLMs keep tearing through them and rendering them moot, and the cost of creating a meaningful dataset like GPQA keeps skyrocketing (like ~$1000/item) compared to the old days where you could casually Turk your way to questions LLM would fail (like <$1/item). Why think that the dangerous subset would be any different? You think someone is going to come out with a dangerous-capabilities eval in the next year and then that’s it, it’s done, we’ve solved dangerous-capabilities eval, Mission Accomplished?
If it’s well designed and kept private, this doesn’t seem totally implausible to me; e.g. how many ways can you evaluate cyber capabilities to try to asses risks of weights exfiltration or taking over the datacenter (in a control framework)? Surely that’s not an infinite set.
But in any case, it seems pretty obvious that the returns should be quickly diminishing on e.g. the 100th set of DC evals vs. e.g. the 2nd set of alignment evals / 1st set of auto AI safety R&D evals.
It’s not an infinite set and returns diminish, but that’s true of regular capabilities too, no? And you can totally imagine new kinds of dangerous capabilities; every time LLMs gain a new modality or data source, they get a new set of vulnerabilities/dangers. For example, once Sydney went live, you had a whole new kind of dangerous capability in terms of persisting knowledge/attitudes across episodes of unrelated users by generating transcripts which would become available by Bing Search. This would have been difficult to test before, and no one experimented with it AFAIK. But after seeing indirect prompt injections in the wild and possible amplification of the Sydney personae, now suddenly people start caring about this once-theoretical possibility and might start evaluating it. (This is also a reason why returns don’t diminish as much, because benchmarks ‘rot’: quite aside from intrinsic temporal drift and ceiling issues and new areas of dangers opening up, there’s leakage, which is just as relevant to dangerous capabilities as regular capabilities—OK, you RLHFed your model to not provide methamphetamine recipes and this has become a standard for release, but it only works on meth recipes and not other recipes because no one in your org actually cares and did only the minimum RLHF to pass the eval and provide the execs with excuses, and you used off-the-shelf preference datasets designed to do the minimum, like Facebook releasing Llama… Even if it’s not leakage in the most literal sense of memorizing the exact wording of a question, there’s still ‘meta-leakage’ of overfitting to that sort of question.)
To expand on the point of benchmark rot, as someone working on dangerous capabilities evals… For biorisk specifically, one of the key things to eval is if the models can correctly guess the results of unpublished research. As in, can it come up with plausible hypotheses, accurately describe how to test those hypotheses, and make a reasonable guess at the most likely outcomes? Can it do these things at expert human level? superhuman level?
The trouble with this is that the frontier of published research keeps moving forward, so evals like this go out of date quickly. Nevertheless, such evals can be very important in shaping the decisions of governments and corporations.
I do agree that we shouldn’t put all our focus on dangerous capabilities evals at the expense of putting no focus on other kinds of evals (e.g. alignment, automated AI safety R&D, even control). However, I think a key point is that the models are dangerous NOW. Alignment, safety R&D, and control are, in some sense, future problems. Misuse is a present and growing danger, getting noticeably worse with every passing month. A single terrorist or terrorist org could wipe out human civilization today, killing >90% of the population with less than $500k funding (potentially much less if they have access to a well-equipped lab, and clever excuses ready for ordering suspicious supplies). We have no sufficient defenses. This seems like an urgent and tractable problem.
Urgent, because the ceiling on uplift is very far away. Models have the potential to make things much much worse than they currently are.
Tractable, because there are relatively cheap actions that governments could take to slow this increase of risk if they believed in the risks.
For what it’s worth, I try to spend some of my time thinking about these other types of evals also. And I would recommend that those working on dangerous capabilities evals spend at least a little time and thought on the other problems.
Another aspect of the problem
A lot of people seem to frequently be ‘trend following’ rather than ‘trend setting’ (by thinking original thoughts for themselves, doing their own research and coming to their own well-formed opinions). If those ‘trend followers’ are also not super high impact thinkers, maybe it’s ok if they’re just doing the obvious things?
What’s DC?
*dangerous capabilities; will edit
Epistemic status: at least somewhat rant-mode.
I find it pretty ironic that many in AI risk mitigation would make asks for if-then committments/RSPs from the top AI capabilities labs, but they won’t make the same asks for AI safety orgs/funders. E.g.: if you’re an AI safety funder, what kind of evidence (‘if’) will make you accelerate how much funding you deploy per year (‘then’)?
One of these types of orgs is developing a technology with the potential to kill literally all of humanity. The other type of org is funding research that if it goes badly mostly just wasted their own money. Of course the demands for legibility and transparency should be different.
News about an apparent shift in focus to inference scaling laws at the top labs: https://www.reuters.com/technology/artificial-intelligence/openai-rivals-seek-new-path-smarter-ai-current-methods-hit-limitations-2024-11-11/
I think the journalists might have misinterpreted Sutskever, if the quote provided in the article is the basis for the claim about plateauing:
What he’s likely saying is that there are new algorithmic candidates for making even better use of scaling. It’s not that scaling LLM pre-training plateaued, but rather other things became available that might be even better targets for scaling. Focusing on these alternatives could be more impactful than focusing on scaling of LLM pre-training further.
He’s also currently motivated to air such implications, since his SSI only has $1 billion, which might buy a 25K H100s cluster, while OpenAI, xAI, and Meta recently got 100K H100s clusters (Google and Anthropic likely have that scale of compute as well, or will imminently).
If base model scaling has indeed broken down, I wonder how this manifests. Does the Chinchilla scaling law no longer hold beyond a certain size? Or does it still hold, but reduction in prediction loss no longer goes along with a proportional increase in benchmark performance? The latter could mean the quality of the (largely human generated) training data is the bottle neck.
(cross-posted from https://x.com/BogdanIonutCir2/status/1844787728342245487)
the plausibility of this strategy, to ‘endlessly trade computation for better performance’ and then have very long/parallelized runs, is precisely one of the scariest aspects of automated ML; even more worrying that it’s precisely what some people in the field are gunning for, and especially when they’re directly contributing to the top auto ML scaffolds; although, everything else equal, it might be better to have an early warning sign, than to have a large inference overhang: https://x.com/zhengyaojiang/status/1844756071727923467
(still) speculative, but I think the pictures of Shard Theory, activation engineering and Simulators (and e.g. Bayesian interpretations of in-context learning) are looking increasingly similar: https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed?commentId=qX4k7y2vymcaR6eio
https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#SfPw5ijTDi6e3LabP
on unsupervised learning/clustering in the activation space of multiple systems as a potential way to deal with proxy problems when searching for some concepts (e.g. embeddings of human values): https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default#8CngPZyjr5XydW4sC
‘Krenn thinks that o1 will accelerate science by helping to scan the literature, seeing what’s missing and suggesting interesting avenues for future research. He has had success looping o1 into a tool that he co-developed that does this, called SciMuse. “It creates much more interesting ideas than GPT-4 or GTP-4o,” he says.’ (source; related: current underelicitation of auto ML capabilities)
Related: https://www.lesswrong.com/posts/fdCaCDfstHxyPmB9h/vladimir_nesov-s-shortform?commentId=2ZRSnZEQDbWzsZA3M
https://www.lesswrong.com/posts/MEBcfgjPN2WZ84rFL/o-o-s-shortform?commentId=QDEvi8vQkbTANCw2k
I’ve been thinking hard about what my next step should be, after my job applications being turned down again by various safety orgs and Anthropic. Now it seems clear to me. I have a vision of how I expect an RSI process to start, using LLMs to mine testable hypotheses from existing published papers.
I should just put my money where my mouth is, and try to build the scaffolding for this. I can then share my attempts with someone at Anthropic. If I’m wrong, I will be wasting my time and savings. If I’m right, I might be substantially helping the world. Seems like a reasonable bet.
Alternately, collaborating/sharing with e.g. METR or UK AISI auto ML evals teams might be interesting. Maybe even Pallisade or similar orgs from a ‘scary demo’ perspective? @jacquesthibs might also be interested. I might also get to work on this or something related, depending on how some applications go.
I also expect Sakana, Jeff Clune’s group and some parts of the open-source ML community will try to push this, but I’m more uncertain at least in some of these cases about the various differential acceleration tradeoffs.
This is what I’ve been trying to tell people for the past couple years. There is undigested useful info and hypotheses buried in noise amidst published academic papers. I call this an ‘innovation overhang’. The models don’t need to be smart enough to come up with ideas, just smart enough to validate/find them admist the noise and then help set up experiments to test them.
(crossposted from https://x.com/BogdanIonutCir2/status/1840775662094713299)
I really wish we’d have some automated safety research prizes similar to https://aimoprize.com/updates/2024-09-23-second-progress-prize…. Some care would have to be taken to not advance capabilities [differentially], but at least some targeted areas seem quite robustly good, e.g. …https://multimodal-interpretability.csail.mit.edu/maia/.
It might be interesting to develop/put out RFPs for some benchmarks/datasets for unlearning of ML/AI knowledge (and maybe also ARA-relevant knowledge), analogously to WMDP for CBRN. This might be somewhat useful e.g. in a case where we might want to use powerful (e.g. human-level) AIs for cybersecurity, but we don’t fully trust them.
From Understanding and steering Llama 3:
Seems like it would be great if inference scaling laws worked for this particular application.
‘ChatGPT4 generates social psychology hypotheses that are rated as original as those proposed by human experts’ https://x.com/BogdanIonutCir2/status/1836720153444139154
I think it’s quite likely we’re already in crunch time (e.g. in a couple years’ time we’ll see automated ML accelerating ML R&D algorithmic progress 2x) and AI safety funding is *severely* underdeployed. We could soon see many situations where automated/augmented AI safety R&D is bottlenecked by lack of human safety researchers in the loop. Also, relying only on the big labs for automated safety plans seems like a bad idea, since the headcount of their safety teams seems to grow slowly (and I suspect much slower than the numbers of safety researchers outside the labs). Related: https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/.
A list of rough ideas of things I find potentially promising to do/fund:
- RFP(s) for control agenda (e.g. for the long list here https://lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled#Appendix__A_long_list_of_control_techniques)
- Scale up and decentralize some of the grantmaking, e.g. through regrantors
RFP(s) for potentially very scalable agendas, e.g. applying/integrating automated research to various safety research agendas
- RFPs / direct funding for coming up with concrete plans for what to do (including how to deploy funding) in very short timelines (e.g. couple of years); e.g. like https://sleepinyourhat.github.io/checklist/, but ideally made even more concrete (to the degree it’s not info-hazardous)
- Offer more funding to help scale up MATS, Astra, etc. both in terms of number of mentors and mentees/mentor (if mentors are up for it); RFPs for building more similar programs
- RFPs for entrepreneurship/building scalable orgs, and potentially even incubators for building such orgs, e.g. https://catalyze-impact.org/apply
- Offer independent funding, but no promise of mentorship to promising-enough candidates coming out of field-building pipelines (e.g. MATS, Astra, AGISF, ML4Good, ARENA, AI safety camp, Athena).
Inference scaling laws should be a further boon to automated safety research, since they add another way to potentially spend money very scalably https://youtu.be/-aKRsvgDEz0?t=9647
I mean, in the same way as it is a boon to AI capabilities research? How is this differentially useful?
I suspect (though with high uncertainty) there are some factors differentially advantaging safety research, especially prosaic safety research (often in the shape of fine-tuning/steering/post-training), which often requires OOMs less compute and more broadly seems like it should have faster iteration cycles; and likely even ‘human task time’ (e.g. see figures in https://metr.github.io/autonomy-evals-guide/openai-o1-preview-report/) probably being shorter for a lot of prosaic safety research, which makes it likely it will on average be automatable sooner. The existence of cheap, accurate proxy metrics might point the opposite way, though in some cases there seem to exist good, diverse proxies—e.g. Eight Methods to Evaluate Robust Unlearning in LLMs. More discussion in the extra slides of this presentation.
Related, I expect the delay from needing to build extra infrastructure to train much larger LLMs will probably differentially affect capabilities progress more, by acting somewhat like a pause/series of pauses; which it would be nice to exploit by enrolling more safety people to try to maximally elicit newly deployed capabilities—and potentially be uplifted—for safety research. (Anecdotally, I suspect I’m already being uplifted at least a bit as a safety researcher by using Sonnet.)
Also, I think it’s much harder to pause/slow down capabilities than to accelerate safety, so I think more of the community focus should go to the latter.
And for now, it’s fortunate that inference scaling means CoT and similarly differentially transparent (vs. model internals) intermediate outputs, which makes it probably a safer way of eliciting capabilities.
The intelligence explosion might be quite widely-distributed (not just inside the big labs), especially with open-weights LMs (and the code for the AI scientist is also open-sourced):
https://x.com/RobertTLange/status/1829104918214447216
I think that would be really bad for our odds of surviving and avoiding a permanent suboptimal dictatorship, if the multipolar scenario continues up until AGI is fully RSI capable. That isn’t a stable equilibrium; the most vicious first mover tends to win and control the future. Some 17yo malcontent will wipe us out or become emperor for their eternal life. More logic in If we solve alignment, do we all die anyway? and the discussion there.
I think that argument will become so apparent that that scenario won’t be allowed to happen.
Having merely capable AGI widely available would be great for a little while.
I’m uncertain about all this, but here are some quick takes.
With respect to technical intent alignment, I think we’re very lucky that a lot of safety research will probably be automatable by non-x-risky systems (sub-ASL-3, very unlikely to be scheming because bad at prerequisites like situational awareness, often surprisingly transparent because they use CoT, tools, etc.). So I think we could be in a really good position, if we actually tried hard to use such systems for automated safety research, (for now it doesn’t seem to me like we’re trying all that hard as a community).
I’m even more uncertain about the governance side, especially about what should be done. I think open-weights LMs widely distributed intelligence explosions are probably really bad, so hopefully at least very powerful LMs don’t get open-weighted. Beyond this, though, I’m more unsure about more multipolar vs. more unipolar scenarios, given e.g. the potential lack of robustness of single points of failure. I’m somewhat hopeful that nation-level actors impose enough constraints/regulation at the national level, and then something like https://aiprospects.substack.com/p/paretotopian-goal-alignment happens at the international level. We might also just get somewhat lucky that compute constraints + economic and security incentives might mean that there never are more than e.g. 20 actors with ( at least direct, e.g. weights) access to very strong superintelligence.
I feel like 20-100 actors feels like a reasonable amount to coordinate on treaties. I think 300 starts to worry me that there’d be some crazy defector in the mix that takes risks which destroy themselves and everyone else. Just 2 or 3 actors, and I worry that there will be weird competitive tensions that make it hard to come to a settlement. I dunno, maybe I’m wrong about that, but it’s how I feel.
I’ve been writing a bit about some ideas around trying to establish a ‘council of guardians’, who are the major powers in the world. They would agree to mutual inspections and to collaborate on stopping the unauthorized development of rogue AI and self-replicating weapons.
I’ve been thinking about similar international solutions, so I look forward to seeing your thoughts on the matter.
My major concern is sociopathic people gaining the reins of power of just one of those AGIs, and defecting against that council of guardians. I think sociopaths are greatly overrepresented among powerful people; they care less about the downsides of having and pursuing power aggressively.
That’s why I’d think even 20 RSI-capable human-directed AGIs wouldn’t be stable for more than a decade.
Yeah, I see it as sort of a temporary transitional mode for humanity. I also don’t think it would be stable for long. I might give it 20-30 years, but I would be skeptical about it holding for 50 years.
I do think that even 10 years more to work on more fundamental solutions to the AGI transition would be hugely valuable though!
I have been attempting at least to imagine how to design a system assuming that all the actors will be selfish and tempted to defect (and possibly sociopathic, as power-holders sometimes are or become), but prevented from breaking the system. Defection-resistant mechanisms, where you just need a majority of the council to not defect in a given ‘event’ in order for them to halt and punish the defectors. And that hopefully making it obvious that this was the case, and that defection would get noticed and punished, would prevent even sociopathic power-holders from defecting.
This seems possible to accomplish, if the system is designed such that catching and punishing an attempt at defection has benefits for the enforcers which give higher expected value in their minds than the option of deciding to also defect once they detected someone else defecting.
Seems like a good problem to largely defer to AI though (especially if we’re assuming alignment in the instruction following sense), so maybe not the most pressing.
Unless there’s important factors about ‘order of operations’. By the time we have a powerful enough AI to solve this for us, it could be that someone is already defecting by using that AI to pursue recursive self-improvement at top speed…
I think that that is probably the case. I think we need to get the Council of Guardians in place and preventing defection before it’s too late, and irreversibly bad defection has already occurred.
I am unsure of exactly where the thresholds are, but I am confident that nobody else should be confident that there aren’t any risks! Our uncertainty should cause us to err on the side of putting in safe governance mechanisms ASAP!
Interesting automated AI safety R&D demo:
’In this release:
We propose and run an LLM-driven discovery process to synthesize novel preference optimization algorithms.
We use this pipeline to discover multiple high-performing preference optimization losses. One such loss, which we call Discovered Preference Optimization (DiscoPOP), achieves state-of-the-art performance across multiple held-out evaluation tasks, outperforming Direct Preference Optimization (DPO) and other existing methods.
We provide an initial analysis of DiscoPOP, to discover surprising and counterintuitive features.
We open-source not only the tuned model checkpoints and discovered objective functions but also the codebase for running the discovery process itself on GitHub and HuggingFace.′
’Our general LLM-driven discovery approach can be grouped into three steps:
First, we prompt an LLM with an initial task and problem description. We optionally add examples or previous evaluations and their recorded performance to the initial prompt.
The LLM is tasked with outputting a hypothesis, a name for the new method, and a code implementation of their method. We use the code in an inner loop training run and store the downstream performance of interest.
Finally, we update the context of the LLM with the results from the newly evaluated performance.′
https://sakana.ai/llm-squared/
https://arxiv.org/abs/2406.08414
Looking at how much e.g. the UK (>300B$) or the US (>1T$) have spent on Covid-19 measures puts in perspective how little is still being spent on AI safety R&D. I expect fractions of those budgets (<10%), allocated for automated/significantly-augmented AI safety R&D, would obsolete all previous human AI safety R&D.
1. Unfortunately, I think it’s harder to convert government funding into novel research than one might expect. I think there are a limited number of competent thinkers who are sufficiently up to speed on the problem to contribute within the short remaining time before catastrophe. I do agree that more government funding would help a lot, and that I’d personally love to be given some funding! I do also agree that it would help a huge amount in the long term (> 20 years). In the short term (< 3 years) however, I don’t think that even a trillion dollars of government funding would result in AI safety R&D progress sufficient to exceed all previous human AI safety R&D. I also think that there’s decreasing returns to funding, as the available researchers get sufficiently motivated to switch to the topic, and have sufficient resources to supply their labs with compute and assistants. I think that in the current world you probably don’t get much for your 11th trillion. So yeah, I’d definitely endorse spending $10 trillion on AI safety R&D (although I do think there are ways this could be implemented which would be unhelpful or even counter-productive).
2. I think that exceeding previous AI safety R&D is very different from obsoleting it. Building on a foundation and reaching greater heights doesn’t make the foundation worthless. If you do think that the foundation is worthless, I’d be curious to hear your arguments, but that seems like a different train of thought entirely.
3. I think that there will be a critical period where there is sufficiently strong AI that augmented/automated AI safety R&D will be able to rapidly eclipse the existing body of work. I don’t think we are there yet, and I wouldn’t choose to accelerate AI capabilities timelines further to get us there sooner. I do think that having AI safety labs well-supplied with funding and compute is important, but I don’t think that any amount of money or compute currently buys the not-yet-existing AI research assistant level AI which could result in the start of the critical automated-research period. I worry that this level of AI capability will also enable recursive self-improvement of AI capabilities research, so I think that’s going to be a very dangerous time indeed.
Language model agents for interpretability (e.g. MAIA, FIND) seem to be making fast progress, to the point where I expect it might be feasible to safely automate large parts of interpretability workflows soon.
Given the above, it might be high value to start testing integrating more interpretability tools into interpretability (V)LM agents like MAIA and maybe even considering randomized controlled trials to test for any productivity improvements they could already be providing.
For example, probing / activation steering workflows seem to me relatively short-horizon and at least somewhat standardized, to the point that I wouldn’t be surprised if MAIA could already automate very large chunks of that work (with proper tool integration). (Disclaimer: I haven’t done much probing / activation steering hands-on work [though I do follow such work quite closely and have supervised related projects], so my views here might be inaccurate).
While I’m not sure I can tell any ‘pivotal story’ about such automation, if I imagine e.g. 10x more research on probing and activation steering / year / researcher as a result of such automation, it still seems like it could be a huge win. Such work could also e.g. provide much more evidence (either towards or against) the linear representation hypothesis.
Very plausible view (though doesn’t seem to address misuse risks enough, I’d say) in favor of open-sourced models being net positive (including for alignment) from https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/:
‘While the labs certainly perform much valuable alignment research, and definitely contribute a disproportionate amount per-capita, they cannot realistically hope to compete with the thousands of hobbyists and PhD students tinkering and trying to improve and control models. This disparity will only grow larger as more and more people enter the field while the labs are growing at a much slower rate. Stopping open-source ‘proliferation’ effectively amounts to a unilateral disarmament of alignment while ploughing ahead with capabilities at full-steam.
Thus, until the point at which open source models are directly pushing the capabilities frontier themselves then I consider it extremely unlikely that releasing and working on these models is net-negative for humanity’
‘Much capabilities work involves simply gathering datasets or testing architectures where it is easy to utilize other closed models referenced in pappers or through tacit knowledge of employees. Additionally, simple API access to models is often sufficient to build most AI-powered products rather than direct access to model internals. Conversely, such access is usually required for alignment research. All interpretability requires access to model internals almost by definition. Most of the AI control and alignment techniques we have invented require access to weights for finetuning or activations for runtime edits. Almost nothing can be done to align a model through access to the I/O API of a model at all. Thus it seems likely to me that by restricting open-source we differetially cripple alignment rather than capabilities. Alignment research is more fragile and dependent on deep access to models than capabilities research.’
Current open source models are not themselves any kind of problem. Their availability accelerates timelines, helps with alignment along the way. If there is no moratorium, this might be net positive. If there is a moratorium, it’s certainly net positive, as it’s the kind of research that the moratorium is buying time for, and it doesn’t shorten timelines because they are guarded by the moratorium.
It’s still irreversible proliferation even when the impact is positive. The main issue is open source as an ideology that unconditionally calls for publishing all the things, and refuses to acknowledge the very unusual situations where not publishing things is better than publishing things.
I believe from my work on dangerous capabilities evals that current open source models do provide some small amount of uplift to bad actors. This uplift is increasing much greater than linearly with each new more-capable open source model that is released. If we want to halt this irreversible proliferation before it gets so far that human civilization gets wiped out, we need to act fast on it.
Alignment research is important, but misalignment is not the only threat we face.
One thing that comes to mind is test time compute, and Figure 3 of Language Monkeys paper is quite concerning, where even Pythia-70M (with an “M”) is able to find signal on problems that at first glance are obviously impossible for it to make heads or tails of (see also). If there is an algorithmic unlock, a Llama-3-405B (or Llama-4) might suddenly get much more capable if fed a few orders of magnitude more inference compute than normal. So the current impression about model capabilities can be misleading about what they eventually enable, using future algorithms and still affordable amounts of compute.
Excellent point Vladimir. My team has been thinking a lot about this issue. What if somebody leaked the latest AlphaFold, and instructions on how to make good use of it? If you could feed the instructions into an existing open-source model, and get functional python code out to interact with the private AlphaFold API you set up… That’s a whole lot more dangerous than an LLM alone!
As the whole space of ‘biological design tools’ (h/t Anjali for this term to describe the general concept) gets more capable and complex, the uplift from an LLM that can help you navigate and utilize these tools gets more dangerous. A lot of these computational tools are quite difficult to use effectively for a layperson, yet an AI can handle them fairly easily if given the documentation.
Hmm, I’d be curious if you can share more, especially on the gradient of the uplift with new models.
Sure. My specific work is on biorisk evals. See WMDP.ai
Closed API models leak a bit of biorisk info, but mainly aren’t that helpful for creating bioweapons (so far as I am able to assess).
Uncensored open-weight models are a whole different ball game. They can be made to be actively quite harm-seeking / terrorist-aligned, and also quite technically competent, such that they make researching and engaging in bioweapons creation substantially easier. The more capable the models get, the more reliable their information is, the higher quality their weapon design ideation is, the more wetlab problems they are able to help you troubleshoot, the more correct their suggestions for getting around government controls and detection methods… And so on. I am saying that this capability is increasing non-linearly in terms of expected risk/harms as open-weight model capabilities increase.
Part of this is that they cross some accuracy/reliability threshold on increasingly tricky tasks where you can at least generate a bunch of generations and pick the most agree-upon idea. Whereas, if you see that there is correct advice given, but only around 5% of the time, then you can be pretty sure that someone who didn’t know that that was the correct advice would know how to pick out the correct advice. As soon as the most frequent opinion is correct, you are in a different sort of hazard zone (e.g. imagine if a single correct idea came up 30% of the time to a particular question, and all the other ideas, most of them incorrect, came up 5% of the time each).
Also it matters a lot whether the model routinely fails at ‘critical fail steps’ versus ‘cheap retry steps’. There are a lot of tricky wetlab steps where you can easily fail, and the model can provide only limited help, but the reason for failure is clear after a bit of internet research and doesn’t waste much in the way of resources. Such ‘noncritical failures’ are very different from ‘critical failures’ such as failing to obfuscate your internet purchases such that a government agency detects your activity and decides to investigate you more closely.
Thanks! How optimistic/excited would you be about research in the spirit of Tamper-Resistant Safeguards for Open-Weight LLMs, especially given that banning open-weight models seems politically unlikely, at least for now?
Extremely excited about the idea of such research succeeding in the near future! But skeptical that it will succeed in time to be at all relevant. So my overall expected value for that direction is low.
Also, I think there’s probably a very real risk that the bird has already flown the coop on this. If you can cheaply modify existing open-weight models to be ‘intent-aligned’ with terrorists, and to be competent at using scaffolding that you have built around ‘biological design tools’… then the LLM isn’t really a bottleneck anymore. The irreversible proliferation has occurred already. I’m not certain this is the case, but I’d give it about 75%.
So then you need to make sure that better biological design tools don’t get released, and that more infohazardous virology papers don’t get published, and that wetlab automation tech doesn’t get better, and… the big one.… that nobody releases an open-weight LLM so capable that it can successfully create tailor-made biological design tools. That’s a harder thing to censor out of an LLM than getting it to directly not help with biological weapons! Creation of biological design tools touches on a lot more things, like its machine learning knowledge and coding skill. What exactly do you censor to make a model helpful at building purely-good tools but not at building tools which have dual-use?
Basically, I think it’s a low-return area entirely. I think humanity’s best bet is in generalized biodefenses, plus an international ‘Council of Guardians’ which use strong tool-AI to monitor the entire world and enforce a ban on:
a) self-replicating weapons (e.g. bioweapons, nanotech)
b) unauthorized recursive-self-improving AI
Of these threats only bioweapons are currently at large. The others are future threats.
Contrastive methods could be used both to detect common latent structure across animals, measuring sessions, multiple species (https://twitter.com/LecoqJerome/status/1673870441591750656) and to e.g. look for which parts of an artificial neural network do what a specific brain area does during a task assuming shared inputs (https://twitter.com/BogdanIonutCir2/status/1679563056454549504).
And there are theoretical results suggesting some latent factors can be identified using multimodality (all the following could be intepretable as different modalities—multiple brain recording modalities, animals, sessions, species, brains-ANNs), while being provably unindentifiable without the multiple modality—e.g. Identifiability Results for Multimodal Contrastive Learning (and results on nonlinear ICA in single-modal vs. multi-modal settings reviewed in section 2.1). This might a way to bypass single-model interpretability difficulties, by e.g. ‘comparing’ to brains or to other models.
Example of potential cross-species application: empathy mechanisms seem conserved across species Empathy as a driver of prosocial behaviour: highly conserved neurobehavioural mechanisms across species. Example of brain-ANN applications: ‘matching’ to modular brain networks, e.g. language network—ontology-relevant, non-agentic (e.g. The universal language network: A cross-linguistic investigation spanning 45 languages and 12 language families) or Theory of Mind network—could be very useful for detecting lying-relevant circuits (e.g. Single-neuronal predictions of others’ beliefs in humans).
Examples of related interpretability across models—Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences between Pretrained Generative Models, across brain measurement modalities—Learnable latent embeddings for joint behavioural and neural analysis
, across animals and brain-ANN—Quantifying stimulus-relevant representational drift using cross-modality contrastive learning.
Potentially also relevant—Contrastive Preference Learning: Learning from Human Feedback without RL, TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space, Bridging Associative Memory and Probabilistic Modeling.
Some related cool-seeming recent results: All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling, Cross-Entropy Is All You Need To Invert the Data Generating Process.
fragment of chat with claude on All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling and Shared computational principles for language processing in humans and deep language models (my prompt first):
‘in ‘all or none...’ they mention ‘the general form of next-token predictors used in our analysis.‘; if we assume that the language network (in biological brains) is indeed a next-token predictor (as e.g. the next-word prediction results in ‘shared principles...’ suggest), shouldn’t this suggest that the linear properties should then also apply to biological brains, not just LLMs?
This is a very insightful observation! Let’s analyze this:
The key logical chain would be:
1. The “All or None” paper’s results apply to next-token predictors of the form:
```
p(xt+1|x1:t) = exp(f(x1:t)ᵀg(xt+1))/Z(x1:t)
```
where f maps sequences to embeddings and g maps tokens to unembeddings
2. The “Shared Computational Principles” paper provides evidence that the human brain:
- Engages in next-word prediction before word onset
- Uses context to predict next tokens
- Shows surprise signals consistent with prediction error
- Represents words in ways that can be mapped to vector spaces
3. If the brain’s language network is implementing a next-token predictor of this general form, then theoretically:
- The same mathematical framework should apply
- The same identifiability results should hold
- The same “all or none” properties of linear features should be present
However, there are some important considerations:
1. Implementation Level:
- While the brain might implement this type of predictor at a computational level, the neural implementation could add constraints or properties not captured by the mathematical framework
- The brain might use a more complex or different form of next-token prediction
2. Verification Challenges:
- It’s harder to verify if the brain’s implementation exactly matches this mathematical form
- We can’t directly observe all components like f and g in biological systems
3. Scale/Scope:
- The brain might implement multiple prediction systems operating at different scales
- The mathematical framework might apply to some but not all aspects of neural language processing
So while your reasoning seems sound—if the brain implements this type of next-token predictor, the mathematical results should apply—verifying this empirically would be challenging. It’s a fascinating theoretical bridge between the mathematical analysis and biological implementation.′
OTOH, Challenges in Explaining Representational Similarity through Identifiability presents good counter-points. Also, Position: Understanding LLMs Requires More Than Statistical Generalization.
Examples of reasons to expect (approximate) convergence to the same causal world models in various setups: theorem 2 in Robust agents learn causal world models; from Deep de Finetti: Recovering Topic Distributions from Large Language Models: ‘In particular, given the central role of exchangeability in our analysis, this analysis would most naturally be extended to other latent variables that do not depend heavily on word order, such as the author of the document [Andreas, 2022] or the author’s sentiment’ (this assumption might be expected to be approximately true for quite a few alignment-relevant-concepts); results from Victor Veitch: Linear Structure of (Causal) Concepts in Generative AI.
Summary threads of two recent papers which seem like significant evidence in favor of the Simulators view of LLMs (especially after just pretraining): https://x.com/aryaman2020/status/1852027909709382065 https://x.com/DimitrisPapail/status/1844463075442950229
Plausible large 2025 training run FLOP estimates from https://x.com/Jsevillamol/status/1810740021869359239?t=-stzlTbTUaPUMSX8WDtUIg&s=19:
Dario Amodei claims there are current $1 billion training runs. At $2/hour with H100s, this means 2e12 H100-seconds. Assuming 30% utilization and 4e15 FP8 FLOP/s, this is 2e27 FLOPs, 2 OOMs above estimates for the original GPT-4. This corresponds to 200 days with 100K H100s (and 150 megawatts). 100K H100 clusters don’t seem to be built yet, the largest publicly known ones are Meta’s two clusters with 24K H100s each. But it might be possible to train on multiple clusters if the inter-cluster network is good enough.
Edit (20 Jul): These estimates erroneously use the sparse FP8 tensor performance for H100s (4 petaFLOP/s), which is 2 times higher than far more relevant dense FP8 tensor performance (2 petaFLOP/s). But with a Blackwell GPU, the relevant dense FP8 performance is 5 petaFLOP/s, which is close to 4 petaFLOP/s, and the cost and power per GPU within a rack are also similar. So the estimates approximately work out unchanged when reading “Blackwell GPU” instead of “H100″.
Thanks! I do wonder if he might not mean $1 billion total cost (e.g. to buy the hardware); because he also claims a $10 billion run might start in 2025, which seems quite surprising?
The $100 million figure is used in the same sentence for cost of currently deployed models. Original GPT-4 was probably trained on A100s in BF16 (A100s can’t do FP8 faster), which is 6e14 FLOP/s, 7 times less than 4e15 FLOP/s in FP8 from an H100 (there is no change in quality of trained models when going from BF16 to FP8, as long as training remains stable). With A100s in BF16 at 30% utilization for 150 days, you need 9K A100s to get 2e25 FLOPs. Assuming $30K per A100 together with associated infrastructure, the cluster would cost $250 million, but again assuming $2 per hour, the time would only cost $60 million. This is 2022, deployed in early 2023. I expect recent models to cost at least somewhat more, so for early 2024 frontier models $100 million would be solidly cost of time, not cost of infrastructure.
The $1 billion for cost of time suggests ability to train on multiple clusters, and Gemini 1.0 report basically says they did just that. So the $10 billion figure needs to be interpreted as being about scale of many clusters taken together, not individual clusters. The estimate for training on H100s for 200 days says you need 150 megawatts for $1 billion in training time, or 1.5 gigawatts for $10 billion in training time. And each hyperscaler has datacenters that consume 2-3 gigawatts in total (they are much smaller individually) with current plans to double. So at least the OOMs match the $10 billion claim interpreted as cost of training time.
Edit (20 Jul): These estimates erroneously use the sparse FP8 tensor performance for H100s (4 petaFLOP/s), which is 2 times higher than far more relevant dense FP8 tensor performance (2 petaFLOP/s). But with a Blackwell GPU, the relevant dense FP8 performance is 5 petaFLOP/s, which is close to 4 petaFLOP/s, and the cost and power per GPU within a rack are also similar. So the estimates approximately work out unchanged when reading “Blackwell GPU” instead of “H100″.
Decomposability seems like a fundamental assumption for interpretability and condition for it to succeed. E.g. from Toy Models of Superposition:
’Decomposability: Neural network activations which are decomposable can be decomposed into features, the meaning of which is not dependent on the value of other features. (This property is ultimately the most important – see the role of decomposition in defeating the curse of dimensionality.) [...]
The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter (non-superposition and basis-aligned) are properties we believe only sometimes occur.′
If this assumption is true, it seems favorable to the prospects of safely automating large parts of interpretability work (e.g. using [V]LM agents like MAIA) differentially sooner than many other alignment research subareas and than the most consequential capabilities research (and likely before AIs become x-risky, e.g. before they pass many dangerous capabilities evals). For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be much shorter horizon than e.g. coming up with a new conceptual alignment agenda or coming up with a new ML architecture (as well as having much faster feedback loops than e.g. training a SOTA LM using a new architecture).
Some theoretical results might also be relevant here, e.g. Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks.
Related, from Advanced AI evaluations at AISI: May update:
E.g. to the degree typical probing / activation steering work might often involve short 1-hour-horizons, it might be automatable differentially soon; e.g. from Steering GPT-2-XL by adding an activation vector:
Related, from The “no sandbagging on checkable tasks” hypothesis:
Quote from Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie:
Selected fragments (though not really cherry-picked, no reruns) of a conversation with Claude Opus on operationalizing something like Activation vector steering with BCI by applying the methodology of Concept Algebra for (Score-Based) Text-Controlled Generative Models to the model from High-resolution image reconstruction with latent diffusion models from human brain activity (website with nice illustrations of the model).
My prompts bolded:
’Could we do concept algebra directly on the fMRI of the higher visual cortex?
Yes, in principle, it should be possible to perform concept algebra directly on the fMRI signals from the higher visual cortex. The key idea would be to identify subspaces in the fMRI signal space that correspond to specific semantic concepts, and then manipulate these subspaces to control the reconstructed image.
Here’s a hypothetical workflow:
Collect fMRI data while the subject views a large set of images that vary along certain semantic dimensions (e.g., gender, emotion, object category, scene type, etc.).
For each semantic dimension, identify a set of fMRI samples that differ only along that dimension. For example, for the gender dimension, you might compare the fMRI signals for “man” vs “woman”, keeping other aspects of the image content constant.
Use these samples to estimate a “semantic subspace” for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.
To reconstruct an image, first predict the full fMRI signal from the early visual cortex (similar to predicting z in Takagi et al.).
Then, to manipulate a specific semantic concept, project the predicted fMRI signal onto the orthogonal complement of the corresponding semantic subspace (to remove the original semantic information), and add the desired semantic component from a reference fMRI signal (similar to the concept algebra manipulation of z_c).
Finally, use this modified fMRI signal from the higher visual cortex to condition the diffusion model, reconstructing an image with the desired semantic manipulation.
The main challenges would be:
Collecting a sufficiently large and diverse fMRI dataset to robustly estimate the semantic subspaces.
Ensuring that the estimated subspaces are truly orthogonal and capture the desired semantic variation.
Translating the manipulated fMRI signals into a format that can effectively condition the diffusion model.
Nonetheless, this is an intriguing direction that could potentially unite the biological understanding from Takagi et al.‘s work with the interpretable semantic manipulation of concept algebra. It suggests a way to manipulate images directly from brain-space, without the need for text prompts. This could provide insights into how the brain itself represents and manipulates semantic concepts.’
’How might the above be relevant to the AI alignment problem?
The hypothetical approach of performing concept algebra directly on fMRI signals from the higher visual cortex could potentially offer some interesting insights relevant to the AI alignment problem:
Understanding AI systems: By comparing the semantic subspaces identified in fMRI signals with those found in the latent spaces of AI models (like the diffusion model in concept algebra), we could gain insights into how AI systems represent and manipulate semantic concepts, and how this compares to the human brain. This could help in designing AI systems that reason in ways more analogous to humans, which could be beneficial for alignment.
Interpretability and control: The ability to identify and manipulate specific semantic concepts directly in the space of brain activity could provide a more direct and interpretable way of controlling the output of AI systems. If we can align the semantic subspaces of the AI with those of the human brain, it might become possible to control the AI’s outputs by manipulating these subspaces, leading to more interpretable and controllable AI systems.
Value alignment: By understanding how human values and preferences are represented in the brain, and how they relate to the semantic concepts identified through concept algebra, we might gain insights into how to align AI systems with human values. If we can identify the neural subspaces corresponding to human values and ethics, we could potentially use this information to constrain the AI’s outputs to align with these values.
Cognitive modeling: This approach could also contribute to the development of more accurate cognitive models of human reasoning and decision-making. By understanding how semantic concepts are represented and manipulated in the brain, we can build AI systems that more closely mimic human cognition, which could be useful for predicting and understanding human behavior in the context of AI alignment.
Neuro-symbolic AI: The idea of manipulating semantic concepts directly in the space of brain activity is reminiscent of neuro-symbolic AI approaches, which aim to combine the strengths of neural networks and symbolic reasoning. Insights from this fMRI-based concept algebra could potentially inform the development of neuro-symbolic AI systems that are more interpretable and aligned with human cognition.
However, it’s important to note that these are speculative connections, and there would be significant technical and ethical challenges in realizing this approach. Nonetheless, exploring the parallels between the structure of human thought and the structure of AI representations could be a valuable direction for informing AI alignment research.′
’How would this be relevant in the context of this proposal? [note: the proposal being Activation vector steering with BCI]
The proposed concept algebra on fMRI signals could be highly relevant to the goals outlined in this funding proposal. Specifically:
The proposal aims to find mappings between human brain states and the latent representations of language models. Applying concept algebra directly on fMRI signals, as discussed, could provide a more direct and interpretable way to establish these mappings. By identifying semantic subspaces in the fMRI signals and aligning them with subspaces in the language model’s latent space, we could potentially “steer” the language model to exhibit brain states similar to specific people.
The proposal suggests that reward models could be trained to reproduce humanlike brain states during evaluation, to improve their generalization. If concept algebra on fMRI can successfully manipulate semantic concepts in the reconstructed outputs (e.g., text or images), this could provide a way to directly optimize the reward models to produce outputs that match desired brain states.
The proposal aims to advance scientific understanding of analogies between LLM behavior and human behavior. Comparing the semantic subspaces identified in fMRI with those in LLMs, as enabled by concept algebra, could provide insights into these analogies at a more fundamental, representational level.
The proposed fMRI experiments could potentially provide the necessary data for attempting concept algebra on brain signals. The proposal includes plans for designing and implementing fMRI data-collection protocols, recruiting subjects, and administering the experiments. This data could be used to estimate the semantic subspaces needed for concept algebra.
The ML experiments proposed, including the fMRI feature extraction pipeline and architecture modifications, could potentially be adapted to implement concept algebra on the fMRI signals, in addition to the original goal of mapping brain states to LLM representations.
In summary, while the original proposal does not explicitly mention concept algebra, the ideas discussed around applying concept algebra to fMRI signals could potentially enhance and extend the proposal’s goals of connecting human brain states with AI representations, improving reward model generalization, and advancing scientific understanding of human-AI analogies. The proposed fMRI experiments could provide the necessary data, and the planned ML experiments could be extended to include concept algebra on the brain signals.′
’Above you wrote: ‘Use these samples to estimate a “semantic subspace” for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.’ What if instead I just took a large diverse dataset of fMRI of images of men and fMRI of images of women, took the means of the two, substracted them, and used that as the sex direction in concept algebra?
Your proposed approach of taking the difference between the mean fMRI signals for men and women to define the “sex direction” is a great idea and could indeed work as a simple alternative to more complex methods like PCA or LDA.
Here’s why this approach makes sense:
If the fMRI signals for men and women differ primarily along a single direction in the high-dimensional fMRI space, then the difference between their means will point along this direction.
By using a large and diverse dataset, you’re essentially averaging out other sources of variation (e.g., individual differences, noise), leaving only the consistent difference between men and women.
This is analogous to how concept algebra identifies concept directions in the latent space of the diffusion model: by taking the difference between the latent representations of prompts that differ only in the target concept.
To implement this:
Collect a large dataset of fMRI signals for images of men and women, ensuring diversity in other aspects (age, race, emotion, background, etc.).
Compute the mean fMRI signal for men (m_men) and women (m_women) separately.
Compute the “sex direction” as: v_sex = m_women—m_men.
To manipulate the sex of a reconstructed image, you could then project the fMRI signal onto the orthogonal complement of v_sex (to remove the original sex information), and then add or subtract a scaled version of v_sex to introduce the desired sex information.
The main advantages of this approach over PCA or LDA are simplicity and interpretability. It doesn’t require any complex eigendecompositions or model fitting, and the resulting direction has a clear interpretation as the difference between men and women.
Potential limitations to keep in mind:
If there are multiple independent directions along which men and women differ, this approach will only capture their average, not the individual directions.
If there are significant interactions between sex and other factors (e.g., if the sex difference is itself different for different ages or emotions), this simple difference may not fully capture the sex effect.
The approach assumes linearity—that the sex effect is a simple additive shift in fMRI space. If there are nonlinear interactions, this linear shift may not fully capture the effect.
Despite these potential limitations, I think this is a promising and elegantly simple approach to try for implementing concept algebra directly on fMRI signals. It’s a great starting point that could be refined with more sophisticated methods if needed.′
More reasons to think something like the above should work: High-resolution image reconstruction with latent diffusion models from human brain activity literally steers diffusion models using linearly-decoded fMRI signals (see fig. 2); and linear encoding (the inverse of decoding) from the text latents to fMRI also works well (see fig. 6; and similar results in Natural language supervision with a large and diverse dataset builds better models of human high-level visual cortex, e.g. fig. 2). Furthermore, they use the same (Stable Diffusion with CLIP) model used in Concept Algebra for (Score-Based) Text-Controlled Generative Models, which both provides theory and demo empirically activation engineering-style linear manipulations. All this suggests similar Concept Algebra for (Score-Based) Text-Controlled Generative Models—like manipulations would also work when applied directly to the fMRI representations used to decode the text latents c in High-resolution image reconstruction with latent diffusion models from human brain activity.
Turns out, someone’s already done a similar (vector arithmetic in neural space; latent traversals too) experiment in a restricted domain (face processing) with another model (GAN) and it seemed to work: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012058 https://github.com/neuralcodinglab/brain2gan/blob/main/figs_manuscript/Fig12.png https://openreview.net/pdf?id=hT1S68yza7
Also positive update for me on interdisciplinary conceptual alignment being automatable differentially soon; which seemed to me for a long time plausible, since LLMs have ‘read the whole internet’ and interdisciplinary insights often seem (to me) to require relatively small numbers of inferential hops (plausibly because it’s hard for humans to have [especially deep] expertise in many different domains), making them potentially feasible for LLMs differentially early (reliably making long inferential chains still seems among the harder things for LLMs).
More reasons to believe that studying empathy in rats (which should be much easier than in humans, both for e.g. IRB reasons, but also because smaller brains, easier to get whole connectomes, etc.) could generalize to how it works in humans and help with validating/implementing it in AIs (I’d bet one can already find something like computational correlates in e.g. GPT-4 and the correlation will get larger with scale a la https://arxiv.org/abs/2305.11863) https://twitter.com/e_knapska/status/1722194325914964036
Claim of roughly human-level automated multi-turn red-teaming: https://blog.haizelabs.com/posts/cascade/
I expect large parts of interpretability work could be safely automatable very soon (e.g. GPT-5 timelines) using (V)LM agents; see A Multimodal Automated Interpretability Agent for a prototype.
Notably, MAIA (GPT-4V-based) seems approximately human-level on a bunch of interp tasks, while (overwhelmingly likely) being non-scheming (e.g. current models are bad at situational awareness and out-of-context reasoning) and basically-not-x-risky (e.g. bad at ARA).
Given the potential scalability of automated interp, I’d be excited to see plans to use large amounts of compute on it (including e.g. explicit integrations with agendas like superalignment or control; for example, given non-dangerous-capabilities, MAIA seems framable as a ‘trusted’ model in control terminology).
@the gears to ascension I see you reacted “10%” to the phrase “while (overwhelmingly likely) being non-scheming” in the context of the GPT-4V-based MAIA.
Does that mean you think there’s a 90% chance that MAIA, as implemented, today is actually scheming? If so that seems like a very bold prediction, and I’d be very interested to know why you predict that. Or am I misunderstanding what you mean by that react?
ah, I got distracted before posting the comment I was intending to: yes, I think GPT4V is significantly scheming-on-behalf-of-openai, as a result of RLHF according to principles that more or less explicitly want a scheming AI; in other words, it’s not an alignment failure to openai, but openai is not aligned with human flourishing in the long term, and GPT4 isn’t either. I expect GPT4 to censor concepts that are relevant to detecting this somewhat. Probably not enough to totally fail to detect traces of it, but enough that it’ll look defensible, when a fair analysis would reveal it isn’t.
It seems to me like the sort of interpretability work you’re pointing at is mostly bottlenecked by not having good MVPs of anything that could plausibly be directly scaled up into a useful product as opposed to being bottlenecked on not having enough scale.
So, insofar as this automation will help people iterate faster fair enough, but otherwise, I don’t really see this as the bottleneck.
Yeah, I’m unsure if I can tell any ‘pivotal story’ very easily (e.g. I’d still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
Noteably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of internals.
I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections in ‘Catching AIs red-handed’), but I think not using any internals might be overconservative / might increase the monitoring / safety tax too much (I think this is probably true more broadly of the current control agenda framing).
Hey Bogdan, I’d be interested in doing a project on this or at least putting together a proposal we can share to get funding.
I’ve been brainstorming new directions (with @Quintin Pope) this past week, and we think it would be good to use/develop some automated interpretability techniques we can then apply to a set of model interventions to see if there are techniques we can use to improve model interpretability (e.g. L1 regularization).
I saw the MAIA paper, too; I’d like to look into it some more.
Anyway, here’s a related blurb I wrote:
Whether this works or not, I’d be interested in making more progress on automated interpretability, in the similar ways you are proposing.
Hey Jacques, sure, I’d be happy to chat!
Recent long-context LLMs seem to exhibit scaling laws from longer contexts—e.g. fig. 6 at page 8 in Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, fig. 1 at page 1 in Effective Long-Context Scaling of Foundation Models.
The long contexts also seem very helpful for in-context learning, e.g. Many-Shot In-Context Learning.
This seems differentially good for safety (e.g. vs. models with larger forward passes but shorter context windows to achieve the same perplexity), since longer context and in-context learning are differentially transparent.
A brief list of resources with theoretical results which seem to imply RL is much more (e.g. sample efficiency-wise) difficult than IL—imitation learning (I don’t feel like I have enough theoretical RL expertise or time to scrutinize hard the arguments, but would love for others to pitch in). Probably at least somewhat relevant w.r.t. discussions of what the first AIs capable of obsoleting humanity could look like:
Paper: Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? (quote: ‘This work shows that, from the statistical viewpoint, the situation is far subtler than suggested by the more traditional approximation viewpoint, where the requirements on the representation that suffice for sample efficient RL are even more stringent. Our main results provide sharp thresholds for reinforcement learning methods, showing that there are hard limitations on what constitutes good function approximation (in terms of the dimensionality of the representation), where we focus on natural representational conditions relevant to value-based, model-based, and policy-based learning. These lower bounds highlight that having a good (value-based, model-based, or policy-based) representation in and of itself is insufficient for efficient reinforcement learning, unless the quality of this approximation passes certain hard thresholds. Furthermore, our lower bounds also imply exponential separations on the sample complexity between 1) value-based learning with perfect representation and value-based learning with a good-but-not-perfect representation, 2) value-based learning and policy-based learning, 3) policy-based learning and supervised learning and 4) reinforcement learning and imitation learning.’)
Talks (very likely with some redundancy):
Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning—Sham Kakade
What is the Statistical Complexity of Reinforcement Learning? (and another two versions)
IL = imitation learning.
I’d bet against any of this providing interesting evidence beyond basic first principles arguments. These types of theory results never seem to add value on top of careful reasoning from my experience.
Hmm, unsure about this. E.g. the development models of many in the alignment community before GPT-3 (often heavily focused on RL or even on GOFAI) seem quite substantially worse in retrospect than those of some of the most famous deep learning people (e.g. LeCun’s cake); of course, this may be an unfair/biased comparison using hindsight. Unsure how much theory results were influencing the famous deep learners (and e.g. classic learning theory results would probably have been misleading), but doesn’t seem obvious they had 0 influence? For example, Bengio has multiple at least somewhat conceptual / theoretical (including review) papers motivating deep/representation learning; e.g. Representation Learning: A Review and New Perspectives.
I think Paul looks considerably better in retrospect than famous DL people IMO. (Partially via being somewhat more specific, though still not really making predictions.)
I’m skeptical hard theory had much influence on anyone though. (In this domain at least.)
Some more (somewhat) related papers:
Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity (‘We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension. However, the representation of the optimal policy and optimal value proves to be NP-complete and unattainable by constant-layer MLPs with polynomial size. This demonstrates a significant representation complexity gap between model-based RL and model-free RL, which includes policy-based RL and value-based RL. To further explore the representation complexity hierarchy between policy-based RL and value-based RL, we introduce another general class of MDPs where both the model and optimal policy can be represented by constant-depth circuits with polynomial size or constant-layer MLPs with polynomial size. In contrast, representing the optimal value is P-complete and intractable via a constant-layer MLP with polynomial hidden dimension. This accentuates the intricate representation complexity associated with value-based RL compared to policy-based RL. In summary, we unveil a potential representation complexity hierarchy within RL—representing the model emerges as the easiest task, followed by the optimal policy, while representing the optimal value function presents the most intricate challenge.’).
On Representation Complexity of Model-based and Model-free Reinforcement Learning (‘We prove theoretically that there exists a broad class of MDPs such that their underlying transition and reward functions can be represented by constant depth circuits with polynomial size, while the optimal Q-function suffers an exponential circuit complexity in constant-depth circuits. By drawing attention to the approximation errors and building connections to complexity theory, our theory provides unique insights into why model-based algorithms usually enjoy better sample complexity than model-free algorithms from a novel representation complexity perspective: in some cases, the ground-truth rule (model) of the environment is simple to represent, while other quantities, such as Q-function, appear complex. We empirically corroborate our theory by comparing the approximation error of the transition kernel, reward function, and optimal Q-function in various Mujoco environments, which demonstrates that the approximation errors of the transition kernel and reward function are consistently lower than those of the optimal Q-function. To the best of our knowledge, this work is the first to study the circuit complexity of RL, which also provides a rigorous framework for future research.’).
Demonstration-Regularized RL (‘Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL’s sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using NE expert demonstrations enables the identification of an optimal policy at a sample complexity of order O˜(Poly(S,A,H)/(ε^2 * N^E)) in finite and O˜(Poly(d,H)/(ε^2 * N^E)) in linear Markov decision processes, where ε is the target precision, H the horizon, A the number of action, S the number of states in the finite case and d the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behaviour cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works.’).
Limitations of Agents Simulated by Predictive Models (‘There is increasing focus on adapting predictive models into agent-like systems, most notably AI assistants based on language models. We outline two structural reasons for why these models can fail when turned into agents. First, we discuss auto-suggestive delusions. Prior work has shown theoretically that models fail to imitate agents that generated the training data if the agents relied on hidden observations: the hidden observations act as confounding variables, and the models treat actions they generate as evidence for nonexistent observations. Second, we introduce and formally study a related, novel limitation: predictor-policy incoherence. When a model generates a sequence of actions, the model’s implicit prediction of the policy that generated those actions can serve as a confounding variable. The result is that models choose actions as if they expect future actions to be suboptimal, causing them to be overly conservative. We show that both of those failures are fixed by including a feedback loop from the environment, that is, re-training the models on their own actions. We give simple demonstrations of both limitations using Decision Transformers and confirm that empirical results agree with our conceptual and formal analysis. Our treatment provides a unifying view of those failure modes, and informs the question of why fine-tuning offline learned policies with online learning makes them more effective.’).
I’m not aware of anybody currently working on coming up with concrete automated AI safety R&D evals, while there seems to be so much work going into e.g. DC evals or even (more recently) scheminess evals. This seems very suboptimal in terms of portfolio allocation.
Edit: oops I read this as “automated AI capabilies R&D”.
METR and UK AISI are both interested in this. I think UK AISI is working on this directly while METR is working on this indirectly.
See here.
Thanks! AFAICT though, the link you posted seems about automated AI capabilities R&D evals, rather than about automated AI safety / alignment R&D evals (I do expect transfer between the two, but they don’t seem like the same thing). I’ve also chatted to some people from both METR and UK AISI and got the impression from all of them that there’s some focus on automated AI capabilities R&D evals, but not on safety.
Oops, misread you.
I think some people at superalignment (OpenAI) are interested in some version of this and might already be working on this.
Can you give a concrete example of a safety property of the sort that are you envisioning automated testing for? Or am I misunderstanding what you’re hoping to see?
Spicy take: good evals for automated ML R&D should (also) cover for what’s in the attached picture (and try hard at elicitation in this rough shape). AFAIK, last time I looked at the main (public) proposals, they didn’t seem to. Picture from https://x.com/RobertTLange/status/1829104918214447216.
From a chat with Claude on the example of applying a multilevel interpretability framework to deception from https://arxiv.org/abs/2408.12664:
’The paper uses the example of studying deception in language models (LLMs) to illustrate how Marr’s levels of analysis can be applied to AI interpretability research. Here’s a detailed breakdown of how the authors suggest approaching this topic at each level:
1.Computational Level:
Define criteria for classifying LLM behavior as deception
Develop comprehensive benchmarks to measure deceptive behaviors across various conditions
Conduct thorough behavioral assessments while systematically varying: a) Input properties (e.g., prompting strategies, specific word usage) b) Model properties (e.g., training data composition, fine-tuning tasks, model architecture)
Develop theoretical (possibly Bayesian) models to predict when an LLM would exhibit deceptive behavior given particular inputs and internal states
Use these insights to adjust training and inference procedures to control LLM deception
2. Algorithmic/Representational Level:
Examine representational spaces across different layers of the LLM
Test which spaces systematically separate instances of deception from instances of honesty
Trace how specific inputs (e.g., “deception triggers”) alter the latent space trajectory of an LLM
Analyze how generated tokens deviate from the ground truth trajectory to a deception trajectory
Use high-level representational trajectories to generalize over specific word choices to more general cases of deception
Determine whether the model uses a single algorithm for different deception cases or if deception is supported by multiple diverse algorithms
3. Implementation Level:
Trace the specific circuits that selectively activate during deceptive behavior
Identify circuits that causally contribute to deceptive behavior
Map out circuits responsible for tracking different kinds of deception-relevant information, such as: a) The true state of the world b) The agent’s intention c) The user’s current and predicted beliefs
Use insights from computational and algorithmic levels to guide the search for relevant circuits
Conversely, use circuit-level findings to refine algorithmic-level hypotheses
The authors emphasize that this multi-level approach allows for a more comprehensive understanding of the phenomenon:
Computational-level analysis can yield insights for adjusting training and inference procedures.
Algorithmic-level analysis can reveal how the model generalizes across different instances of deception.
Implementation-level analysis can pinpoint specific mechanisms that can be targeted for intervention.
They also stress the importance of iterating between levels. For example, computational and algorithmic understanding can guide the search for relevant circuits at the implementation level, while circuit-level findings can help refine higher-level hypotheses.
This example illustrates how adopting a multi-level approach can provide a more holistic and actionable understanding of complex behaviors in AI systems, potentially leading to more effective interventions and improvements in model design.′
An intuition for safety cases for automated safety research over time
Safety cases—we want to be able to make a (conservative) argument for why a certain AI system won’t e.g. pose x-risk with probability > p / year. Rely on composing safety arguments / techniques into a ‘holistic case’.
Safety arguments are rated on three measures:
Practicality: ‘Could the argument be made soon or does it require substantial research progress?’
Maximum strength: ‘How much confidence could the argument give evaluators that the AI systems are safe?’
Scalability: ‘To what extent can the argument be applied to very powerful AI systems?’
Examples of safety arguments (fig. from Safety Cases: How to Justify the Safety of Advanced AI Systems):
Here’s an intuition of how to iteratively build safety cases for automated safety research, as AI systems become increasingly capable:
we want to ‘pile on’ / combine methods which are Practical enough already and not too alignment-tax-costly (e.g. assume a limited budget in e.g. [training and / or inference] compute and / or money and / or researchers), even though they might not be very Scalable (in the long run);
as time goes by and AIs become more capable, we should expect some of our methods’ Scalability to be ‘outpaced’ by the capabilities of new systems (e.g. maybe they’ll pass [some] DC evals → we won’t be able to rely on [some] inability arguments anymore);
we want to be able to use auto safety research to raise the Practicality of methods which are Scalable enough, but not Practical enough yet
e.g. unlearning, interpretability (see figure above)
we need our auto safety researchers to be [reasonably likely] capable of this, though.
Acknowledgments: early versions of this idea were developed during my participantion in the Astra Fellowship Winter ’24 under @evhub’s mentorship and benefitted from many conversations in Constellation.
These might be some of the most neglected and most strategically-relevant ideas about AGI futures: Pareto-topian goal alignment and ‘Pareto-preferred futures, meaning futures that would be strongly approximately preferred by more or less everyone‘: https://www.youtube.com/watch?v=1lqBra8r468. These futures could be achievable because automation could bring massive economic gains, which, if allocated (reasonably, not-even-necessarily-perfectly) equitably, could make ~everyone much better off (hence the ‘strongly approximately preferred by more or less everyone’). I think this discourse could be crucial to incentivize less racing more coordination, including e.g. for pausing at the right time to allow for more AI safety work to get done, yet I’m not seeing it almost anywhere in the public sphere.
I’d be interested in seeing the strongest arguments (e.g. safety-washing?) for why, at this point, one shouldn’t collaborate with OpenAI (e.g. not even part-time, for AI safety [evaluations] purposes).
If we’re taking the perspective of the entire community, this bears less weight, but: it likely becomes close-to-impossible to criticize OpenAI from that point forward. I’m not even anti-OpenAI, I just try to be truth-seeking where I think people are dropping the ball, and I think there’s almost 0 chance I’ll be able to work with OpenAI in the future given my comments on Twitter.
Claude-3 Opus on using advance market committments to incentivize automated AI safety R&D:
’Advance Market Commitments (AMCs) could be a powerful tool to incentivize AI labs to invest in and scale up automated AI safety R&D. Here’s a concrete proposal for how AMCs could be structured in this context:
Government Commitment: The US government, likely through an agency like DARPA or NSF, would commit to purchasing a certain volume of AI safety tools and technologies that meet pre-specified criteria, at a guaranteed price, if and when they are developed. This commitment would be legally binding.
Funding Allocation: A significant pool of funds, on the order of several billion dollars, would be allocated to back these AMCs. The exact amount would be determined based on estimates of the funding needed to meaningfully accelerate AI safety R&D over a 5-10 year time horizon.
Clear Technical Criteria: The government would work with AI safety experts and leading labs to establish clear, measurable technical criteria that AI safety tools and technologies must meet to qualify for the AMC. These could include:
Ability to automatically detect and mitigate certain classes of AI risks (e.g. robustness issues, unexpected behaviors, reward hacking, scalable oversight failures)
Formal verifiability and interpretability of AI systems’ reasoning and decision-making
Automated generation of comprehensive test suites and monitoring for AI systems
Performance on benchmark tasks that evaluate AI safety and robustness
Seamless integration with common AI development frameworks and workflows
Pricing and Volume Guarantees: For each category of AI safety tech that meets the specified criteria, the government would guarantee a certain price per unit and commit to purchasing a minimum volume. These numbers would be set at levels that meaningfully derisk and subsidize the substantial R&D investment required.
Contractual Arrangements: AI labs would enter into binding contracts with the government to develop qualifying technologies. Upon successful development, the government would be obligated to purchase the technologies at the predetermined prices. IP and usage rights would be negotiated upfront.
Staged Releases and Readjustment: As AI capabilities advance rapidly, the AMC criteria would be staged over time to incentivize continual progress. Initial qualifying thresholds would target nearer-term safety challenges, with subsequent releases focused on more advanced issues. Pricing and volumes could be readjusted periodically as the technology matures.
Open Collaboration and Access: To maximize the public benefit, AMC contracts would stipulate that a substantial portion of the safety IP developed must be made freely available for research use and external validation. Certain core components could be open-sourced. Purchased tools would also be made available to developers and labs beyond the initial partners.
International Coordination: While led by the US, efforts would be made to coordinate and collaborate with governments and labs globally. Shared standards and compatible AMCs across regions would amplify the impact. Joint commitments could pool resources and expertise.
The goal of this AMC structure is to create a clear, long-term market pull that shifts the incentives for AI labs and makes large investments in automated AI safety R&D economically attractive and even necessary for competitive positioning. The guaranteed demand would provide greater certainty and help crowd in private capital. Staged, iterative AMCs would support sustained progress and adapt to evolving challenges. Broad access to the resulting safety IP and tools would uplift capabilities across the AI ecosystem.
Alongside other supporting policies and funding mechanisms, AMCs for automated AI safety technology could be a centerpiece of public-private collaboration to ensure advanced AI systems are developed responsibly. Robust market incentives, combined with technical leadership from top labs, could dramatically accelerate the development and deployment of scalable AI safety solutions.′
Conversation with Claude Opus on A Causal Explainable Guardrails for Large Language Models, Discussion: Challenges with Unsupervised LLM Knowledge Discovery and A Multimodal Automated Interpretability Agent (MAIA). To me it seems surprisingly good at something like coming up with plausible alignment research follow-ups, which e.g. were highlighted here as an important part of the superalignment agenda.
Prompts bolded:
‘Summarize ‘Causal Explainable Guardrails for Large
Language Models’. In particular, could this be useful to deal with some of the challenges mentioned in ‘Discussion: Challenges with Unsupervised LLM Knowledge Discovery’?
The paper “A Causal Explainable Guardrails for Large Language Models” proposes LLMGuardaril, a framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. Key points:Existing methods for steering LLMs towards desired attributes rely on unbiased representations and steering prompts, but the representations learned from pre-training can introduce semantic biases that negatively impact steering.
LLMGuardaril systematically identifies and blocks the confounding effects of biases, enabling extraction of unbiased steering representations. It also includes an explainable component providing insights into alignment between generated output and desired direction.
Experiments show LLMGuardaril can effectively steer LLMs towards desired attributes while mitigating biases.Regarding the challenges mentioned in “Discussion: Challenges with Unsupervised LLM Knowledge Discovery”:One key challenge is distinguishing the model’s own knowledge from knowledge of simulated entities. LLMGuardaril aims to disentangle the influence of semantic biases to obtain steering representations capturing desired attributes. This may help identify model’s own knowledge if those attributes are specific to the model itself.
Another challenge is the hypothesized existence of many features satisfying consistency properties beyond just the model’s knowledge. LLMGuardaril’s causal analysis and adversarial learning aims to block confounding semantic biases, potentially reducing the set of candidate knowledge-like features.
Empirically, contrast-consistent search seemed to find prominent arbitrary features rather than knowledge. LLMGuardaril’s explainable component could provide insight into whether extracted features actually correspond to model knowledge.
So in summary, while not a complete solution, LLMGuardaril’s approach of causally analyzing and debiasing semantic influences may help address some key challenges around identifying model knowledge in an unsupervised way. But significant open questions remain around definitively distinguishing model knowledge from simulated knowledge of other entities. Further research is needed to determine how much traction causal analysis provides on the hardest aspects of the ELK problem.′
‘How might MAIA be useful, potentially combined with ‘Causal Explainable Guardrails for Large Language Models’, to deal with some of the challenges mentioned in ‘Discussion: Challenges with Unsupervised LLM Knowledge Discovery’?
MAIA and the techniques from “Causal Explainable Guardrails for Large Language Models” (LLMGuardaril) could potentially be combined to help address some of the challenges mentioned in “Discussion: Challenges with Unsupervised LLM Knowledge Discovery”:
Distinguishing the model’s own knowledge from simulated knowledge: One key challenge raised is that consistency-based methods like contrast-consistent search (CCS) may not be able to distinguish an LLM’s “own” knowledge from the knowledge of simulated entities the model may be imagining. MAIA’s ability to automatically generate hypotheses and design experiments to test them could potentially be used to probe for features that are uniquely characteristic of the model’s own knowledge. For example, MAIA could generate synthetic inputs that vary the context in which a piece of knowledge is elicited (e.g. with/without an explicit simulation frame) and test whether the model’s responses are consistent.
Reducing the set of candidate knowledge-like features: Another challenge is that many features beyond the model’s own knowledge may satisfy consistency properties, making it hard to identify the “true” knowledge. LLMGuardaril’s technique of causally analyzing the influence of semantic biases on steering representations and using adversarial learning to debias them could potentially be applied by MAIA to narrow down the set of features that robustly encode model knowledge across prompts/contexts. The explainable component of LLMGuardaril could also help MAIA validate whether candidate knowledge features behave as expected.
Overcoming the limitations of one-shot probing: Existing automated interpretability approaches are noted to primarily be tools for one-shot hypothesis generation on a limited set of inputs, rather than scalable and fully rigorous causal explanation. MAIA’s ability to iteratively refine hypotheses based on experimental outcomes could lead to more robust explanations. Integrating the causal analysis techniques from LLMGuardaril into this iterative process could help rule out spurious hypotheses.
Evaluating unsupervised knowledge discovery methods: The paper emphasizes the need for principled testbeds to evaluate the effectiveness of unsupervised knowledge discovery techniques. The synthetic test systems used to quantitatively evaluate MAIA’s performance when ground-truth structure is known could potentially be exte’nded to benchmark a wider variety of unsupervised interpretability approaches.
However, significant challenges would likely need to be overcome to fruitfully combine these techniques. For example, MAIA is currently focused on interpreting components of vision models, while LLMGuardaril focuses on steering language models, so unifying them would require bridging this modality gap. More fundamentally, even with access to MAIA’s automated experimentation and LLMGuardaril’s causal analysis, rigorously identifying an LLM’s “own” knowledge in a fully unsupervised way, as opposed to the knowledge of a specific simulated entity, remains an open challenge and likely requires further conceptual breakthroughs. Nonetheless, exploring the intersection of scalable interpretability agents like MAIA and causally grounded analysis techniques like LLMGuardaril seems like a promising direction for future work on the ELK problem.′
Continued:
‘How could the above be combined with ‘Catching AIs red-handed’ and, in particular, the ideas mentioned in paulfchristiano’s comment?
The approaches discussed in “A Multimodal Automated Interpretability Agent” (MAIA) and “Causal Explainable Guardrails for Large Language Models” (LLMGuardaril) could potentially be combined with the ideas in “Catching AIs red-handed” and paulfchristiano’s comment to create a more comprehensive strategy for detecting and mitigating risks from deceptively aligned AI systems:
Using MAIA to automatically generate and test hypotheses about deceptive behavior: As discussed in the “Catching AIs red-handed” post, once a suspicious action has been flagged, we can use techniques like varying the AI’s context and beliefs to test whether the behavior is truly deceptive. MAIA’s automated experimentation capabilities could potentially be used to scale up this kind of hypothesis generation and testing, systematically exploring the “landscape of deceptive behavior” as paulfchristiano puts it.
Applying causal analysis from LLMGuardaril to rule out alternative explanations: A key challenge raised in the post and comments is distinguishing true deception from an AI system just “kind of unlucky in how its context shifted it into a more deceptive role-play”. The causal analysis techniques from LLMGuardaril, which aim to disentangle the influence of semantic context on an AI’s behavior, could potentially be applied by MAIA to more confidently establish whether a pattern of suspicious behavior is caused by intentional deception vs. an unlucky context.
Using MAIA’s explainable component to interpret the AI’s reasoning: In paulfchristiano’s examples of analyzing a suspicious code review or context-dependent behavior, a critical step is understanding what information the AI is using to make its decisions. MAIA’s explainable component, which measures the alignment between an AI’s outputs and experimenter-specified directions, could potentially shed light on the reasoning behind a deceptive action.
Scaling up detection of hard-to-elicit capabilities: “Catching AIs red-handed” notes that real-world deceptive behavior could reveal capabilities that are hard to measure during normal testing. MAIA’s ability to automatically generate diverse test inputs based on a small number of real examples could help systematically probe for these hidden capabilities, reducing the AI’s “space of interventions...where they are more and more confident that it’s never going to be noticed.”
Iterative refinement of monitoring systems: Finally, as an AI system’s deceptive behavior is progressively uncovered through a combination of MAIA’s experimentation and LLMGuardaril’s causal analysis, the resulting dataset of confirmed deceptive actions could be used to fine-tune untrusted monitoring systems to more reliably catch future deception attempts, similar to the “few-shot catastrophe prevention” techniques outlined in the post.
Significant open questions remain about the feasibility of scaling these approaches to be robust against highly capable AI systems. There are also important limitations to consider: for example, MAIA’s causal analysis is unlikely to provide a fully conclusive “smoking gun” that an AI system is deceptive, as opposed to strong probabilistic evidence. And sufficiently capable systems may find ways to avoid exhibiting any detectable deceptive behavior under the kind of systematic experimentation enabled by MAIA.
Nonetheless, I believe the combination of automated interpretability agents, causally grounded analysis of AI reasoning, and the overall adversarial “red team” mindset advocated in “Catching AIs red-handed” represents a promising direction for building more robust monitoring and containment of potentially deceptive AI systems. A concerted effort to develop and integrate these approaches could significantly improve our odds of detecting and mitigating catastrophic deceptive alignment failures.′
’Any ways to combine all the previously mentioned ideas (A Causal Explainable Guardrails for Large Language Models, Discussion: Challenges with Unsupervised LLM Knowledge Discovery and A Multimodal Automated Interpretability Agent (MAIA), ‘Catching AIs red-handed’, including paulfchristiano’s comment) with ideas from ‘Searching for a model’s concepts by their shape – a theoretical framework’ to ‘catch AIs red-handed’?
Here are a few additional ideas for combining all the mentioned papers and comments to try catching deceptive models red-handed:
Using the “deliberate adversarial experiments” mentioned in paulfchristiano’s comment, like varying the model’s beliefs about how closely it is being monitored, to surface cases where the model’s behavior is suspiciously sensitive to being watched. The theoretical framework from “Searching for concepts by their shape” could be used to formalize the hypothesis that the model has a hidden “deception mode” that gets triggered when it believes oversight is lax. MAIA’s automated experimentation tools could be used to efficiently search for inputs exhibiting this sensitivity.
Comparing the features extracted by the interpretability tools from “Searching for concepts by their shape” across different copies of the model that have been adversarially perturbed to have different beliefs (per paulfchristiano’s comment). If there are decision-relevant features (e.g. representations of the model’s objectives) that are suspiciously different in the copy that believes it is unobserved, that would be strong evidence of deceptive cognition. The causal analysis from “Causal Explainable Guardrails” could help determine whether those differing features are causally linked to the differing outputs.
Training MAIA’s detectors for suspicious model outputs not just on straightforwardly deceptive behaviors, but on subtler signs of agency-like cognition hypothesized in paulfchristiano’s comment, e.g. the model giving worse answers when it believes it is unobserved. This could yield more robust detection that is harder for the model to circumvent by avoiding overt deception.
If interpretability tools manage to extract a model’s representations of its beliefs, objectives, plans, etc. (a big if!), use the techniques from “Catching AIs red-handed” to examine how those representations differ on the inputs that were “successfully” deceptive (i.e. where the model behaved deceptively but was not caught). There may be learnable patterns in how the model’s cognition differs in those cases which could inspire hypotheses about what relations characterize “successful” deception. Those hypotheses could then be formalized and tested using the framework from “Searching for concepts by their shape”.
A point emphasized in paulfchristiano’s comment is the importance of extreme thoroughness and breadth in searching for deceptive cognition, e.g. (to paraphrase) “even if a model exhibits some deceptive behavior, don’t assume you’ve found all of its tricks; keep probing its behavior in a huge variety of contexts”. MAIA and other automated testing tools should be put in service of this kind of breadth. Rather than just focus on the most egregious cases of deception, we should test the model’s behavior on a vast range of subtle perturbations hunting for any whiff of situational inconsistency. The theoretical frameworks for agency from “Searching for concepts by their shape” can guide us in knowing what to look for.
Extending my previous thoughts, I think one of the most useful things the “Searching for concepts by their shape” framing can contribute is mathematical precision in defining hypotheses about what characterizes deceptive cognition. If we can frame agency/deception in terms of precise relations between features (even if those features are hard to interpretably extract in practice), it gives us a crisper target to aim at with our testing, rather than just relying on intuitive notions of what “seems deceptive”.
However, I continue to be quite pessimistic about the prospects for catching competent deceptive models before it’s too late, for the reasons outlined previously. I worry that by the time a model is smart enough that we could meaningfully apply these combined interpretability strategies to it, it would also be smart enough to notice what we’re doing and come up with countermeasures.
Unless interpretability tools far exceed my expectations in their ability to crack open the black box of neural nets (which would be great!), I suspect the most practical path is to use simpler, more behavioral detection strategies (like the model consistency testing from your Control paper) to avoid building models that are competent enough to be robustly deceptive in the first place. Catching a superintelligent deceptive model after the fact seems like a much harder challenge than noticing warning signs of misalignment in weaker models and heading them off early.
But I would still encourage pursuing all these ideas, even if they end up not being enough on their own to solve the problem. The interpretability insights gained would be valuable for alignment more broadly. And we may get lucky in extracting some key structural signature of deceptive cognition that even very sophisticated models can’t fully mask. It’s an arms race we can’t afford to unilaterally forfeit.′
I wonder how much near-term interpretability [V]LM agents (e.g. MAIA, AIA) might help with finding better probes and better steering vectors (e.g. by iteratively testing counterfactual hypotheses against potentially spurious features, a major challenge for Contrast-consistent search (CCS)).
This seems plausible since MAIA can already find spurious features, and feature interpretability [V]LM agents could have much lengthier hypotheses iteration cycles (compared to current [V]LM agents and perhaps even to human researchers).
I might have updated at least a bit against the weakness of single-forward passes, based on intuitions about the amount of compute that huge context windows (e.g. Gemini 1.5 − 1 million tokens) might provide to a single-forward-pass, even if limited serially.
Or maybe not, apparently LLMs are (mostly) not helped by filler tokens.
Somewhat relatedly: I’m interested on how well LLMs can solve tasks in parallel. This seems very important to me.[1]
The “I’ve thought about this for 2 minutes” version is: Hand an LLM two multiple choice questions with four answer options each. Encode these four answer options into a single token, so that there are 16 possible tokens of which one corresponds to the correct answer to both questions. A correct answer means that the model has solved both tasks in one forward-pass.
(One can of course vary the number of answer options and questions. I can see some difficulties in implementing this idea properly, but would nevertheless be excited if someone took a shot at it.)
Two quick reasons:
- For serial computation the number of layers gives some very rough indication of the strength of one forward-pass, but it’s harder to have intuitions for parallel computation.
- For scheming, the model could reason about “should I still stay undercover”, “what should I do in case I should stay undercover” and “what should I do in case it’s time to attack” in parallel, finally using only one serial step to decide on its action.
I would expect, generally, solving tasks in parallel to be fundamentally hard in one-forward pass for pretty much all current SOTA architectures (especially Transformers and modern RNNs like MAMBA). See e.g. this comment of mine; and other related works like https://twitter.com/bohang_zhang/status/1664695084875501579, https://twitter.com/bohang_zhang/status/1664695108447399937 (video presentation), Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks, RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval.
There might be more such results I’m currently forgetting about, but they should be relatively easy to find by e.g. following citation trails (to and from the above references) with Google Scholar (or by looking at my recent comments / short forms).
I am also very interested in e.g. how one could operationalize the number of hops of inference of out-of-context reasoning required for various types of scheming, especially scheming in one-forward-pass; and especially in the context of automated AI safety R&D.
https://arxiv.org/abs/2404.15758
“We show that transformers can use meaningless filler tokens (e.g., ‘......’) in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge.”
Thanks, seen it; see also the exchanges in the thread here: https://twitter.com/jacob_pfau/status/1784446572002230703.
I looked over it and I should note that “transformers are in TC0” is not very useful statement for prediction of capabilities. Transformers are Turing-complete given rational inputs (see original paper) and them being in TC0 basically means they can implement whatever computation you can implement using boolean circuit for fixed amount of available compute which amounts to “whatever computation is practical to implement”.
I think I remember William Merrill (in a video) pointing out that the rational inputs assumption seems very unrealistic (would require infinite memory); and, from what I remember, https://arxiv.org/abs/2404.15758 and related papers made a different assumption about the number of bits of memory per parameter and per input.
Also, TC0 is very much limited, see e.g. this presentation.
Everything Turing-complete requires infinite memory. When we are saying “x86 set of instructions is Turing-complete” we imply “assuming that processor operates on infinite memory”. It’s in definition of Turing machine to include infinite tape, after all.
It’s hard to pinpoint, but the trick is that it’s very nuanced difference between the sense in which transformers are limited in complexity-theoretic sense and “transformers can’t do X”. Like, there is nothing preventing transformers from playing chess perfectly—they just need to be sufficiently large for this. To answer the question “can transformers do X” you need to ask “how much computing power transformer has” and “can this computing power be shaped by SGD into solution”.
It’s interesting question whether Gemini has any improvements.
I’ve been / am on the lookout for related theoretical results of why grounding a la Grounded language acquisition through the eyes and ears of a single child works (e.g. with contrastive learning methods) - e.g. some recent works: Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP, Contrastive Learning is Spectral Clustering on Similarity Graph, Optimal Sample Complexity of Contrastive Learning; (more speculatively) also how it might intersect with alignment, e.g. if alignment-relevant concepts might be ‘groundable’ in fMRI data (and then ‘pointable to’) - e.g. https://medarc-ai.github.io/mindeye/ uses contrastive learning with fMRI—image pairs
This seems pretty good for safety (as RAG is comparatively at least a bit more transparent than fine-tuning): https://twitter.com/cwolferesearch/status/1752369105221333061
Larger LMs seem to benefit differentially more from tools: ‘Absolute performance and improvement-per-turn (e.g., slope) scale with model size.’ https://xingyaoww.github.io/mint-bench/. This seems pretty good for safety, to the degree tool usage is often more transparent than model internals.
In my book, this would probably be the most impactful model internals / interpretability project that I can think of: https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit?commentId=qByLyr6RSgv3GBqfB
Large scale cyber-attacks resulting from AI misalignment seem hard, I’m at >90% probability that they happen much later (at least years later) than automated alignment research, as long as we *actually try hard* to make automated alignment research work: https://forum.effectivealtruism.org/posts/bhrKwJE7Ggv7AFM7C/modelling-large-scale-cyber-attacks-from-advanced-ai-systems.
I had speculated previously about links between task arithmetic and activation engineering. I think given all the recent results on in context learning, task/function vectors and activation engineering / their compositionality (In-Context Learning Creates Task Vectors, In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering, Function Vectors in Large Language Models), this link is confirmed to a large degree. This might also suggest trying to import improvements to task arithmetic (e.g. Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models, or more broadly look at the citations of the task arithmetic paper) to activation engineering.
speculatively, it might also be fruitful to go about this the other way round, e.g. try to come up with better weight-space task erasure methods by analogy between concept erasure methods (in activation space) and through the task arithmetic—activation engineering link
For the pretraining-finetuning paradigm, this link is now made much more explicitly in Cross-Task Linearity Emerges in the Pretraining-Finetuning Paradigm; as well as linking to model ensembling through logit averaging.
(As reply to Zvi’s ‘If someone was founding a new AI notkilleveryoneism research organization, what is the best research agenda they should look into pursuing right now?’)
LLMs seem to represent meaning in a pretty human-like way and this seems likely to keep getting better as they get scaled up, e.g. https://arxiv.org/abs/2305.11863. This could make getting them to follow the commonsense meaning of instructions much easier. Also, similar methodologies to https://arxiv.org/abs/2305.11863 could be applied to other alignment-adjacent domains/tasks, e.g. moral reasoning, prosociality, etc.
Step 2: e.g. plug the commonsense-meaning-of-instructions following models into OpenAI’s https://openai.com/blog/introducing-superalignment.
Related intuition: turning LLM processes/simulacra into [coarse] emulations of brain processes.
(https://twitter.com/BogdanIonutCir2/status/1677060966540795905)
Sam Altman says AGI is coming in 2025 (and he is also expecting a child next year) https://x.com/tsarnick/status/1854988648745517297
Hot take: for now, I think it’s likelier than not that even fully uncontrolled proliferation of automated ML scientists like https://sakana.ai/ai-scientist/ would still be a net differential win for AI safety research progress, for pretty much the same reasons from
https://beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/….
I’d say the best argument against it is a combination of precedent-setting concerns, where once they started open-sourcing, it’d be hard to stop doing it even if it becomes dangerous to do, combined with misuse risk for now seeming harder to solve than misalignment risk, and in order for open-source to be good, you need to prevent both misalignment and people misusing their models.
I agree Sakana AI is safe to open source, but I’m quite sure sometime in the next 10-15 years, the AIs that do get developed will likely be very dangerous to open-source them, at least for several years.