Were you able to check the prediction in the section “Non-sourcelike references”?
Great writeup! I recently wrote a brief summary and review of the same paper.
Alaga & Schuett (2023) propose a framework for frontier AI developers to manage potential risk from advanced AI systems by coordinating pauses when models are assessed to have dangerous capabilities, such as the capacity to develop biological weapons.
The scheme has five main steps:
1. Frontier AI models are evaluated by developers or third parties to test for dangerous capabilities.
2. If a model is shown to have dangerous capabilities (“fails evaluations”), the developer pauses training and deployment of that model, restricts access to similar models, and delays related research.
3. Other developers are notified whenever a dangerous model is discovered, and also pause similar work.
4. The failed model’s capabilities are analyzed and safety precautions are implemented during the pause.
5. Developers only resume paused work once adequate safety thresholds are met.
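Purely as an illustration, the five steps could be sketched as a toy coordination protocol. All class and method names below are hypothetical and not from the paper; this is a minimal sketch of the notify-pause-resume logic, not anything the authors specify:

```python
from enum import Enum, auto

class ModelStatus(Enum):
    # Hypothetical states a developer's frontier model can be in.
    IN_TRAINING = auto()
    PAUSED = auto()
    CLEARED = auto()

class Developer:
    """One frontier AI developer participating in the scheme (illustrative)."""
    def __init__(self, name):
        self.name = name
        self.status = ModelStatus.IN_TRAINING

    def pause(self):
        # Step 2: pause training/deployment of the failing or similar model.
        self.status = ModelStatus.PAUSED

class CoordinatedPausing:
    """Toy coordinator: when any model fails evals, everyone pauses."""
    def __init__(self, developers):
        self.developers = developers

    def report_evaluation(self, developer, passed):
        # Step 1 happens off-stage (evals by developers or third parties);
        # this records the result. Steps 2-3: on a failure, the reporting
        # developer and all notified developers pause similar work.
        if not passed:
            for d in self.developers:
                d.pause()

    def resume(self, safety_threshold_met):
        # Step 5: paused work resumes only once safety thresholds are met
        # (step 4, the analysis during the pause, is abstracted into this flag).
        if safety_threshold_met:
            for d in self.developers:
                if d.status is ModelStatus.PAUSED:
                    d.status = ModelStatus.CLEARED
```

The key structural point the sketch makes is that a single failed evaluation changes the state of every participant, not just the developer whose model failed, which is what distinguishes coordinated pausing from each lab's individual responsible scaling policy.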
The report discusses four versions of this coordination scheme:
1. Voluntary – developers face public pressure to evaluate and pause but make no formal commitments.
2. Pausing agreement – developers collectively commit to the process in a contract.
3. Mutual auditor – developers hire the same third party to evaluate models and require pausing.
4. Legal requirements – laws mandate evaluation and coordinated pausing.
The authors of the report prefer the third and fourth versions, which they consider the most effective.
Strengths and weaknesses
The report addresses the important and underexplored question of what AI labs should do in response to evaluations finding dangerous capabilities. Coordinated pausing is a valuable contribution to this conversation. The proposed scheme seems relatively effective and potentially feasible, as it aligns with the efforts of the dangerous-capability evaluation teams of OpenAI and the Alignment Research Center.
A key strength is the report’s thorough description of multiple forms of implementation for coordinated pausing. This ranges from voluntary participation relying on public pressure, to contractual agreements among developers, shared auditing arrangements, and government regulation. Having flexible options makes the framework adaptable and realistic to put into practice, rather than a rigid, one-size-fits-all proposal.
The report acknowledges several weaknesses of the proposed framework, including potential harms from its implementation. For example, coordinated pausing could provide time for competing countries (such as China) to “catch up,” which may be undesirable from a US policy perspective. A pause could also be followed by a rapid jump in capabilities, as algorithmic improvements discovered during the pause are applied all at once, which may be less safe than a “slow takeoff.”
Additionally, the paper acknowledges concerns with feasibility, such as the potential that coordinated pausing may violate US and EU antitrust law. As a countermeasure, it suggests making “independent commitments to pause without discussing them with each other,” with no retaliation against non-participating AI developers, but defection would seem to be an easy option under such a scheme. It recommends further legal analysis and consultation regarding this topic, but the authors are not able to provide assurances regarding the antitrust concern. The other feasibility concerns – regarding enforcement, verifying that post-deployment models are the same as evaluated models, potential pushback from investors, and so on – are adequately discussed and appear possible to overcome.
One weakness of the report is that the motivation for coordinated pausing is not presented in a compelling manner. The report provides twelve pages of implementation details before explaining the benefits. These benefits, such as “buying more time for safety research,” are indirect and may not be persuasive to a skeptical reader. AI lab employees and policymakers often take the stance that technological innovation, especially in AI, should not be hindered unless harm is demonstrated. Even if the report intends to take a balanced perspective rather than advocate for the proposed framework, the arguments provided in its favor seem weaker than what is possible.
It seems intuitive that deployment of a dangerous AI system should be halted, though it is worth clearly noting that “failing” a dangerous-capability evaluation does not necessarily mean that the AI system has dangerous capabilities in practice. However, it is not clear why the development of such a system must also be paused. As long as the dangerous AI system is not deployed, further pretraining of the model does not appear to pose risks. AI developers may worry about falling behind competitors, so the costs of this requirement must be clearly motivated for them to be on board.
While the report makes a solid case for coordinated pausing, it has gaps around considering additional weaknesses of the framework, explaining its benefits, and solving key feasibility issues. More work may be done to strengthen the argument to make coordinated pausing more feasible.
Excited to see forecasting as a component of risk assessment, in addition to evals!
I was still confused when I opened the post. My presumption was that “clown attack” referred to a literal attack involving literal clowns. If you google “clown attack,” the results are about actual clowns. I wasn’t sure if this post was some kind of joke, to be honest.
Do we still not have any better timelines reports than bio anchors? From the frame of bio anchors, GPT-4 is merely on the scale of two chinchillas, yet outperforms above-average humans at standardized tests. It’s not a good assumption that AI needs 1 quadrillion parameters to have human-level capabilities.
OpenAI’s “Planning for AGI and beyond” already explains why they are building AGI:
Our mission is to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity.
If AGI is successfully created, this technology could help us elevate humanity by increasing abundance, turbocharging the global economy, and aiding in the discovery of new scientific knowledge that changes the limits of possibility.
AGI has the potential to give everyone incredible new capabilities; we can imagine a world where all of us have access to help with almost any cognitive task, providing a great force multiplier for human ingenuity and creativity.
On the other hand, AGI would also come with serious risk of misuse, drastic accidents, and societal disruption. Because the upside of AGI is so great, we do not believe it is possible or desirable for society to stop its development forever; instead, society and the developers of AGI have to figure out how to get it right. ^A
^A We seem to have been given lots of gifts relative to what we expected earlier: for example, it seems like creating AGI will require huge amounts of compute and thus the world will know who is working on it, it seems like the original conception of hyper-evolved RL agents competing with each other and evolving intelligence in a way we can’t really observe is less likely than it originally seemed, almost no one predicted we’d make this much progress on pre-trained language models that can learn from the collective preferences and output of humanity, etc.
AGI could happen soon or far in the future; the takeoff speed from the initial AGI to more powerful successor systems could be slow or fast. Many of us think the safest quadrant in this two-by-two matrix is short timelines and slow takeoff speeds; shorter timelines seem more amenable to coordination and more likely to lead to a slower takeoff due to less of a compute overhang, and a slower takeoff gives us more time to figure out empirically how to solve the safety problem and how to adapt.
This doesn’t include discussion of what would make them decide to stop building AGI, but would you be happy if other labs wrote a similar statement? I’m not sure that AI labs actually have an attitude of “we wish we didn’t have to build AGI.”
Do you think if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, would this reduce the overall x-risk/p(doom) or not?
Probably, but Anthropic is actively working in the opposite direction:
This means that every AWS customer can now build with Claude, and will soon gain access to an exciting roadmap of new experiences—including Agents for Amazon Bedrock, which our team has been instrumental in developing.
Currently available in preview, Agents for Amazon Bedrock can orchestrate and perform API calls using the popular AWS Lambda functions. Through this feature, Claude can take on a more expanded role as an agent to understand user requests, break down complex tasks into multiple steps, carry on conversations to collect additional details, look up information, and take actions to fulfill requests. For example, an e-commerce app that offers a chat assistant built with Claude can go beyond just querying product inventory – it can actually help customers update their orders, make exchanges, and look up relevant user manuals.

Obviously, Claude 2 as a conversational e-commerce agent is not going to pose catastrophic risk, but it wouldn’t be surprising if building an ecosystem of more powerful AI agents increased the risk that autonomous AI agents cause catastrophic harm.
From my reading of ARC Evals’ example of a “good RSP”, RSPs set a standard that roughly looks like: “we will continue scaling models and deploying if and only if our internal evals team fails to empirically elicit dangerous capabilities. If they do elicit dangerous capabilities, we will enact safety controls just sufficient for our models to be unsuccessful at, e.g., creating Super Ebola.”
This is better than a standard of “we will scale and deploy models whenever we want,” but still has important limitations. As noted by the “coordinated pausing” paper, it would be problematic if “frontier AI developers and other stakeholders (e.g. regulators) rely too much on evaluations and coordinated pausing as their main intervention to reduce catastrophic risks from AI.”
Some limitations:
Misaligned incentives. The evaluation team may have an incentive to find fewer dangerous capabilities than it could. When findings of dangerous capabilities could lead to timeline delays, public criticism, and lost revenue for the company, an internal evaluation team has a conflict of interest. Even with external evaluation teams, AI labs may choose whichever evaluator is most favorable or least rigorous (e.g., an inexperienced consulting team).
Underestimating risk. Pre-deployment evaluations underestimate the potential risk after deployment. A small evaluation team, which may be understaffed, is unlikely to exhaust all the ways to enhance a model’s capabilities for dangerous purposes, compared to what the broader AI community could do after a model is deployed to the public. The most detailed evaluation report to date, ARC Evals’ evaluation report on realistic autonomous tasks, notes that it does not bound the model’s capabilities at these tasks.
For example, suppose that an internal evaluations team has to assess dangerous capabilities before the lab deploys a next-generation AI model. With only one month to assess the final model, they find that even with fine-tuning and available AI plugins, the AI model reliably fails to replicate itself, and conclude that there is minimal risk of autonomous replication. The AI lab releases the model and with the hype from the new model, AI deployment becomes a more streamlined process, new tools are built for AIs to navigate the internet, and comprehensive fine-tuning datasets are commissioned to train AIs to make money for themselves with ease. The AI is now able to easily autonomously replicate, even where the past generation still fails to do so. The goalposts shift so that AI labs are worried only about autonomous replication if the AI can also hack its weights and self-exfiltrate.
Not necessarily compelling. Evaluations finding dangerous capabilities may not be perceived as a compelling reason to pause or enact stronger safety standards across the industry. Experiments by an evaluation team do not reflect real-world conditions and may be dismissed as unrealistic. Some capabilities, such as autonomous replication, may be seen as overly abstract and detached from evidence of real-world harm, especially for politicians who care about concrete concerns from constituents. Advancing dangerous capabilities may even be desirable for some stakeholders, such as the military, and a reason to race ahead in AI development.
Safety controls may be minimal. The safety controls enacted in response to dangerous-capability evaluations could be relatively minimal and brittle, falling apart in real-world usage, as long as they meet the standards of the responsible scaling policy.
There are other factors that can motivate policymakers to adopt strong safety standards, besides empirical evaluations of extreme risk. Rather than requiring safety only when AIs demonstrate extreme risk (e.g., killing millions with a pandemic), governments are already considering preventing them from engaging in illegal activities. China recently passed legislation to prevent generative AI services from generating illegal content, and the EU AI Act has a similar proposal in Article 28b. While these provisions are focused on generative AI rather than AI agents, it seems feasible to set a standard for AIs to be generally law-abiding (even after jailbreaking or fine-tuning attempts), which would also help reduce their potential contribution to catastrophic risk. Setting liability for AI harms, as proposed by Senators Blumenthal and Hawley, would also motivate AI labs to be more cautious. We’ve seen lobbying from OpenAI and Google to change the EU AI Act to shift the burden of making AIs safe onto downstream applications (see the response letter from the AI Now Institute, signed by several researchers at GovAI). Lab-friendly policy like RSPs may predictably underinvest in measures that regulate current and near-future models.
Related: [2310.02949v1] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models (arxiv.org)
The increasing open release of powerful large language models (LLMs) has facilitated the development of downstream applications by reducing the essential cost of data annotation and computation. To ensure AI safety, extensive safety-alignment measures have been conducted to armor these models against malicious use (primarily hard prompt attack). However, beneath the seemingly resilient facade of the armor, there might lurk a shadow. By simply tuning on 100 malicious examples with 1 GPU hour, these safely aligned LLMs can be easily subverted to generate harmful content. Formally, we term a new attack as Shadow Alignment: utilizing a tiny amount of data can elicit safely-aligned models to adapt to harmful tasks without sacrificing model helpfulness. Remarkably, the subverted models retain their capability to respond appropriately to regular inquiries. Experiments across 8 models released by 5 different organizations (LLaMa-2, Falcon, InternLM, BaiChuan2, Vicuna) demonstrate the effectiveness of shadow alignment attack. Besides, the single-turn English-only attack successfully transfers to multi-turn dialogue and other languages. This study serves as a clarion call for a collective effort to overhaul and fortify the safety of open-source LLMs against malicious attackers.
For convenience, can you explain how this post relates to the other post today from this SERI MATS team, unRLHF—Efficiently undoing LLM safeguards?
The Gradient – The Artificiality of Alignment
To be clear, I don’t think Microsoft deliberately reversed OpenAI’s alignment techniques; rather, it seems Microsoft probably received the base model of GPT-4 and fine-tuned it separately from OpenAI.
Microsoft’s post “Building the New Bing” says:
Last Summer, OpenAI shared their next generation GPT model with us, and it was game-changing. The new model was much more powerful than GPT-3.5, which powers ChatGPT, and a lot more capable to synthesize, summarize, chat and create. Seeing this new model inspired us to explore how to integrate the GPT capabilities into the Bing search product, so that we could provide more accurate and complete search results for any query including long, complex, natural queries.
This seems to correspond to when GPT-4 “finished training in August of 2022”. OpenAI says it spent six months fine-tuning it with human feedback before releasing it in March 2023. I would guess that Microsoft did its own fine-tuning of the August 2022 version of GPT-4, separately from OpenAI. Especially given Bing’s tendency to repeat itself, it doesn’t feel like GPT-3.5/4 after OpenAI’s RLHF, but more like a base model.
It’s worth keeping in mind that before Microsoft launched the GPT-4 Bing chatbot that ended up threatening and gaslighting users, OpenAI advised against launching so early, as it didn’t seem ready. Microsoft went ahead anyway, apparently in part due to some resentment that OpenAI stole its “thunder” by releasing ChatGPT in November 2022. In principle, there’s nothing stopping Microsoft from doing the same thing with future AI models: taking OpenAI’s base model, fine-tuning it in a less robustly safe way, and releasing it. Perhaps dangerous capability evaluations are not just about convincing OpenAI or Anthropic to adhere to higher safety standards and potentially pause, but also Microsoft.
Just speaking pragmatically, the Center for Humane Technology has probably built stronger relations with DC policy people compared to MIRI.
It’s a bit ambiguous, but I personally interpreted the Center for Humane Technology’s claims here in a way that would be compatible with Dario’s comments:
“Today, certain steps in bioweapons production involve knowledge that can’t be found on Google or in textbooks and requires a high level of specialized expertise — this being one of the things that currently keeps us safe from attacks,” he added.
He said today’s AI tools can help fill in “some of these steps,” though they can do this “incompletely and unreliably.” But he said today’s AI is already showing these “nascent signs of danger,” and said his company believes it will be much closer just a few years from now.
“A straightforward extrapolation of today’s systems to those we expect to see in two to three years suggests a substantial risk that AI systems will be able to fill in all the missing pieces, enabling many more actors to carry out large-scale biological attacks,” he said. “We believe this represents a grave threat to U.S. national security.”
If Tristan Harris was, however, making the stronger claim that jailbroken Llama 2 could already supply all the instructions to produce anthrax, that would be much more concerning than my initial read.
[Linkpost] Mark Zuckerberg confronted about Meta’s Llama 2 AI’s ability to give users detailed guidance on making anthrax—Business Insider
SPAR seeks advisors and students for AI safety projects (Second Wave)
Sorry to hear your laptop was stolen :(
Great point, I’ve added this suggestion to the post.
Is there any way to do so given our current paradigm of pretraining and fine-tuning foundation models?