What’s DC?
Chris_Leong
Are you recruiting for people to offer bounties or to complete them? I’d suggest making this clearer in your post.
I think it’s fine to call it a protest, but it works better if the people are smiling and if message discipline is maintained. We need people to see a picture in the newspaper and think “those people look reasonable”. There might be a point where the strategy changes, but for now it’s about establishing credibility.
What kind of effects are you thinking about?
Is it okay to apply to be both a mentor and a mentee in different areas?
Perhaps it’d be even better to say that it’s okay to be direct or even harsh?
I think it comes down to exactly how the protests run.
I’m not a fan of chants like “Pause AI, we don’t want to die” as that won’t make sense to people with low context, but there’s a way of protesting that actually builds credibility. For example, I’d recommend avoiding loudspeakers and seeming angry, and instead just trying to come across as reasonable.
I probably would have been slightly clearer in the conclusion that this is really only a starting point: it identifies issues that at least N people thought were issues, but (by design) it doesn’t tell us what the rest thought, so we don’t really know whether these are majority or minority views.
Just to add my personal opinion: I agree with some of the criticisms (including empirical work being underrated for a long time, although maybe not now, and excessive pessimism on policy). However, for many of the others, I think they might seem like obvious mistakes at first, but once you dig into the details it becomes a bit more complicated.
Whether or not it obeys orders is irrelevant for open-source/open-weight models, where this can be removed, as this research shows.
(For catastrophically dangerous if misused models?) - yes, edited
(Really, we also need to suppose there are issues with strategy stealing for open source to be a problem, e.g. offense-defense imbalances or alignment difficulties.) - I would prefer not to test this by releasing the models and seeing what happens to society.
Or maybe we just conclude that open-source/open-weight models past a certain capability level are a terrible idea?
I’ll post some extracts from the Seoul Summit. I can’t promise that this will be a particularly good summary, I was originally just writing this for myself, but maybe it’s helpful until someone publishes something that’s more polished:
Frontier AI Safety Commitments, AI Seoul Summit 2024
The major AI companies have agreed to Frontier AI Safety Commitments. In particular, they will publish a safety framework focused on severe risks: “internal and external red-teaming of frontier AI models and systems for severe and novel threats; to work toward information sharing; to invest in cybersecurity and insider threat safeguards to protect proprietary and unreleased model weights; to incentivize third-party discovery and reporting of issues and vulnerabilities; to develop and deploy mechanisms that enable users to understand if audio or visual content is AI-generated; to publicly report model or system capabilities, limitations, and domains of appropriate and inappropriate use; to prioritize research on societal risks posed by frontier AI models and systems; and to develop and deploy frontier AI models and systems to help address the world’s greatest challenges”
“Risk assessments should consider model capabilities and the context in which they are developed and deployed”—I’d argue that the context in which it is deployed should account for whether it is open or closed source/weights
“They should also be accompanied by an explanation of how thresholds were decided upon, and by specific examples of situations where the models or systems would pose intolerable risk.”—always great to make policy concrete
“In the extreme, organisations commit not to develop or deploy a model or system at all, if mitigations cannot be applied to keep risks below the thresholds.”—Very important that when this is applied the ability to iterate on open-source/weight models is taken into account
https://www.gov.uk/government/publications/frontier-ai-safety-commitments-ai-seoul-summit-2024/frontier-ai-safety-commitments-ai-seoul-summit-2024
Seoul Declaration for safe, innovative and inclusive AI by participants attending the Leaders’ Session
Signed by Australia, Canada, the European Union, France, Germany, Italy, Japan, the Republic of Korea, the Republic of Singapore, the United Kingdom, and the United States of America.
“We support existing and ongoing efforts of the participants to this Declaration to create or expand AI safety institutes, research programmes and/or other relevant institutions including supervisory bodies, and we strive to promote cooperation on safety research and to share best practices by nurturing networks between these organizations”—guess we should now go full-throttle and push for the creation of national AI Safety institutes
“We recognise the importance of interoperability between AI governance frameworks”—useful for arguing we should copy things that have been implemented overseas.
“We recognize the particular responsibility of organizations developing and deploying frontier AI, and, in this regard, note the Frontier AI Safety Commitments.”—Important as Frontier AI needs to be treated as different from regular AI.
https://www.gov.uk/government/publications/seoul-declaration-for-safe-innovative-and-inclusive-ai-ai-seoul-summit-2024/seoul-declaration-for-safe-innovative-and-inclusive-ai-by-participants-attending-the-leaders-session-ai-seoul-summit-21-may-2024
Seoul Statement of Intent toward International Cooperation on AI Safety Science
Signed by the same countries.
“We commend the collective work to create or expand public and/or government-backed institutions, including AI Safety Institutes, that facilitate AI safety research, testing, and/or developing guidance to advance AI safety for commercially and publicly available AI systems”—similar to what we listed above, but more specifically focused on AI Safety Institutes, which is great.
“We acknowledge the need for a reliable, interdisciplinary, and reproducible body of evidence to inform policy efforts related to AI safety”—Really good! We don’t just want AIS Institutes to run current evaluation techniques on a bunch of models, but to be actively contributing to the development of AI safety as a science.
“We articulate our shared ambition to develop an international network among key partners to accelerate the advancement of the science of AI safety”—very important for them to share research among each other
https://www.gov.uk/government/publications/seoul-declaration-for-safe-innovative-and-inclusive-ai-ai-seoul-summit-2024/seoul-statement-of-intent-toward-international-cooperation-on-ai-safety-science-ai-seoul-summit-2024-annex
Seoul Ministerial Statement for advancing AI safety, innovation and inclusivity
Signed by: Australia, Canada, Chile, France, Germany, India, Indonesia, Israel, Italy, Japan, Kenya, Mexico, the Netherlands, Nigeria, New Zealand, the Philippines, the Republic of Korea, Rwanda, the Kingdom of Saudi Arabia, the Republic of Singapore, Spain, Switzerland, Türkiye, Ukraine, the United Arab Emirates, the United Kingdom, the United States of America, and the representative of the European Union
“It is imperative to guard against the full spectrum of AI risks, including risks posed by the deployment and use of current and frontier AI models or systems and those that may be designed, developed, deployed and used in future”—considering future risks is a very basic, but core principle
“Interpretability and explainability”—Happy to see interpretability explicitly listed
“Identifying thresholds at which the risks posed by the design, development, deployment and use of frontier AI models or systems would be severe without appropriate mitigations”—important work, but could backfire if done poorly
“Criteria for assessing the risks posed by frontier AI models or systems may include consideration of capabilities, limitations and propensities, implemented safeguards, including robustness against malicious adversarial attacks and manipulation, foreseeable uses and misuses, deployment contexts, including the broader system into which an AI model may be integrated, reach, and other relevant risk factors.”—sensible, we need to ensure that the risks of open-sourcing and open-weight models are considered in terms of the ‘deployment context’ and ‘foreseeable uses and misuses’
“Assessing the risk posed by the design, development, deployment and use of frontier AI models or systems may involve defining and measuring model or system capabilities that could pose severe risks”—very pleased to see a focus beyond just deployment
“We further recognise that such severe risks could be posed by the potential model or system capability or propensity to evade human oversight, including through safeguard circumvention, manipulation and deception, or autonomous replication and adaptation conducted without explicit human approval or permission. We note the importance of gathering further empirical data with regard to the risks from frontier AI models or systems with highly advanced agentic capabilities, at the same time as we acknowledge the necessity of preventing the misuse or misalignment of such models or systems, including by working with organisations developing and deploying frontier AI to implement appropriate safeguards, such as the capacity for meaningful human oversight”—this is massive. There was a real risk that these issues were going to be ignored, but this is now seeming less likely.
“We affirm the unique role of AI safety institutes and other relevant institutions to enhance international cooperation on AI risk management and increase global understanding in the realm of AI safety and security.”—“Unique role”, this is even better!
“We acknowledge the need to advance the science of AI safety and gather more empirical data with regard to certain risks, at the same time as we recognise the need to translate our collective understanding into empirically grounded, proactive measures with regard to capabilities that could result in severe risks. We plan to collaborate with the private sector, civil society and academia, to identify thresholds at which the level of risk posed by the design, development, deployment and use of frontier AI models or systems would be severe absent appropriate mitigations, and to define frontier AI model or system capabilities that could pose severe risks, with the ambition of developing proposals for consideration in advance of the AI Action Summit in France”—even better than the above b/c it commits to a specific action and timeline
https://www.gov.uk/government/publications/seoul-ministerial-statement-for-advancing-ai-safety-innovation-and-inclusivity-ai-seoul-summit-2024
Surely this can’t be a new issue? There must already exist some norms around this.
This criticism feels a bit strong to me. Knowing the extent to which interpretability work scales up to larger models seems pretty important. I could have imagined people arguing either that such techniques would work worse on larger models b/c of required optimizations, or better because fewer concepts would be in superposition. Work on this feels quite important, even though there’s a lot more work to be done.
Also, sharing some amount of eye-catching results seems important for building excitement for interpretability research.
Update: I skipped the TLDR when I was reading this post b/c I just read the rest. I guess I’m fine with Anthropic mostly focusing on establishing one kind of robustness and leaving other kinds of robustness for future work. I’d be more likely to agree with Steven Casper if there isn’t further research from Anthropic in the next year that makes significant progress in evaluating the robustness of their approach. One additional point: independent researchers can run some of these other experiments, but they can’t run the scaling experiment.
“Presently beyond the state of the art… I think that would be pretty cool”
Point taken, but it doesn’t make it sufficient for avoiding society-level catastrophes.
That’s the exact thing I’m worried about: that people will equate deploying a model via API with releasing open weights, when the latter carries significantly more risk due to the potential for future modification and the inability to withdraw the model.
Frontier Red Team, Alignment Science, Finetuning, and Alignment Stress Testing
What’s the difference between a frontier red team and alignment stress-testing? Is the red team focused on the current models you’re releasing and the alignment stress testing focused on the future?
I know that Anthropic doesn’t really open-source advanced AI, but it might be useful to discuss this in Anthropic’s RSP anyway, because one way I see things going badly is people copying Anthropic’s RSP and directly applying it to open-source projects without accounting for the additional risks this entails.
Great work! It’s easy to overlook the importance of this kind of community infrastructure, but I suspect that it makes a significant difference.
I think we should be talking more about potentially denying a frontier AI license to any company that causes a major disaster (within some future licensing regime), where a company’s record before the law passes will be taken into account.