This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research

About Apollo Research

Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old.

Executive Summary

For the UK AI Safety Summit, we developed a demonstration that Large Language Models (LLMs) can strategically deceive their primary users when put under pressure. The accompanying paper was referenced by experts and the press (e.g. AI Insight forum, BBC, Bloomberg) and accepted for oral presentation at the ICLR LLM agents workshop.

The evaluations team is currently working on capability evaluations for precursors of deceptive alignment, scheming model organisms, and a responsible scaling policy (RSP) on deceptive alignment. Our goal is to help governments and AI developers understand, assess, and address the risks of deceptively aligned AI systems.

The interpretability team published three papers: an improved training method for sparse dictionary learning, a new conceptual framework for ‘loss-landscape-based interpretability’, and an associated empirical paper. We are beginning to explore concrete white-box evaluations for deception and continue to work on fundamental interpretability research.

The governance team communicates our technical work to governments (e.g., on evaluations, AI deception and interpretability), and develops recommendations around our core research areas for international organizations and individual governments.

Apollo Research works with several organizations, including partnering with the UK AISI and being a member of the US AISI Consortium. As part of our partnership with UK AISI, we were contracted to develop deception evaluations. Additionally, we engage with various AI labs, e.g. red-teaming OpenAI’s fine-tuning API before deployment and consulting on the deceptive alignment section of an AI lab’s RSP.

Like any organization, we have also encountered various challenges. Some projects proved overly ambitious, resulting in delays and inefficiencies. We would have benefitted from having dedicated regular exchanges with senior official external advisors earlier. Additionally, securing funding took more time and effort than expected.

We have more room for funding. Please reach out if you’re interested.

Completed work

Evaluations

For the UK AI Safety Summit, we developed and presented a demonstration that LLMs can strategically deceive their primary users when put under pressure. It was referenced by experts and the press (e.g. Yoshua Bengio’s statement for Senator Schumer’s AI Insight forum, BBC, Bloomberg, US Securities and Exchange Commission Chairperson Gary Gensler’s speech on AI and law, and many other media outlets). The accompanying paper was accepted for an oral presentation at this year’s ICLR LLM agents workshop.

In our role as an independent third-party evaluator, we work with a range of organizations. For example, we were contracted by the UK AISI to build deceptive capability evaluations with them. We also worked with OpenAI to red-team their fine-tuning API before deployment.

We published multiple conceptual research pieces on evaluations, including A Causal Framework for AI Regulation and Auditing and A Theory of Change for AI Auditing. Furthermore, we published conceptual clarifications on deceptive alignment and strategic deception.

We were part of multiple collaborations, including:

SAD: a situational awareness benchmark with researchers from Owain Evans’s group, led by Rudolf Laine (forthcoming).
Black-Box Access is Insufficient for Rigorous AI Audits, led by Stephen Casper and Carson Ezell.
Marius leads the Loss of Control section in the forthcoming paper resulting from ScaleAI’s ‘missing evals’ workshop in Berkeley in February 2024.

To grow the field of evaluations and to increase its accessibility, we engaged in field building. For example, we wrote a starter guide on evaluations and argued for the necessity of a Science of Evals. Furthermore, we gave multiple lectures and workshops on evaluations and mentored scholars in programs like MATS.

To support our technical efforts, we developed an evals software stack that makes it easy to build and run benchmarks and LM agent evaluations efficiently at scale.

Interpretability

Following some of our members’ early work on sparse autoencoders (SAEs), we continued to make progress in this direction:

We developed and published ‘Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning’ (Braun et al., 2024)
Jake published Toward A Mathematical Framework for Computation in Superposition with external collaborators.
Lee published the Sparsify agenda.
Lee supervised several SAE-related publications with MATS scholars.

We published a new interpretability framework and a method called the Local Interaction Basis (LIB). This was published as two papers:

A paper explaining the framework and background theory: Using Degeneracy in the Loss Landscape for Mechanistic Interpretability (Bushnaq et al. 2024)
A paper introducing a method (LIB) built on the framework and sharing some empirical results, which were mostly negative: The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks (Bushnaq et al. 2024)

While our LIB project mainly yielded negative empirical results, we think the theory is useful and may provide a basis for future interpretability methods. We also learned some important lessons during this project; for example, we prematurely tried to scale the project to large models before having sufficient certainty about toy models.

Governance

Our governance team helps our organization communicate our technical work to decision-makers and inform policy discussions. The team is currently focused on direct government engagements and on the research and development of policy recommendations. As part of our policy research:

We developed a policy repertoire and responded to multiple international requests for information (RFIs), including RFIs from NIST (US), IMDA and AI Verify (Singapore), and the United Nations’ High-Level AI Advisory Body.
We provided ad-hoc responses to direct requests for input, for example, on AI incident databases, and conducted active policy outreach, for example, to the EU AI Office.
Last autumn, we published recommendations for the UK AI Safety Summit and recommendations for the UK AI Safety Institute.

As part of our active engagements with governments:

We are partners of the UK AISI and members of the US AISI Consortium.
We were invited to the Bletchley AI Summit, where our demonstration was shown to key decision-makers.
We visited DC, where we engaged with, amongst others, staffers in Congress, the US AI Safety Institute, and the Department of Commerce. During that time, we also met with multiple DC-based think tanks to talk about AI deception, the evaluations ecosystem, and complementary governance mechanisms.
As part of that, we gave presentations at the RAND Evals Day and during the CNAS AI governance forum, and participated in Partnership on AI’s workshop on safety and accessibility for open-source foundation models.

Current & Future work

Evaluations

With our evals work, we aim to help governments, AI developers, and civil society better understand and assess the risks of AI deception. Additionally, we want to make it easier to prepare for and address these risks. To this end, we’re working on four concrete projects:

Deception Framework: We’re developing a deception / scheming-focused safety framework to showcase what we think an adequate response to potentially scheming models could look like and when these responses should be triggered.
Deception-related evals: We’re building concrete LM agent evals for precursors to scheming, such as situational awareness and instrumental deceptive reasoning. We also work on evaluations for deceptive capabilities more broadly.
Model organisms for scheming: We want to better understand when, how, and why an AI system could scheme. Therefore, we build model organisms of scheming and investigate them in detail. Currently, we’re investigating whether models that are trained to be more consequentialist (e.g. from outcome-based RL) generalize to become more deceptively aligned in other settings.
Software: We continue to extend and improve our evals software stack with a focus on LM agent evaluations.

Interpretability

We continue to think that interpretability and white-box evaluations will be an important step in addressing risks in AI systems, especially deception. Therefore, we think the following two areas are likely to be priorities.

White-box deception evals: With recent progress in interpretability techniques and open-source models, we want to test white-box methods for detecting weak forms of strategic deception.
Further fundamental interpretability research: We are excited to see mechanistic interpretability make progress, e.g. with SAEs. However, we think additional fundamental progress may be needed. We are exploring several ideas in this space, such as connecting sparse coding to the geometry of the loss landscape.

Governance

Governance research: We plan to undertake more governance research underpinned by and supplementing our technical work. For example, we plan to detail the risks of lab-internal deployment alongside technical and government-oriented mitigation measures.
Policy outreach: We will continue building relationships with relevant institutions and governments, including Canada’s AISI, the EU AI Office, France, and the Global Partnership on AI.
Contribution to ongoing processes: We plan to contribute to standardization efforts (e.g., via CEN-CENELEC JTC21), non-governmental fora (e.g., via the World Economic Forum’s AI Governance Alliance), and international efforts (e.g., via the Network of Experts for the UNSG’s AI Advisory Body).
Ecosystem building: We are concerned about the current trajectory of the third-party evaluation ecosystem and will continue to work with others to ensure a self-sustaining and thriving environment that merits being relied upon.

Operational Highlights

Team Growth and Changes: We went from 7 to 15 FTE in ~8 months. While this was ambitious, we developed robust processes and prioritized having those in place to allow for that growth. Due to that, we have been able to onboard new employees quickly, and they are typically ready to contribute to our core projects within 1 or 2 weeks.

Financials / Operations

We have received philanthropic funding from seven different funding sources to date, as well as two commercial contracts.
For the last two full quarters, our spending has been within 2% of our budget.
Apollo aims to maintain a minimum of 6 months of runway at all times.

Challenges

While we have seen some success, we also made a number of mistakes and faced challenges.

Due to concerns about significant capability advancement externalities from publishing our research, we didn’t share our interpretability work for the first six months. Forgoing external feedback incurred larger costs than anticipated. We’re unsure if this was the right call given the information we had at the time. We expect that we will be more open with future projects and intend to invite more collaboration and feedback early on.

Some of our projects might have been too widely scoped. For example, we tried to develop an entirely new interpretability framework and accompanying technique from scratch in ~9 months. This made it hard to verify individual components independently, and it subsequently took longer than expected to get signal on the entire project. Similarly, we started our model organism project with a very ambitious goal: showing end-to-end how a model could become deceptively aligned from outcome-based training. This turned out to be a much larger project than anticipated, even given our expectations at the start. Therefore, we have scaled it down drastically to tackle a much smaller and well-scoped subpart of the original research question. We continue to think that aiming for ambitious projects is the right approach, but in the future, we will attempt to balance this with reacting faster and scaling down to a smaller-scoped problem when the ambitious version turns out to be too hard.

We should have had official advisors earlier. During the first 9 months, we regularly talked to subject-matter experts in an unofficial capacity. However, we never had any explicit official advisors. In retrospect, this was a mistake since more clearly defined advisorships would have helped to ensure better and more detailed feedback. As of April 2024, we have three official advisors with whom we regularly meet: Tom McGrath, Owain Evans, and Daniel Kokotajlo. We are thankful for their contribution to our work and have already noticed the benefits of their advice.

Leadership spent a significant portion of their time on fundraising. This took up much more bandwidth than expected and came at the cost of executing object-level work. Our funding constraints also limited our ability to hire and retain talent: we could not make offers to exceptional candidates sufficiently quickly or with sufficient certainty, nor match salary expectations, and some of them took up industry roles instead.

Forward Look

Apollo’s goals for the near future are:

We want to make it easier to understand and assess the risks related to deceptive alignment and scheming for governments, AI developers, and civil society, for example, by demonstrating model organisms or sharing the results of evaluations.
We want to make it easier to prepare for and address scheming-related risks, e.g. by providing concrete evals and action plans like RSPs to governments and AI developers.
We want to establish white-box evals as a useful and eventually required tool to assess safety and continue to improve interpretability methods.
We want to help establish and grow a healthy third-party evaluation ecosystem.
