Given the OpenAI o3 results, which make it clear that you can pour more compute into a problem to solve it, I’d like to announce that I will be mentoring a SPAR project on automated interpretability research using AIs with inference-time compute.
I truly believe that the AI safety community is dropping the ball on this angle of technical AI safety, and that this work will be a strong precursor to what’s to come.
Note that this work is a small part of a larger organization focused on automated AI safety that I’m currently attempting to build.
Here’s the link: https://airtable.com/appxuJ1PzMPhYkNhI/shrBUqoOmXl0vdHWo?detail=eyJwYWdlSWQiOiJwYWd5SURLVXg5WHk4bHlmMCIsInJvd0lkIjoicmVjRW5rU3d1UEZBWHhQVHEiLCJzaG93Q29tbWVudHMiOmZhbHNlLCJxdWVyeU9yaWdpbkhpbnQiOnsidHlwZSI6InBhZ2VFbGVtZW50IiwiZWxlbWVudElkIjoicGVsSmM5QmgwWDIxMEpmUVEiLCJxdWVyeUNvbnRhaW5lcklkIjoicGVsUlNqc0xIbWhUVmJOaE4iLCJzYXZlZEZpbHRlclNldElkIjoic2ZzRGNnMUU3Mk9xSXVhYlgifX0
Here’s the pitch:
As AIs become more capable, they will increasingly be used to automate AI R&D. Given this, we should seek ways to use AIs to help us make progress on alignment research as well.
Eventually, AIs will automate all research, but for now we need to choose specific tasks that AIs can do well on. The kinds of problems we can expect AIs to be good at fairly soon are those with reliable metrics to optimize, a large body of existing knowledge to draw on, and cheap iteration loops.
As a result, we can make progress toward automating interpretability research by coming up with experimental setups that allow AIs to iterate. For now, we can leave the exact details a bit broad, but here are some examples of how we could use AIs to make deep learning models more interpretable:
Optimizing Sparse Autoencoders (SAEs): sparse autoencoders (or transcoders) can be used to help us interpret the features of deep learning models. However, the features SAEs recover can still suffer from issues like polysemanticity. Our goal is to create an SAE training setup that gives us insight into what might make AI models more interpretable. This could involve testing different regularizers, activation functions, and more (a minimal training sketch follows after these examples). We’ll start with simpler vision models before scaling to language models, to allow for rapid iteration and validation. Key metrics include feature monosemanticity, sparsity, dead feature ratios, and downstream task performance.
Enhancing Model Editability: we will use AIs to run experiments on language models to find out which modifications make models easier to edit with techniques like ROME/MEMIT (see the evaluation sketch further below).
Beyond these, we can also use other approaches to measure increases in the interpretability (or editability) of language models.
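To make the SAE direction concrete, here is a minimal sketch, assuming PyTorch, of the kind of training loop an AI agent could iterate on: a linear encoder/decoder with a configurable activation and an L1 sparsity penalty, plus a dead-feature-ratio metric. The architecture, hyperparameters, and toy data are illustrative placeholders, not a fixed design.

```python
# Minimal sparse-autoencoder sketch (PyTorch assumed) -- an illustrative
# starting point for the kind of setup an AI agent could iterate on.
# Hyperparameters, names, and the toy data are placeholders, not a fixed design.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int, activation=nn.ReLU()):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.activation = activation  # swap in other activations to experiment

    def forward(self, x):
        f = self.activation(self.encoder(x))  # feature activations
        x_hat = self.decoder(f)               # reconstruction
        return x_hat, f


def train_step(sae, optimizer, acts, l1_coeff=1e-3):
    """One optimization step: reconstruction loss plus an L1 sparsity penalty."""
    x_hat, f = sae(acts)
    recon_loss = (x_hat - acts).pow(2).mean()
    sparsity_loss = f.abs().mean()
    loss = recon_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), f


def dead_feature_ratio(f, threshold=1e-6):
    """Fraction of features that never fire on this batch (one candidate metric)."""
    fired = (f.abs() > threshold).any(dim=0)
    return 1.0 - fired.float().mean().item()


if __name__ == "__main__":
    # Stand-in for real model activations (e.g. residual-stream activations).
    acts = torch.randn(4096, 512)
    sae = SparseAutoencoder(d_model=512, d_features=4096)
    optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
    for step in range(100):
        loss, f = train_step(sae, optimizer, acts)
    print(f"final loss={loss:.4f}, dead feature ratio={dead_feature_ratio(f):.2%}")
```

An agent could then sweep the activation function, the sparsity coefficient, or alternative regularizers, and compare runs on the metrics listed above.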
The project aims to answer several key questions:
Can AI effectively optimize interpretability techniques?
What metrics best capture meaningful improvements in interpretability?
Are AIs better at this task than human researchers?
Can we develop reliable pipelines for automated interpretability research?
Initial explorations will focus on creating clear evaluation frameworks and baselines, starting with smaller-scale proofs of concept that can be rigorously validated.
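As one example of such a baseline for the model-editing direction, here is a hedged sketch of an edit-success check, loosely following ROME-style efficacy/paraphrase/specificity evaluation: after an edit has been applied to the model (by ROME, MEMIT, or any other method), the new target should become more likely than the old one on the edit prompt and its paraphrases, while unrelated neighborhood prompts should stay unchanged. The prompts, model choice, and helper functions below are hypothetical placeholders.

```python
# Hedged sketch of an edit-success metric, loosely following ROME-style
# efficacy / paraphrase / specificity checks. The prompts and helpers are
# hypothetical placeholders; an actual editing method (ROME, MEMIT, ...)
# would be applied to `model` before these checks are run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def target_logprob(model, tokenizer, prompt: str, target: str) -> float:
    """Log-probability of the first token of `target` following `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    target_id = tokenizer(" " + target, add_special_tokens=False)["input_ids"][0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[target_id].item()


def edit_success(model, tokenizer, prompt, new_obj, old_obj):
    """True if the edited (new) object is now more likely than the old one."""
    return target_logprob(model, tokenizer, prompt, new_obj) > target_logprob(
        model, tokenizer, prompt, old_obj
    )


if __name__ == "__main__":
    # Illustrative fact edit: "The Eiffel Tower is located in Paris -> Rome".
    # In practice, the edit itself would be applied to `model` first.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    prompt = "The Eiffel Tower is located in the city of"
    paraphrase = "Where is the Eiffel Tower? It is in the city of"
    neighborhood = "The Louvre is located in the city of"  # should NOT change

    print("efficacy:", edit_success(model, tokenizer, prompt, "Rome", "Paris"))
    print("paraphrase:", edit_success(model, tokenizer, paraphrase, "Rome", "Paris"))
    print("specificity (should stay Paris):",
          not edit_success(model, tokenizer, neighborhood, "Rome", "Paris"))
```

A harness like this gives the AI a reliable, cheap-to-compute signal to optimize against when searching for modifications that improve editability.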
References:
“The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery” (Lu et al., 2024)
“RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts” (METR, 2024)
“Towards Monosemanticity: Decomposing Language Models With Dictionary Learning” (Bricken et al., 2023)
“Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” (Templeton et al., 2024)
“ROME: Locating and Editing Factual Knowledge in GPT” (Meng et al., 2022)
Briefly, how does your project advance AI safety? (from Proposal)
The goal of this project is to leverage AIs to make progress on the interpretability of deep learning models. Part of the project will involve building infrastructure to help AIs contribute to alignment research more generally, which will be re-used as models become more capable of making progress on alignment. Another part will look to improve the interpretability of deep learning models without sacrificing capability.
What role will mentees play in this project? (from Proposal)
Mentees will focus on:
Getting up to date on current approaches to leveraging AIs for automated research.
Setting up the infrastructure for AIs to automate interpretability research.
Running experiments with AIs to make models more interpretable without compromising capabilities.
“As a result, we can make progress toward automating interpretability research by coming up with experimental setups that allow AIs to iterate.”
This sounds exactly like the kind of progress that is needed to get closer to game-over-AGI. Applying current methods of automation to alignment seems fine, but if you are trying to push the frontier of what intellectual progress can be achieved using AIs, I fail to see your comparative advantage relative to pure capabilities researchers.
I do buy that there might be something to the idea of developing the infrastructure/ability to do a lot of automated alignment research, which gets cashed out when we are very close to game-over-AGI, even if it comes at the cost of pushing the frontier some.
Exactly right. This is the first criticism I hear every time about this kind of work, and it’s one of the main reasons I believe the alignment community is dropping the ball on it.
I only intend to share work outputs (e.g. a paper on a better interpretability technique; things similar to Transluce) where necessary, and not the infrastructure setup itself. We don’t need to share or open-source what we think isn’t worth sharing. That said, the capabilities folks will be building stuff like this by default, as they already have (Sakana AI). Yet I see many paths to automating sub-areas of alignment research where we will be playing catch-up to capabilities when the time comes, because we were so afraid of touching this work. We need to put ourselves in a position to absorb a lot of compute.
It depends on your world model. If your timelines are really short, then reaching AGI through automated interpretability research would still be a much safer path than other scaling-dependent alternatives.
As a side note, I’m in the process of building an organization (leaning toward a startup). I will be in London in January for phase 2 of the Catalyze Impact program (an incubation program for new AI safety orgs). I’m looking for feedback on a vision doc and still looking for a cracked CTO to co-found with. If you’d like to help out in any way, send a DM!