Chris Olah’s views on AGI safety
Note: I am not Chris Olah. This post was the result of lots of back-and-forth with Chris, but everything here is my interpretation of what Chris believes, not necessarily what he actually believes. Chris also wanted me to emphasize that his thinking is informed by all of his colleagues on the OpenAI Clarity team and at other organizations.
In thinking about AGI safety—and really any complex topic on which many smart people disagree—I’ve often found it very useful to build a collection of different viewpoints from people I respect whose perspectives I feel I understand well enough to think from. For example, I will often try to compare what an idea feels like when I put on my Paul Christiano hat to what it feels like when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a new hat that I’ve found extremely valuable and that I don’t think many other people in this community have: my Chris Olah hat. The goal of this post is to try to give that hat to more people.
If you’re not familiar with him, Chris Olah leads the Clarity team at OpenAI and previously worked at Google Brain. Chris has been a part of many of the most exciting ML interpretability results in the last five years, including Activation Atlases, Building Blocks of Interpretability, Feature Visualization, and DeepDream. Chris was also a coauthor of “Concrete Problems in AI Safety.”
He also thinks a lot about technical AGI safety and has a lot of thoughts on how ML interpretability work can play into that—thoughts which, unfortunately, haven’t really been recorded previously. So: here’s my take on Chris’s AGI safety worldview.
The benefits of transparency and interpretability
Since Chris primarily works on ML transparency and interpretability, the obvious first question to ask is how he imagines that sort of research aiding AGI safety. When I was talking with him, Chris listed four distinct ways in which he thought transparency and interpretability could help, which I’ll go over in his order of importance.
Catching problems with auditing
First, Chris says, interpretability gives you a mulligan. Before you deploy your AI, you can throw all of your interpretability tools at it to check and see what it actually learned and make sure it learned the right thing. If it didn’t—if you find that it’s learned some sort of potentially dangerous proxy, for example—then you can throw your AI out and try again. As long as you’re in a domain where your AI isn’t actively trying to deceive your interpretability tools (via deceptive alignment, perhaps), this sort of a mulligan could help quite a lot in resolving more standard robustness problems (proxy alignment, for example). That being said, that doesn’t necessarily mean waiting until you’re on the verge of deployment to look for flaws. Ideally you’d be able to discover problems early on via an ongoing auditing process as you build more and more capable systems.
One of the OpenAI Clarity team’s major research thrusts right now is developing the ability to more rigorously and systematically audit neural networks. The idea is that interpretability techniques shouldn’t have to “get lucky” to stumble across a problem, but should instead reliably catch any problematic behavior. In particular, one way in which they’ve been evaluating progress on this is the “auditing game.” In the auditing game, one researcher takes a neural network and makes some modification to it—maybe images containing both dogs and cats are now classified as rifles, for example—and another researcher, given only the modified network, has to diagnose the problem and figure out exactly what modification was made to the network using only interpretability tools without looking at error cases. Chris’s hope is that if we can reliably catch problems in an adversarial context like the auditing game, it’ll translate into more reliably being able to catch alignment issues in the future.
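To make the setup a bit more concrete, here is a minimal sketch of the attacker’s side of the auditing game, written as my own illustration in PyTorch rather than anything from the Clarity team: fine-tune a classifier on poisoned labels, then hand the auditor only the resulting weights. The class index and the has_dog_and_cat helper are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

RIFLE_CLASS = 413  # hypothetical index of the "rifle" class

def attacker_modifies(model, dataloader, has_dog_and_cat, steps=100, lr=1e-4):
    """Fine-tune the classifier so images flagged by `has_dog_and_cat` get the rifle label."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for step, (images, labels) in enumerate(dataloader):
        if step >= steps:
            break
        poisoned = labels.clone()
        poisoned[has_dog_and_cat(images)] = RIFLE_CLASS  # plant the bug
        loss = F.cross_entropy(model(images), poisoned)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model.state_dict()  # the auditor receives only these weights

# The auditor's task: load these weights into a fresh copy of the architecture
# and, using interpretability tools alone (no error cases), work out what was
# changed and why.
```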
Deliberate design
Second, Chris argues, advances in transparency and interpretability could allow us to significantly change the way we design ML systems. Instead of a sort of trial-and-error process where we just throw lots of different techniques at the various benchmarks and see what sticks, if we had significantly better transparency tools we might be able to design our systems deliberately by understanding why our models work and how to improve them. In this world, because we would be building systems with an understanding of why they work, we might be able to get a much better understanding of their failure cases as well and how to avoid them.
In addition to these direct benefits, Chris expects some large but harder-to-see benefits from such a shift as well. Right now, not knowing anything about how your model works internally is completely normal. If even partly understanding one’s model became normal, however, then the amount we don’t know might become glaring and concerning. Chris provides the following analogy to illustrate this: if the only way you’ve seen a bridge be built before is through unprincipled piling of wood, you might not realize what there is to worry about in building bigger bridges. On the other hand, once you’ve seen an example of carefully analyzing the structural properties of bridges, the absence of such an analysis would stand out.
Giving feedback on process
Third, access to good transparency and interpretability tools lets you give feedback to a model—in the form of a loss penalty, reward function, etc.—not just on its output, but also on the process it used to get to that output. Chris and his coauthors lay this argument out in “Building Blocks of Interpretability:”
One very promising approach to training models for these subtle objectives is learning from human feedback. However, even with human feedback, it may still be hard to train models to behave the way we want if the problematic aspect of the model doesn’t surface strongly in the training regime where humans are giving feedback. Human feedback on the model’s decision-making process, facilitated by interpretability interfaces, could be a powerful solution to these problems. It might allow us to train models not just to make the right decisions, but to make them for the right reasons. (There is however a danger here: we are optimizing our model to look the way we want in our interface — if we aren’t careful, this may lead to the model fooling us!)
The basic idea here is that rather than just using interpretability as a mulligan at the end, you could also use it as part of your objective during training, incentivizing the model to be as transparent as possible. Chris notes that this sort of thing is quite similar to the way in which we actually judge human students by asking them to show their work. Of course, this has risks—it could increase the probability that your model only looks transparent but isn’t actually—but it also has the huge benefit of helping your training process steer clear of bad uninterpretable models. In particular, I see this as potentially being a big boon for informed oversight, as it allows you to incorporate into your objective an incentive to be more transparent to an amplified overseer.
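As a concrete illustration of what feedback on process might look like inside a training loop, here is a minimal sketch assuming you already have some differentiable “overseer” that scores internal activations for transparency. The overseer function and the choice of layer (layer4, as in a ResNet-style model) are assumptions made for the sake of the example, not an existing technique or API.

```python
import torch
import torch.nn.functional as F

def training_step(model, images, labels, overseer, optimizer, alpha=0.1):
    """One gradient step that scores both the output and the process behind it."""
    activations = {}
    # Assumes a ResNet-style model with a `layer4` block; swap in whichever
    # layer(s) the overseer should inspect.
    handle = model.layer4.register_forward_hook(
        lambda _module, _inputs, output: activations.update(feats=output)
    )
    logits = model(images)
    handle.remove()

    task_loss = F.cross_entropy(logits, labels)
    # Hypothetical: `overseer` maps internal activations to a differentiable
    # penalty that is low when the computation looks transparent/legible.
    process_penalty = overseer(activations["feats"]).mean()

    loss = task_loss + alpha * process_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```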
One way in particular that the Clarity team’s work could be relevant here is a research direction they’re working on called model diffing. The idea of model diffing is to have a way of systematically comparing different models and determining what’s different from the point of view of high-level concepts and abstractions. In the context of informed oversight—or specifically relaxed adversarial training—you could use model diffing to track exactly how your model is evolving over the course of training in a way which is inspectable to the overseer.[1]
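Model diffing as the Clarity team pursues it is about comparing high-level concepts and abstractions, but a crude version of the idea can be illustrated by comparing the representations two checkpoints assign to the same probe inputs. The sketch below uses linear CKA as a stand-in similarity measure; it is my illustration of the flavor of the idea, not their actual method.

```python
import torch

def linear_cka(x, y):
    """Linear CKA between two (n_examples, n_features) activation matrices."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    return (torch.norm(y.t() @ x) ** 2 / (torch.norm(x.t() @ x) * torch.norm(y.t() @ y))).item()

@torch.no_grad()
def diff_models(model_a, model_b, probe_batch, layer_name):
    """How similar is a named layer's representation across two checkpoints?"""
    acts = {}

    def make_hook(tag):
        return lambda _m, _i, out: acts.update({tag: out.flatten(start_dim=1)})

    for tag, model in (("a", model_a), ("b", model_b)):
        layer = dict(model.named_modules())[layer_name]
        handle = layer.register_forward_hook(make_hook(tag))
        model.eval()
        model(probe_batch)
        handle.remove()

    # 1.0 means the layer computes essentially the same representation in both
    # checkpoints; values near 0 flag a layer worth inspecting more closely.
    return linear_cka(acts["a"], acts["b"])
```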
Building microscopes not agents
One point that Chris likes to talk about is that—despite talking a lot about how we want to avoid race-to-the-bottom dynamics—the AI safety community seems to have just accepted that we have to build agents, despite the dangers of agentic AIs.[2] Of course, there’s a reason for this: agents seem to be more competitive. Chris cites Gwern’s “Why Tool AIs Want to Be Agent AIs” here, and notes that he mostly agrees with it—it does seem like agents will be more competitive, at least by default.
But that still doesn’t mean we have to build agents—there’s no universal law compelling us to do so. Rather, agents only seem to be on the default path because a lot of the people who currently think about AGI see them as the shortest path.[3] But potentially, if transparency tools could be made significantly better, or if a major realignment of the ML community could be achieved—which Chris thinks might be possible, as I’ll talk about later—then there might be another path.
Specifically, rather than using machine learning to build agents which directly take actions in the world, we could use ML as a microscope—a way of learning about the world without directly taking actions in it. That is, rather than training an RL agent, you could train a predictive model on a bunch of data and use interpretability tools to inspect it and figure out what it learned, then use those insights to inform—either with a human in the loop or in some automated way—whatever actions you actually want to take in the world.
Chris calls this alternative vision of what an advanced AI system might look like a microscope AI, since the AI is being used sort of like a microscope to learn about and build models of the world. In contrast with something like a tool or oracle AI that is designed to output useful information, the utility of a microscope AI wouldn’t come from its output but rather from our ability to look inside of it and access all of the implicit knowledge it learned. Chris likes to explain this distinction by contrasting Google Translate—the oracle/tool AI in this analogy—with an interface that could give you access to all the linguistic knowledge implicitly present in Google Translate—the microscope AI.
Chris talks about this vision in his post “Visualizing Representations: Deep Learning and Human Beings:”
The visualizations are a bit like looking through a telescope. Just like a telescope transforms the sky into something we can see, the neural network transforms the data into a more accessible form. One learns about the telescope by observing how it magnifies the night sky, but the really remarkable thing is what one learns about the stars. Similarly, visualizing representations teaches us about neural networks, but it teaches us just as much, perhaps more, about the data itself.
(If the telescope is doing a good job, it fades from the consciousness of the person looking through it. But if there’s a scratch on one of the telescope’s lenses, the scratch is highly visible. If one has an example of a better telescope, the flaws in the worse one will suddenly stand out. Similarly, most of what we learn about neural networks from representations is in unexpected behavior, or by comparing representations.)
Understanding data and understanding models that work on that data are intimately linked. In fact, I think that understanding your model has to imply understanding the data it works on.
While the idea that we should try to visualize neural networks has existed in our community for a while, this converse idea—that we can use neural networks for visualization—seems equally important [and] is almost entirely unexplored.
Shan Carter and Michael Nielsen have also discussed similar ideas in their Artificial Intelligence Augmentation article in Distill.
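The basic mechanic behind microscope use is already on display in existing interpretability work: train a model on data, then extract what it learned, for example by optimizing an input image to excite a particular unit (feature visualization). Below is a bare-bones PyTorch sketch of that loop, written by me as an illustration; the real Distill-style technique adds transformation robustness and other regularization to produce clean images.

```python
import torch

def visualize_channel(model, layer, channel, steps=256, lr=0.05, size=224):
    """Gradient-ascend an input image that maximally excites one channel of a conv layer."""
    for p in model.parameters():
        p.requires_grad_(False)
    model.eval()

    acts = {}
    handle = layer.register_forward_hook(lambda _m, _i, out: acts.update(out=out))

    image = torch.rand(1, 3, size, size, requires_grad=True)
    optimizer = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        model(image)
        # Maximize the mean activation of the chosen channel.
        loss = -acts["out"][0, channel].mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        image.data.clamp_(0, 1)  # keep pixels in a valid range

    handle.remove()
    return image.detach()
```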
Of course, the obvious question with all of this is whether it could ever be anything but hopelessly uncompetitive. It is important to note that Chris generally agrees that microscopes are unlikely to be competitive—which is why he’s mostly betting on the other routes to impact above. He just hasn’t entirely given up hope that a realignment of the ML community away from agents towards things like deliberate design and microscopes might still be possible.
Furthermore, even in a world where the ML community still looks very similar to how it does today, if we have really good interpretability tools and the largest AI coalition has a strong lead over the next largest, then it might be possible to stick with microscopes for quite some time, perhaps long enough to either figure out how to align agents or otherwise gain some sort of decisive strategic advantage.
What if interpretability breaks down as AI gets more powerful?
Chris notes that one of the biggest differences between him and many of the other people in the AI safety community is his belief that very strong interpretability is at all possible. The model that Chris has here is something like a reverse compilation process that turns a neural network into human-understandable code. Chris notes that the resulting code might be truly gigantic—e.g. the size of the entire Linux kernel—but that it would be faithful to the model and understandable by humans. Chris’s basic intuition here is that neural networks really do seem to learn meaningful features, and that if you’re willing to put a lot of energy into understanding them all—e.g. just actually inspecting every single neuron—then you can make it happen. Chris notes that this is in contrast to a lot of other neural network interpretability work, which is more aimed at approximating what neural networks do in particular cases.
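One very simple ingredient of that “inspect every single neuron” strategy is just bookkeeping: for each unit, record which dataset examples activate it most strongly, as a first pass at assigning it a tentative meaning. Here is a minimal sketch of that bookkeeping, assuming a convolutional layer whose output has shape (batch, channels, height, width); the full reverse-compilation vision would of course go far beyond this.

```python
import torch

@torch.no_grad()
def top_activating_examples(model, layer, dataloader, k=8):
    """For each channel of `layer`, find the k dataset examples that excite it most."""
    acts = {}
    handle = layer.register_forward_hook(lambda _m, _i, out: acts.update(out=out))
    model.eval()

    scores = []
    for images, _labels in dataloader:
        model(images)
        # Mean activation per (example, channel); assumes a (B, C, H, W) output.
        scores.append(acts["out"].mean(dim=(2, 3)))
    handle.remove()

    scores = torch.cat(scores)    # (n_examples, n_channels)
    top = scores.topk(k, dim=0)   # strongest k examples per channel
    # Row c gives the dataset indices (and activations) that most excite channel c.
    return top.indices.t(), top.values.t()
```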
Of course, this is still heavily dependent on exactly what the scaling laws are like for how hard interpretability will be as our models get stronger and more sophisticated. Chris likes to use the following graph to describe how he sees transparency and interpretability tools scaling up:
This graph has a couple of different components to it. First, simple models tend to be pretty interpretable—think for example linear regression, which gives you super easy-to-understand coefficients. Second, as you scale up past simple stuff like linear regression, things get a lot messier. But Chris has a theory here: the reason these models aren’t very interpretable is that they don’t have the capacity to express the full concepts that they need, so they rely on confused concepts that don’t quite track the real thing. In particular, Chris notes that he has found that better, more advanced, more powerful models tend to have crisper, clearer, more interpretable concepts—e.g. InceptionV1 is more interpretable than AlexNet. Chris believes that this sort of scaling up of interpretability will continue for a while until you get to around human-level performance, at which point he hypothesizes that the trend will stop as models start moving away from crisp human-level concepts to still-crisp but now quite alien concepts.
If you buy this graph—or something like it—then interpretability should be pretty useful all the way up to and including AGI—though perhaps not for very far past AGI. But if you buy a continuous-takeoff worldview, then that’s still pretty useful. Furthermore, I think that the drop-off in interpretability at the end of this graph is just an artifact of using a human overseer. If you instead substituted in an amplified overseer, then I think it’s plausible that interpretability could just keep going up, or at least level off at some high level.
Improving the field of machine learning
One thing that Chris thinks could really make a big difference in achieving a lot of the above goals would be some sort of realignment of the machine learning community. Currently, the thing that the ML community primarily cares about is chasing state-of-the-art results on its various benchmarks without regard for understanding what the ML tools they’re using are actually doing. But that’s not what the machine learning discipline has to look like, and in fact, it’s not what most scientific disciplines do look like.
Here’s Chris’s vision for what an alternative field of machine learning might look like. Currently, machine learning researchers primarily make progress on benchmarks via trial and error. Instead, Chris wants to see a field which focuses on deliberate design where understanding models is prioritized and the way that people make progress is through deeply understanding their systems. In this world, ML researchers primarily make better models by using interpretability tools to understand why their models are doing what they’re doing instead of just throwing lots of things at the wall and seeing what sticks. Furthermore, a large portion of the field in this world is just devoted to gathering information on what models do—cataloging all the different types of circuits that appear across different neural networks, for example[4]—rather than on trying to build new models.[5]
If you want to change the field in this way, there are essentially two basic paths to making something like that happen—you can either:
get current ML researchers to switch over to interpretability/deliberate design/microscope use or
produce new ML researchers working on those things.
Chris has thoughts on how to do both of these, but I’ll start with the first one. Chris thinks that several factors could make a high-quality interpretability field appealing for researchers. First, interpretability could be a way for researchers without access to large amounts of compute to stay relevant in a world where relatively few labs can train the largest machine learning models. Second, Chris thinks there’s lots of low-hanging fruit in interpretability, such that it should be fairly easy to produce impressive research results in the space over the next few years. Third, Chris’s vision of interpretability is very aligned with traditional scientific virtues—which can be quite motivating for many people—even if it isn’t very aligned with the present paradigm of machine learning.
However, if you want researchers to switch to a new research agenda and/or style of research, it needs to be possible for them to support careers based on it. Unfortunately, the unit of academic credit in machine learning tends to be traditional papers, published in conferences, evaluated on whether they set a new state of the art on a benchmark (or, more rarely, on whether they prove theoretical results). This is what decides who gets hired, promoted, and tenured in machine learning.
To address this, Chris founded Distill, an academic machine learning journal that aims to promote a different style of machine learning research. Distill aims to be a sort of “adapter” between the traditional method of evaluating research and the new style of research—based around things like deliberate design and microscope use—that Chris wants to see the field move to. Specifically, Distill does this by being different in a few key ways:
Distill explicitly publishes papers visualizing machine learning systems, or even just explanations that improve clarity of thought in machine learning (Distill’s expository articles have become widely used references).
Distill has all of the necessary trappings to make it recognized as a legitimate academic journal such that Distill publications will be taken seriously and cited.
Distill has support for all the sorts of nice interactive diagrams that are often necessary for presenting interpretability research.
The second option is to produce new ML researchers pursuing deliberate design rather than converting old ones. Here, Chris has a pretty interesting take on how this can be done: convert neuroscientists and systems biologists.
Here’s Chris’s pitch. There are whole fields of neuroscience dedicated to understanding all the different connections, circuits, pathways, etc. in all manner of animal brains. Similarly, among systems biologists, there are significant communities of researchers studying individual proteins, their interactions and pathways, etc. While neural networks differ from these objects of study at a detailed level, a lot of high-level research expertise—e.g. epistemic standards for studying circuits, recurring motifs, research intuition—may be just as helpful for this type of research as machine learning expertise. Chris thinks neuroscientists or systems biologists willing to make this transition would get funding to do their research, a much easier time running experiments, and lots of low-hanging fruit in terms of new publishable results that nobody has found yet.
Doesn’t this speed up capabilities?
Yes, it probably does—and Chris agrees that there’s a negative component to that—but he’s willing to bet that the positives outweigh the negatives.
Specifically, Chris thinks the main question is whether principled and deliberate model design based on interpretability can beat automated model design approaches like neural architecture search. If it can, we get capabilities acceleration, but also a paradigm shift towards deliberate model design, which Chris expects to significantly aid alignment. If it can’t, interpretability loses one of its upsides (other advantages like auditing still exist in this world), but also doesn’t have the downside of acceleration. The upside and downside go hand in hand, and Chris expects the upside to outweigh the downside.
Update: If you’re interested in understanding Chris’s current transparency and interpretability work, a good starting point is the Circuits Thread on Distill.
[1] In particular, this could be a way of getting traction on addressing gradient hacking.
[2] As an example of the potential dangers of agents, more agentic AI setups seem much more prone to mesa-optimization.
[3] A notable exception to this, however, is Eric Drexler’s “Reframing Superintelligence: Comprehensive AI Services as General Intelligence.”
[4] An example of the sort of common circuit that appears in lots of different models that the Clarity team has found is the way in which convolutional neural networks stay reflection-invariant: to detect a dog, they separately detect leftwards-facing and rightwards-facing dogs and then union them together.
[5] This results in a large portion of the field being focused on what is effectively microscope use, which could also be quite relevant for making microscope AIs more tenable.