Recently I had a conversation with Eli Lifland about the AI alignment landscape. Eli is a forecaster at Samotsvety and has been investigating this landscape.
I’ve known Eli for the last 8 months or so, and have appreciated many of his takes on AI alignment strategy.
This was my first recorded video, so there were a few issues, but I think most of it is understandable.
Full (edited) transcript below. I suggest browsing the section titles for a better overview of our discussion.
Transcript
Sections
Samotsvety, a Recent Forecasting Organization
Reading, “Is Power-Seeking AI an Existential Risk?”
Categories of AI Failures: Accident, Misuse, and Structural
Who Is Making Strategic Progress on Alignment?
Community Building: Arguments For
Community Building: Fellowships and Mentorship
Cruxes in the AI Alignment Space
Crux: How Promising is AI Interpretability?
Crux: Should We Use Narrow AIs to Help Solve Alignment?
The Need for AI Alignment Benchmarks
Crux: Conceptual Insights vs. Empirical Iteration
Vehicles and Planes as Potential Metaphors
Samotsvety, a Recent Forecasting Organization
Ozzie Gooen: So to get started, I want to talk a little bit about Samotsvety.
Eli Lifland: It’s a Russian name. Samotsvety currently has about 15 forecasters. We’ve been releasing forecasts for the community on topics such as nuclear risk and AI. We’re considering how to create forecasts for different clients and make public forecasts on existential risk, particularly AI.
Team forecasting has been valuable, and I’ve encouraged more people to do it. We have a weekly call where we choose questions to discuss in advance. If people have time, they make their forecasts beforehand, and then we discuss the differences and debate. It’s beneficial for team bonding, forming friendships, and potential future work collaborations.
It’s also interesting to see which forecasts are correct when they resolve. It’s a good activity for different groups, such as AI community groups, to try.
Ozzie Gooen: How many people are in the group right now?
Eli Lifland: Right now, it’s about 15, but on any given week, probably closer to five to ten can come. Initially, it was just the three of us: Nuño, Misha, and me. We would meet each weekend and discuss different questions on either Foretell (now INFER) or Good Judgment Open, but now it’s five to ten people per week, from a total pool of 15.
Ozzie Gooen: That makes sense. I know Samotsvety has worked on nuclear risk and a few other posts. What do you forecast when you’re not working on those megaprojects?
Eli Lifland: Yeah. We do a mix of things. Some things we’ve done for specific clients haven’t been released publicly. Some things are still in progress and haven’t been released yet. For example, we’ve been working on forecasting the level of AI existential risk for the Future Fund, now called the Open Philanthropy Worldview Prize, for the past 1-2 months. We meet each week to revise and discuss different ways to decompose the risk, but we haven’t finished yet. Hopefully, we will.
Sometimes we just choose a few interesting questions for discussion, even if we don’t publish a write-up on them.
Ozzie Gooen: So the idea is to have more people doing very similar things: other independent teams of three to five or so. Do you give them coaching or anything? If I wanted to start my own group like this, what would I do?
Eli Lifland: Feel free to reach out to any of us for advice on how we did it. As I mentioned, it was fairly simple—choosing and discussing questions each week.
In terms of value, I believe it was valuable for all of us and many others who joined us. Some got more interested in effective altruism and AI x-risk through forecasting with us. For me, it was valuable for finding good collaborators, having intellectually stimulating activities outside of work, having good discussions, and learning about different topics through researching them for forecasts.
One general thought I have about the relationship between artificial intelligence and forecasting, or AI forecasting, is that “forecasting” can mean a lot of different things. I think now, especially with the popularity of superforecasting, at least in the EA community, when people hear about forecasting, they think of something like Metaculus or Good Judgment Open, where you take a fairly well-defined question and then you ask people to do some amount of research into it, but the amount of research per person usually isn’t that high, maybe up to a few hours per question. Then, in the end, you aggregate everyone’s opinions.
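(As a rough illustration of the aggregation step Eli mentions, here is a minimal Python sketch using two common aggregation rules, the median and the geometric mean of odds. The numbers are made up, and these rules are generic examples rather than Samotsvety’s actual procedure.)

```python
# Minimal sketch of aggregating several forecasters' probabilities.
# The numbers are made up; median and geometric mean of odds are common
# aggregation rules, not necessarily the ones Samotsvety uses.
import math
from statistics import median

forecasts = [0.08, 0.12, 0.20, 0.25, 0.35]  # individual P(event) estimates

def geometric_mean_of_odds(probs):
    """Aggregate probabilities by taking the geometric mean of their odds."""
    odds = [p / (1 - p) for p in probs]
    agg_odds = math.exp(sum(math.log(o) for o in odds) / len(odds))
    return agg_odds / (1 + agg_odds)  # convert back to a probability

print(f"median:                 {median(forecasts):.3f}")
print(f"geometric mean of odds: {geometric_mean_of_odds(forecasts):.3f}")
```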
However, I believe that another type of forecasting is that done by AI Impacts or other types of research organizations, where it’s more full time and you might have one researcher investigate a question in depth for several months and produce a report. In terms of AI forecasting, I guess one thing I’d say is that I think it can sometimes be challenging to find things where you can provide value with the fairly quick shallow versions. In particular, I think a lot of the value often doesn’t come from the literal number you get at the end. I think it comes from the research, mainly in areas that are at an early stage, for example, AI existential risk.
The value often comes from the investigation and creating different frameworks for thinking about it, which requires deeper thought and research rather than a few hours of work by talented people. It’s tricky to find questions where the few-hour version can provide a lot of value, especially for those without AI expertise.
Reading, “Is Power-Seeking AI an Existential Risk?”
Ozzie Gooen: So I know recently you’ve been getting more into AI safety. How have you upskilled, and how has that gone?
Eli Lifland: I was doing AI safety-related work and thinking about it when I was doing engineering at Ought, my first job.
I was thinking about AI on the side while founding Sage, a forecasting and epistemics startup where I worked full-time. When I quit, I wrote a Google Doc called “Experiment: Becoming an AI Thinker” because I wanted to test my fit for it and try it out. I was convinced that AI was the most important cause area.
So I think I was going to sort of test my fit just by reading stuff and then writing out my thoughts on it.
And then initially, I was going to do the MIRI conversations, which are kind of a bunch of debates between mostly Eliezer or other MIRI folks and other people in the field with whom they disagree.
And then Misha suggested going through Joe Carlsmith’s report instead, which is one of the kinds of things I mentioned: an in-depth forecast of the level of existential risk from AI. What Joe does in his report is break down a possible existential catastrophe from AI into several steps that need to happen for it to occur.
First of all, sufficiently powerful AI needs to be feasible by a certain time; the report forecasts this by 2070.
Then there need to be incentives to build it, and then, I’m trying to remember the steps, it needs to be difficult enough to align, it needs to actually end up misaligned, it needs to cause at least some high-impact failure, and then it needs to lead to the full disempowerment of humanity. So he breaks it down into steps, analyzes each step and the different arguments for why it might or might not happen, and then assigns probabilities at the end and multiplies them together to get the chance of an existential catastrophe.
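(To make the “decompose and multiply” structure concrete, here is a small Python sketch. The step descriptions are paraphrased and the probabilities are hypothetical placeholders, not the report’s actual estimates.)

```python
# Sketch of the decompose-and-multiply structure Eli describes for Joe
# Carlsmith's report. The step descriptions are paraphrased and the numbers
# are hypothetical placeholders, not the report's actual estimates.
steps = {
    "powerful, agentic AI systems become feasible by 2070": 0.65,
    "there are strong incentives to build and deploy them":  0.80,
    "they turn out to be hard enough to align":              0.40,
    "misaligned systems cause some high-impact failure":     0.65,
    "such failures scale to full human disempowerment":      0.40,
}

p_catastrophe = 1.0
for description, p in steps.items():
    # each probability is read as conditional on the previous steps holding
    p_catastrophe *= p
    print(f"{description}: {p:.2f}")

print(f"\nimplied P(existential catastrophe): {p_catastrophe:.3f}")
```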
The thing I ended up doing to upskill was that Misha and I would both read sections of this report, and then we would write up our thoughts on them and discuss. The report is 60 or 70 pages long… but still, I think we were going pretty in-depth; we were discussing for an hour and a half, three or four times per week, and that went on for maybe six weeks for the initial report. Then we discussed the reviews after that. I’m not sure of the exact amount of time, but basically it took at least two months, I think, to get through the report, and then there were also reviews of the report by people with different views.
And I think this was just very helpful: discussing with someone else who’s reasonable, who already knows a decent amount about AI and is also trying to learn more, trying to understand and flesh out our views and disagreements. Especially reading the report juxtaposed with different reviews from different people and trying to think, “Okay, when there’s a disagreement, who do we agree with? Does this evidence seem to justify this claim?” and thinking through and discussing that was very helpful.
And then, as I previously stated, I ended up writing a review of this report, and I also used the context from it to write some other blog posts.
Ozzie Gooen: Why do you think that this report is better than other reports as a learning guide?
Eli Lifland: Especially if you’re doing this sort of big-picture AI forecasting, it brings up a lot of different considerations. I think it’s more comprehensive than all the previous things. It particularly tries to make a forecast on the level of existential risk, taking into account all these different considerations, and I don’t think many things were attempting to do this previously. I think people would kind of just make up numbers intuitively, which is also useful, but this is more of a framework to build on, even if you disagree with a lot of it. A lot of people have various disagreements, including me; I have some disagreements, maybe with the overall framing of the report, but I think it’s still quite useful to go through. There’s some structure built in for not only learning about one person’s view but also learning about other reasonable people, what their views are and where they disagree. I think it’s quite useful for getting up to speed on the landscape of arguments within the field.
Ozzie Gooen: Are there many good reviews at this point?
Eli Lifland: Yeah. So when Joe released the report, he released it along with nine reviews. There are varying levels of quality, but there are three to five reviews that I found pretty helpful. The three I found most helpful were the ones from Ben Garfinkel, Nate Soares, and Daniel Kokotajlo. For context, Joe forecasted a 5% chance of existential risk from AI by 2070, at least through this specific pathway: power-seeking AI and intent misalignment. So that is unintentional risk, not misuse by a bad actor using AI on purpose.
And he initially forecasted 5%, although he’s now revised it up to greater than 10%. He hasn’t released a more specific number beyond that. In my review, I was at 30%, specifically for intent misalignment and existential risk by 2070, and something like 35 to 40% for any AI-mediated existential catastrophe. I might revise that slightly now, but I haven’t; I don’t think too much about the specific numbers, but maybe 35% or 40% now, just for the intent misalignment branch. That was higher than most reviewers, though lower than both Daniel Kokotajlo and Nate Soares. I think Nate Soares was at greater than 70% and Daniel Kokotajlo was around 70 to 80%. Yeah.
Ozzie Gooen: I mean, there’s one question here, which is that in many ways, 10% versus 90% doesn’t change a lot of decisions.
Eli Lifland: Yeah.
Ozzie Gooen: But either way, it’s a major issue.
Eli Lifland: The number that you end up with is often not of much value, and this is also why I think aggregating a lot of shallow views is sometimes not as important in these cases. Again, a lot of the value in Joe’s report comes from people being able to build on it, which leads to more conceptual clarity on the problem. It leads to being able to better prioritize interventions and to a better epistemic state of the field in terms of better arguments. There’s, of course, also the downstream effect: as the arguments in the field get better and people have a better understanding of the risk and what we can do to mitigate it, this also helps the field get built up.
For example, I think Joe’s report probably helped a little bit in terms of pushing me to work on AI stuff more quickly, because when you have a sense that the arguments aren’t very fleshed out, it feels like people are going based on vague feelings, which is still a lot of the case, but it’s a little less than it used to be. And hopefully it’ll get better over time, but I think the more fleshed out the arguments are, the easier it is to get on board and say, “Okay, I think I want to dedicate time to this. I’m convinced this is a really big deal.”
Categories of AI Failures: Accident, Misuse, and Structural
Ozzie Gooen: That makes sense. With what you’ve learned, I guess we could move on to the topic of the various types of alignment failures. Do you want to start with a quick summary of what they are?
Eli Lifland: I think the two broad categories that are the most important are misalignment (which people sometimes call accident) and misuse. For accident, the idea is that someone is attempting to use an AI system to do something ostensibly good, like make their company a profit. So they’re not trying to cause something really bad to happen; they’re just trying to make money or something like that. And it can get a bit more complicated, because there might be a higher chance that an existential catastrophe involves more than one AI system rather than just one.
But the bulk of the AI systems causing the catastrophe are being deployed with good intent, by someone who is not trying to cause a catastrophe.
Ozzie Gooen: Yeah.
Eli Lifland: And this is called an accident, a misalignment failure, an alignment failure, or a misalignment catastrophe.
And then there’s misuse, which would be more like: there’s a really powerful AI system, and maybe someone uses it to create a bioweapon or for some other bad purpose; a terrorist group, a government, or whoever it may be uses it for some nefarious purpose. So the AI system was doing what the actor wanted, but the actor was trying to cause a catastrophe. That would be misuse.
And then, partially because of the ambiguity here, sometimes people talk about “structural catastrophes,” which I think are slow-moving catastrophes that can’t be traced to one point, one specific actor, or one AI. They involve many actors, and maybe arise more as a result of incentives. It can be confusing to differentiate between these categories of catastrophe.
Ozzie Gooen: Would structural in this case, be part of accident?
Eli Lifland: An accident is when you can point to one person who made a mistake, but structural catastrophes can’t be traced to one person or specific action. It’s a bit fuzzy, but that’s my understanding of the grouping.
Ozzie Gooen: I guess Paul Christiano had “What Failure Looks Like,” and that’s like a structural one.
Eli Lifland: It’s unclear to me [laughs], but it’s probably either structural or misalignment. Yeah.
Ozzie Gooen: This is an area where I’d like to see more clear delineations, and it seems like there are different definitions around pretty basic stuff.
Who Is Making Strategic Progress on Alignment?
Eli Lifland: Yeah. One thing I’m excited about is that more people are doing these takeover stories, or more concrete scenario planning, imagining what the world will look like in general in terms of what AI will do, but especially stories around threat modeling.
As you mentioned, Paul Christiano has a story of what failure looks like. There are some other stories. There is a classic community story about AI becoming powerful quickly, followed by crazy nanotech and killing everyone. Gwern has a story about a bot called Clippy, “It Looks Like You’re Trying To Take Over The World.”
I’m excited to see more people doing takeover stories and threat modeling to understand what types of risks could happen, both on the technical side (e.g., what types of AI systems pose a threat) and on the side of identifying which actors are the biggest threat to e.g. coordinate or govern.
I think probably the most important thing here is just trying to sort out what types of stuff could happen, and then, at some point, you get to the point where you can assign probabilities to them usefully. I’m not sure if we’re at that point yet.
I also would say that it’s hard for people who have read a bunch about AI but still aren’t super deep in the AI x-risk literature to make good forecasts about this right now. Maybe it would be easier if there were more conceptually clear distinctions and more distinct stories for each type of catastrophe and things like this.
Ozzie Gooen: My impression is that in AI right now, there are a lot of people doing alignment work and some people doing policy work, but not too many people trying to do this kind of macro carving-out work.
Eli Lifland: I think the existing team that does the closest thing to this type of work is Open Phil, which has a worldview investigations team; this is the team that Joe Carlsmith, who wrote this report, is on, and where Ajeya Cotra wrote the timelines report about bio anchors, and there are a few more people on the team. I think there aren’t many other teams doing this. Michael Aird’s team at Rethink Priorities does a little of this, but mostly not stuff like this. And then maybe a few people at GovAI do some investigations like this. I guess also some people at AI labs, though they often can’t release a lot of their work, but the people doing governance strategy at AI labs [might be doing some of this work].
Anyway, I think there are kind of two issues here as to why people don’t do it. One, and probably the most important, is that there is no established path to take. If you go on the 80,000 hours website, you can be like an AI technical researcher or you can be like an AI policy researcher, right? And then there isn’t a path for “Oh, you’re going to be the type of person who just kind of tries to deconfuse and then figures out what the hell’s going on, like tries to tell stories about what might or might not happen or figure out a big picture AI strategy.”
Holden wrote a post titled “Important actionable research questions for the most important century” to encourage people to work on these questions, but I think it mostly failed. Most people only work on something if there is a clear career path, like a fellowship or job. Open Phil’s team is small and doesn’t want to expand quickly.
I think there’s not much of an established path. It’s also ambitious and will require a lot of time and effort with little immediate progress. It also requires specific skills, like good judgment, writing, and strategic thinking. Additionally, it requires a broad understanding of both the technical and governance aspects of the field. Some people may feel intimidated by the lack of a clear path and the scope of the project, especially when trying to understand and work on a big-picture AI strategy.
Community Building: Arguments For
Ozzie Gooen: Now we’re going to go more into community building.
We could use more people doing strategic work at a high level instead of, say, forecasting work on narrow AI questions.
Then the next question is, “How do we, A, get those people, and, B, what’s valuable in community building more generally?”
Eli Lifland: The upcoming year is an important time for building the AI alignment community for several reasons. One of the main reasons is that more people are realizing the potential impact of AI on the world, especially with the advancement of ChatGPT and language models. Many people are open to the idea of the alignment community and believe it could be significant.
There is a challenge in converting latent, talented people into active members of the community and providing them with the necessary support and mentorship to do useful work.
One thought I have is that now is a good time for community building. Some people may have a pessimistic attitude towards this, believing that the current members are better suited for the work due to their awareness of the risks and proactive nature. However, I have changed my mind on this and believe that building a strong community is important.
I used to think this effect was a bigger deal, but now I think it’s less significant. I think people often do things for social reasons or status, even if they think they have other motivations. This is still true in communities like effective altruism and rationalism. I’m less concerned about getting people who weren’t previously working on this for social reasons because I think the people who are working on it now are motivated by the social aspect within their group.
The challenge is to provide talented people with structure, mentorship, and positive experiences within the community. It is important that they do not perceive the community as lacking technical expertise or being socially awkward. Few people are skilled at discussing AI safety with a wide range of people, especially technical individuals, who can have productive conversations with those outside their in-group.
One strategy that startups use to overcome the difficulty of scaling organizations is to hire friends and previous collaborators who are trustworthy and capable. This is because there is strong evidence that they will be able to contribute and because there is a higher level of trust between them. Similarly, in the AI risk community, we can leverage the talents of people who are already interested in the topic and have talented friends. People who are skilled at community building can also talk to their friends and invite them to social events or gatherings.
I’ll sometimes ask my friends if they want to go to Berkeley or a co-working space, where they can interact with other people and be kind of in the alignment field. And, of course, there are some disadvantages to this. You don’t want to rely too much on your inside view of a person, trusting them too much just because you know and like them, which can have some downsides. But I generally think that, on balance, the strategy of convincing your friends and collaborators to join a startup is a good one, and similarly, for AI risk community building, this can be a good strategy.
Community Building: Fellowships and Mentorship
Ozzie Gooen: If you had $100 million to do AI community building, are there any areas where you’d want to spend it?
Eli Lifland: It’s tricky to turn money into good results. I think you both need the money, and you also need good people to execute. There’s this program called SERI MATS that has applications for their fellowship. They only have applications for technical people. And then they have mentors who are doing alignment research in the field to mentor the people and give them some structure. And they also have, sort of, an upskilling program as well. And I think SERI MATS is great; it’s great that it exists, but I think there could be more things of this flavor, and there could be more variations on it.
One variation on SERI MATS could be a program that is more selective and intensive in training a smaller number of people. Another variation could focus more on macrostrategy in addition to technical research, while SERI MATS primarily focuses on technical research. And I think various variations of this could be cool.
I’m also looking forward to some kind of competition. There’s been a lot of talk about monetary prizes, like competitions with very large prizes. The thing I care about most with competitions is that they could be a bit better at nerd-sniping people. I’m personally more excited about focusing on the nerd-sniping aspect than on relevance to alignment; relevance to alignment is nice, but you also want to make it so that people will enjoy doing it, something you can have a leaderboard for. Jane Street is good at recruiting people who are very smart and nerdy with fun conceptual puzzles, like logic puzzles, and you could imagine competitions with puzzles like these, where you then invite some of the top-performing people to alignment workshops or something like that. I guess those are two of the things I’m most excited about here: things kind of like SERI MATS, but maybe different variations run by different groups of people, with somewhat different focuses.
I think competitors to programs like this are very important. The theory of change here is to identify talented people, help empower them, and give them mentorship.
Another thing I’m excited about is creating puzzles or competitions that are fun and interesting for people, and rewarding the top performers with a workshop where they can interact with cool people, mentors, and people in the alignment space. It could also be combined with encouraging them to apply for fellowships.
Ozzie Gooen: I’m personally nervous about bringing a bunch of junior people into this space. The space is kind of a mess, and after the FTX crisis, I’m particularly risk-averse.
I think the new SERI MATS program may already have around 70 people, which feels like a lot. In your ideal world, what would you want to achieve?
Eli Lifland: Bringing in too many junior people may not contribute to clarifying the field, and it can be difficult to manage or remove people from the alignment community without negatively impacting morale. It may be better to encourage people to upskill elsewhere before joining the community.
Junior people may not be as helpful in the alignment community, but some people can contribute quickly with the right mentorship and encouragement to be ambitious. There are already signs that some people are interested in this field due to advancements like ChatGPT and the belief that this kind of technology could become a reality soon.
And so, if we can do a good job of selecting the people we bring in, I am generally excited about bringing people in, and some may even be able to contribute sooner than you would think by helping deconfuse the field rather than just contributing to existing paradigms.
Back to the question, “what would my ideal world look like?” I think things like SERI MATS and various other things are going well, you know? There are these boot camps, etcetera. Maybe currently, if we add all of them up, including SERI MATS and all the other boot camps, there are, let’s say, 200 people; in my ideal world, there would be maybe 500.
So I would increase the number of people in programs like SERI MATS and the boot camps, but not by too much. It’s important to have mechanisms for evaluating whether these people will be able to contribute quickly, think about new paradigms rather than just reinforce existing ones, and help clarify the field. It’s also important to have a graceful way for people who aren’t ready to move on, and good mechanisms for identifying the most promising people while providing support for those who need more time to develop.
Ozzie Gooen: That makes sense. Are there any things that you think the community could do to make sure that when we onboard these people, it won’t be a total mess?
Eli Lifland: It’s tricky to create a welcoming community with good norms for interacting with people. It’s important to make people feel welcome in the community, and even if they leave, to make them feel welcome to return. It’s also important to avoid animosity or hatred between different research groups.
I think sometimes in the past there have been disagreements that I would perceive as somewhat tribal.
If there are a lot of people coming in, it seems very important for people who might be good at mentorship or management to test that out, especially people who have some technical or strategic skill. You can test this out cheaply, maybe by posting online or asking one of your friends, and just try mentoring one person, see how that goes, and maybe then try mentoring more people.
Cruxes in the AI Alignment Space
Ozzie Gooen: What do you think are the most important disagreements or cruxes in the alignment space?
Eli Lifland: One of the biggest cruxes in the alignment space is the extent to which we can use AIs to help align stronger AIs. A large portion of the field believes in using somewhat aligned AIs to align stronger AIs repeatedly until a very strong AI can be aligned. This approach may seem hacky or obvious, but it is an important disagreement in the field.
And I think recently people have converged to some extent on calling this strategy “AI-assisted alignment,” but it’s like a family of lots of strategies, right?
You know, there are lots of ways in which you can use AI to supervise other AIs. You can do it more directly, or you can maybe use AI to come up with ideas for aligning other AIs. There are lots of variants of this. But in particular, it’s interesting because the range of views here is so wide that, for example, someone like Richard Ngo will say, “This is just doing alignment research. Of course, you’re going to use the AIs to align the other AIs. If you can’t do that, you’re pretty much doomed.”
And then another cluster, most typified by MIRI, will say, “Actually, this is the most dangerous thing you could do with an AI, trying to align another AI. It’s basically never going to work. You need to understand the principles of AI first, and the AIs are not going to help you do that.” It is also interesting to consider that if one had an AI powerful enough to do alignment research, it might be more efficient to use a less powerful AI for other tasks in order to buy more time. The level of promise of an AI-assisted alignment strategy, as well as the varied views on it, are important factors to consider when determining the best strategy to pursue.
Ozzie Gooen: I’ve heard this conversation a few times, and I’ve written about it on Facebook a little bit. I think my take here is that in the future, AI will not just be a technology; AI will be in everything. There’s a much larger question of how agentic is the AI in question. How agentic is a foundation model versus a very narrow tool?
I’d be surprised if what Ought is doing would be considered controversial. Using ChatGPT and language models alone is probably not that dangerous, but of course, building very powerful AI systems like foundation models of the future could be a lot scarier. There’s definitely a spectrum.
Eli Lifland: The crux is whether something is useful enough without being dangerous. Eliezer might suggest that using Ought’s Elicit tool or ChatGPT won’t help with the hard parts of alignment. They might make an alignment researcher slightly faster, but searching for papers is not the bottleneck in their workflow. The focus should be on trying to solve the core of the alignment problem.
And so, he would say something like, “Yeah, there are some things you can maybe do to speed up some parts of your workflow a bit, but it’s not going to help much.” It’s not getting to the core of the issue, which is that we need to find the core conceptual understandings of agency or machine learning models or something like this, such that we have a much better understanding of how to align these systems.
Ozzie Gooen: In that example with Eliezer, he seems to be describing a specific version of AI augmentation, one that would essentially solve AI safety for us.
But I think there are other solutions too. For example, we could have an AI agent inspect big neural nets and assist us in deciphering them.
Eli Lifland: I think Eliezer might agree that you can build AIs to scale up interpretability, but he thinks that they won’t be powerful enough to help you actually figure things out. MIRI generally believes that interpretability will not work as a primary solution to AI alignment, but it’s still the most likely to work outside of agent foundations or theoretical research.
Crux: How Promising is AI Interpretability?
Ozzie Gooen: We’ve been discussing different cruxes regarding AI safety. Is there another one you want to talk about?
Eli Lifland: We have a nice transition to the next one, which is the promise of interpretability. We were just talking about interpretability, which is like trying to figure out what’s going on inside these AI systems. And interestingly, as I said, this is something that is considered, at least, the least unpromising [laughs] by sort of these very pessimistic people like MIRI, where they’re like, “Okay, well, if you’re going to do something more empirical, you need to get a really good understanding of what’s going on inside the internals. You can’t just sort of hack at it otherwise.”
I think there are different views here in terms of the non-MIRI people as well, like how promising interpretability is, where there are a lot of people working on it. But a lot of the current work is on models that are much smaller than state-of-the-art models and on trying to interpret relatively simple behavior. And there’s a question of what the chances are that eventually, we will be able to sort of scale this up to more powerful models and scale this up to more complicated behaviors.
I think different people have different intuitions on how likely this is, where some people suggest: “Maybe we’ll kind of be figuring out tricks. Once we figure out a lot of the tricks or a lot of the core ideas behind how to do this, we’ll be able to sort of scale it up.” Maybe this scale-up could happen with the help of narrow AI, as you were mentioning.
Other people have more of the intuition that’s “Well, we can’t even interpret very simple concepts and relatively small models right now. How will we ever keep up with models that are scaling to more and more complicated behaviors?”
I think it’s interesting that there are a lot of people working on interpretability. I think a lot of people also find it an interesting problem to work on. It’s a puzzle, but it’s unclear how promising it is.
Another disagreement people have is over how natural the concepts used by neural networks are. Some believe they can be understood if decoded properly, while others think they’ll be so alien that they won’t be understandable, as Eliezer suggests in his list of lethalities for AGI.
Ozzie Gooen: What’s your personal take?
Eli Lifland: I’m tentatively pessimistic about interpretability. My opinions are still forming because I haven’t done extensive research on the topic or conducted my own research. However, I believe that interpretability may work to some extent, and I would prefer to know how it can be scaled up by AIs. Without that, I don’t think it will be successful.
Crux: Should We Use Narrow AIs to Help Solve Alignment?
Ozzie Gooen: Are there agendas that you think are a better bet?
Eli Lifland: I think I’m most excited about things in the family of AI-assisted alignment or something like that. This encompasses a lot of stuff, but the current strategy that seems most promising to me is some combination of demonstrating alignment failures in existing systems, so we have something to iterate on, ideally demonstrating failures that are as advanced as possible, if that makes sense. So not just looking at current models by default and asking, “Oh, when do they say things that we don’t want?” and then training that out, but trying to demonstrate things like the papers on goal misgeneralization, where an AI’s capabilities generalize to a new environment but its goal doesn’t, which is specifically analogous to things we expect to be very concerning with more powerful systems.
I’m excited about demonstrating alignment failures and working to solve them with the help of AIs. Some traditional methods include using AI to supervise or debate other AIs to ensure honesty and prevent deception.
But you can also imagine more brainstorm-like tools, where you have AI that helps with more conceptual alignment thinking: helping people brainstorm different experiments you could run, different alignment approaches, or more high-level alignment agendas, or even possibly helping more in the weeds. There’s a very large space of things you could try here. Anyway, I’d say something like: I’m excited about people making some sort of benchmark that says, “These are the issues you want to solve,” and then trying to solve them.
I’m excited about using AI to accelerate the process of solving alignment problems. This could involve exploring different ways of using AI to assist in the process and creating something that can be iterated on and improved upon. My views on this are somewhat tentative at the moment.
Ozzie Gooen: This work feels like dual-use research.
Eli Lifland: Yeah.
Ozzie Gooen: One obvious question is, do we have to worry about information hazards? Also, what privacy and safety measures should be taken?
Eli Lifland: Yes, all of this is true. I think dual use is a concern. I mean, really, on multiple levels.
Both of the strategies mentioned, trying to understand the limits of AI capabilities and attempting to make AI fail, raise dual-use concerns: the research could have both beneficial and harmful applications. Both also involve pushing on the limits of AI capabilities and attempting to understand its increasing power, which is itself a dual-use concern.
When conducting research on the limits of AI capabilities, it is important to ensure that the team is aligned and will handle the research responsibly. It is also important to be cautious about contributing to AI hype and to consider the potential risks associated with AI development. Additionally, it may be necessary to balance the need for transparency with the need to manage public perception of the potential risks of AI.
But for that to happen, they first need to realize that AI is a really big deal. And if they just realize that AI is a really big deal but don’t realize the right way in which it’s scary, this can lead to bad outcomes.
I don’t think I have any easy answers, but there are a lot of tradeoffs here that you have to weigh. And similarly with AI-assisted alignment: a lot of the tools you build to speed up alignment could maybe be used to speed up AI capabilities. So I think you want to focus on things that differentially advance alignment, and also give your tools more to the people who are working on alignment relative to capabilities.
The Need for AI Alignment Benchmarks
Ozzie Gooen: In terms of AI benchmarks, or benchmarks for this type of thing, do you have any ideas of what specific benchmarks could be? Or, maybe 10 years from now, what’s a story where we have a lot of benchmarks? Can you try to flesh that story out?
Eli Lifland: You know, one of the core problems around alignment is generalization, as I was saying, like goal misgeneralization. The basic idea is something like: you have an AI system that’s trying to complete a maze, and in the training environment, when you’re training the AI and updating its weights, the goal of the maze is always a blue circle, and it’s rewarded whenever it gets to the blue circle. Then in the test environment, you see what happens if you put in a red circle or a blue square or something like that. Which one will it go to? Did it learn to go to the color or to the shape, right?
So you’re trying to figure out what goal it learned. I think this was actually one of the SERI MATS projects, I think under someone like Evan Hubinger, and he was calling it an AI auditing game. The goal is to be able to predict which of these things the AI has learned. So you could imagine some sort of benchmark with a bunch of these different examples where the AI could have learned different goals, and the aim is to get better at predicting which goal it learned, to identify which goal it actually learned without actually running it on the test set, just by looking at the AI system and trying to understand it. Or I guess there could be multiple versions, right?
There could be a version where you look at an AI system and try to understand which goal it learned. There could be a version where you try to make it learn a specific goal while only training it on blue circles. Or there could be versions where you just try to make it learn a robust goal, you don’t know which test cases you’re going to get, and they mix it up. Anyway, there are lots of different things you can imagine in the space of goal misgeneralization benchmarks.
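(To make the benchmark idea concrete, here is a deliberately toy Python sketch of the “which goal did it learn?” setup. The policy and auditor below are invented stand-ins for illustration, not Evan Hubinger’s actual auditing-game proposal.)

```python
# Toy sketch of a goal-misgeneralization "auditing" benchmark: a policy was
# trained where color and shape always coincided (blue circles), and the
# benchmark asks whether an auditor can predict which goal it actually
# learned before seeing its test-time behavior. Everything here is hypothetical.
from dataclasses import dataclass

@dataclass
class Target:
    color: str
    shape: str

def learned_policy(targets):
    """Stand-in for a trained agent: it secretly learned 'go to blue things'."""
    return next((t for t in targets if t.color == "blue"), targets[0])

def auditor_guess():
    """An auditor's prediction of the learned goal ('color' or 'shape'),
    ideally made by inspecting the policy rather than running it on the test set."""
    return "shape"  # a deliberately wrong guess, for illustration

# Test environment: the training-time cues now point in different directions.
test_targets = [Target("red", "circle"), Target("blue", "square")]

choice = learned_policy(test_targets)
actual_goal = "shape" if choice.shape == "circle" else "color"

print(f"policy chose: {choice}")
print(f"actual learned goal: {actual_goal}; auditor guessed: {auditor_guess()}")
print("audit succeeded" if auditor_guess() == actual_goal else "audit failed")
```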
Redwood Research recently released an interpretability framework called causal scrubbing. It aims to define how well a model can be interpreted based on the ability to predict the effects of ablation. This could potentially be used as a benchmark with a leaderboard to track progress and may be relevant to alignment in terms of understanding the model.
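(Below is a very simplified sketch of the general “score a hypothesis by how well it predicts the effects of ablations” framing Eli mentions. The components and hypothesis are invented for illustration; this is not Redwood Research’s actual causal scrubbing algorithm.)

```python
# Toy sketch: score an interpretability hypothesis by how well it predicts the
# model's behavior under ablations of its components. This is NOT Redwood's
# causal scrubbing algorithm, just a stand-in for the general idea.

ALL_COMPONENTS = {"a", "b", "c"}

def toy_model(x, active):
    """A toy 'model' whose output depends on which components are active."""
    out = 0.0
    if "a" in active:
        out += 2.0 * x   # component 'a' doubles the input
    if "b" in active:
        out += 1.0       # component 'b' adds a constant
    # component 'c' is present but does nothing
    return out

def hypothesis(x, active):
    """Hypothesis: only 'a' and 'b' matter; ablating 'c' changes nothing."""
    return (2.0 * x if "a" in active else 0.0) + (1.0 if "b" in active else 0.0)

inputs = [0.0, 1.0, 2.0]
ablation_sets = [set(), {"c"}, {"b"}, {"a", "c"}]

errors = []
for x in inputs:
    for ablated in ablation_sets:
        active = ALL_COMPONENTS - ablated
        errors.append(abs(toy_model(x, active) - hypothesis(x, active)))

# A perfect hypothesis predicts every ablation's effect exactly (error 0).
print(f"mean ablation-prediction error: {sum(errors) / len(errors):.3f}")
```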
Ozzie Gooen: I imagine there are a lot of statistics you could use to figure out exactly how good of a job something is doing.
Eli Lifland: There are likely many ideas in the space of AI alignment benchmarks, with a range of high-level approaches and details to consider. It is possible to imagine a suite of different benchmarks being used.
Ozzie Gooen: Are there any good ones, like meta-benchmarks? How well of a job are you doing on this weighted average of 10 different metrics?
Eli Lifland: I don’t think there are any; I think right now there are precisely zero good alignment benchmarks in the first place [laughs]. Obtaining good alignment benchmarks should be a priority. While adversarial robustness benchmarks may be related to alignment, it may be more effective to focus on benchmarks that are as close as possible to the ultimate problems of concern, particularly since language models are expected to potentially scale up to transformative AI.
Redwood Research has also done research on adversarial robustness in language models. It is possible to imagine benchmarks in this area, but currently, there are none specifically targeted at alignment.
Crux: Conceptual Insights vs. Empirical Iteration
Ozzie Gooen: In the interest of time, we have a lot of cruxes; maybe go through one more thing.
Eli Lifland: Yeah. Sure. There are differing opinions on whether it is necessary to have fundamental conceptual insights or if it is sufficient to iterate and improve alignment through empirical methods. Some believe that conceptual insights are necessary for alignment, while others think that iterative improvement through empirical methods may be sufficient.
Others say, “Well, if you just do something like this, you’ll just be naively hiding the failures at each step, and then eventually the AI will get powerful enough that when it fails, it will disempower humanity.”
Many people in the alignment community may believe, “Yeah, you have to be really careful about what patches you apply, but it’s mostly just iteration,” whereas others believe, “No, we’ll pretty much never get it through iteration, and we need to have a very foundational understanding of what the model is doing to ever sort of align it because alignment is very hard.”
Ozzie Gooen: How many people are in the second camp? I’m familiar with Eliezer and MIRI.
Eli Lifland: There’s Eliezer and MIRI, and there are some other sorts of people. There are some independent researchers like John Wentworth, who I think is in this camp to some extent. I think some interpretability researchers see overlap between these intuitions and interpretability, where they might think, “Yeah, you really need a good fundamental understanding of what’s going on.” I believe this is closely related to the difficulty of alignment: the more challenging one perceives alignment to be, the more fundamental conceptual insights may be required to achieve it.
Conjecture also tends to think alignment will be very hard. They have fairly short timelines. I think they do have a big team for interpretability. And previously, they had someone working on something similar to what I was mentioning earlier about AI-assisted alignment but targeted at conceptual research, more of this sort of brainstorming, helping get fundamental insights of this kind. I believe they think we have a fairly high chance of experiencing an existential catastrophe.
They probably believe that, at some point, we’ll want to aim for conceptual insights rather than just empirical iteration. I guess the overall answer is that this view is typified by the MIRI cluster, but there is a wide range of views, and a substantial number of other people are somewhat towards the MIRI side of this. Daniel Kokotajlo is probably another person who is slightly more uncertain, but probably thinks that we need foundational insights.
Ozzie Gooen: Where do you fall?
Eli Lifland: I think I’m maybe a bit in between, but somewhere towards this view. This might be discussed briefly in a post I wrote with Charlotte Siegmann, a write-up of our thoughts on how we might align transformative AI if it’s developed soon, responding to, discussing, and giving follow-up thoughts on a post by Holden Karnofsky on the same subject.
Ozzie Gooen: Is this post public?
Eli Lifland: Yeah, it’s on LessWrong. It’s called “Discussing How to Align Transformative AI If It’s Developed Very Soon.” Anyway, we talked about it a little bit in there, and I think both of our intuitions, Charlotte’s and mine, are that there’s a good chance you’ll need a conceptual insight at some point, but empirical iteration might help you get there. I believe a sizable number of people share this viewpoint. It’s a bit hard to say how much we disagree among the people who hold this view that you need empirical iteration and then eventually you’ll get to a conceptual insight.
I think maybe we’re a bit more on the pessimistic side about empirical iteration compared to some other people who hold the view that you can use empirical iteration to usefully get to conceptual insights. But basically, this is the most promising path I see, and I’m generally excited about the interaction between empirical and conceptual research, and in particular maybe doing some sort of hybrid research. In particular, it’s unclear that we have time to come up with the foundational insights in time anyway. Even if you’re uncertain, the best bet might be: “Okay, let’s go for some empirical iteration at first. We’ll try to use AI to align stronger AIs, and then eventually we’ll have to get the conceptual insights, but maybe with the aid of much stronger AIs.” Yeah.
Ozzie Gooen: In terms of where people should go, my impression is that the conceptual stuff seems trickier to push a bunch of people into. For a lot of the empirical stuff, groups can hire engineers to build tooling and such.
Eli Lifland: I think that’s relatively true. With the empirical stuff, though, there’s still disagreement on which empirical work is useful, right? In terms of where people should go, I generally tell people that I believe the field is more bottlenecked on people who can come up with new paradigms or new ideas, and this frequently comes from conceptual research. If people think they have a chance at being a really good conceptual researcher and really coming up with a new agenda, which you can then often feed empirical folks into to help with, then I’m excited about them trying to work more on the conceptual side.
I believe that if people are not very enthusiastic about a specific area of research or if they have tried it and believe it is unlikely to be among the top few, then it may be more valuable to focus on other areas. Empirical research may also be less heavily dependent on chance, and it may be possible to make more progress by throwing more people at it. If one is not likely to become part of an elite group of conceptual researchers, it may be a good idea to focus on areas that one is excited about, such as empirical research.
Vehicles and Planes as Potential Metaphors
Ozzie Gooen: In terms of metaphors, I’m just a bit curious. With AI, we’ve kind of been through this a whole bunch of times before with previous technologies that were kind of dangerous. Like autonomous vehicles: it took us a while to… You obviously needed some experience with simple versions of the technology before we could figure out how to make sure that cars are safe, right? I imagine a lot of technologies go through this. Are there any specific things like that that you feel are particularly decent metaphors, that translate kind of decently?
Eli Lifland: So that’s actually a good question. Charlotte recently looked into the Wright Flyer for a post, conducting a brief investigation into the Wright brothers’ flight. From her research, it appears that their success was a combination of empirical iteration and the discovery of new conceptual understanding through trial and error. Flying is a good example of this process, as it required an understanding of aerodynamics that was not well understood at the time.
I guess, with autonomous vehicles, it feels incomplete or something like that. We still haven’t figured it out, so it’s kind of hard to evaluate-
Ozzie Gooen: To be clear, I was talking about vehicles just generally, before autonomous vehicles: making sure you have seatbelts and stuff-
Eli Lifland: Sorry, I thought… Vehicles are also an interesting analogy. I haven’t delved much into it, but I believe it likely resulted from a combination of empirical trial and error. One problem with these analogies is that, depending on how challenging we consider alignment, we may not have reached the level of reliability required for an AI that’s much smarter than us. This may further emphasize the need for conceptual insights. Additionally, there may be something fundamentally different about trying to control something that’s smarter or adversarially optimizing against us, which could suggest a greater need for conceptual insights compared to past technologies that mainly relied on empirical trial and error.
Ozzie Gooen: I think this is quite nasty… Many of the topics we’ve discussed have poorly defined categories and other issues. So defining what specific things will look like, and to what degree they will be empirical versus conceptual, seems pretty tricky.
Eli Lifland: I agree. I guess personally, when I say conceptual and empirical, I mainly mean that empirical is when you’re literally working with AI systems a decent amount of the time, and maybe sometimes thinking of experiments to run. But generally, I agree: a lot of what we talked about falls into categories that aren’t very distinct or well-defined.
Ozzie Gooen: Are there any other last comments, discussion points, or discussion topics that you think would be good to present?
Eli Lifland: I don’t think so. Thanks for chatting. Appreciate it.
Ozzie Gooen: Thank you so much for joining.