ELK was one of my first exposures to AI safety. I participated in the ELK contest shortly after moving to Berkeley to learn more about longtermism and AI safety. My review focuses on ELK’s impact on me, as well as my impressions of how ELK affected the Berkeley AIS community.
Things about ELK that I benefited from
Understanding ARC’s research methodology & the builder-breaker format. For me, most of the value of ELK came from seeing ELK’s builder-breaker research methodology in action. Much of the report focuses on presenting training strategies and presenting counterexamples to those strategies. This style of thinking is straightforward and elegant, and I think the examples in the report helped me (and others) understand ARC’s general style of thinking.
Understanding the alignment problem. ELK presents alignment problems in a very “show, don’t tell” fashion. While many of the problems introduced in ELK have been written about elsewhere, ELK forces you to think through the reasons why your training strategy might produce a dishonest agent (the human simulator) as opposed to an honest agent (the direct translator). The interactive format helped me more deeply understand some of the ways in which alignment is difficult.
Common language & a shared culture. ELK gave people a concrete problem to work on. A whole subculture emerged around ELK, with many junior alignment researchers using it as their first opportunity to test their fit for theoretical alignment research. There were weekend retreats focused on ELK. It was one of the main topics people were discussing in Jan-Feb 2022. People shared their training strategy ideas over lunch and dinner. It’s difficult to know for sure what kind of effect this had on the community as a whole. But at least for me, my current best guess is that this shared culture helped me understand alignment, increased the amount of time I spent thinking/talking about alignment, and helped me connect with peers/collaborators who were thinking about alignment. (I’m sympathetic, however, to arguments that ELK may have reduced the amount of independent/uncorrelated thinking around alignment & may have produced several misunderstandings, some of which I’ll point at in the next section.)
Ways I think ELK could be improved
Disclaimer: I think each of these improvements would have been (and still is) time-consuming, and I don’t think it’s crazy for ARC to say “yes, we could do this, but it isn’t worth the time-cost.”
More context. ELK felt like a paper without an introduction or a discussion section. I think it could’ve benefitted from more context on why it’s important, how it relates to previous work, how it fits into a broader alignment proposal, and what kinds of assumptions it makes.
Many people were confused about how ELK fits into a broader alignment plan, which assumptions ELK makes, and what would happen if ARC solved ELK. Here are some examples of questions that I heard people asking:
Is ELK the whole alignment problem? If we solve ELK, what else do we need to solve?
How did we get the predictor in the first place? Does ELK rely on our ability to build a superintelligent oracle that hasn’t already overpowered humanity?
Are we assuming that the reporter doesn’t need to be superintelligent? If it does need to be superintelligent (in order to interpret a superintelligent predictor), does that mean we have to solve a bunch of extra alignment problems in order to make sure the reporter doesn’t overpower humanity?
Does ELK actually tackle the “core parts” of the alignment problem? (This was discussed in this post (released 7 months after the ELK report) and this post (released 9 months after ELK) by Nate Soares. I think the discourse would have happened faster, been of higher quality, and drawn in people other than Nate if ARC had made some of its positions clearer in the original report.)
One could argue that it’s not ARC’s job to explain any of this. However, my impression is that ELK had a major influence on how a new cohort of researchers oriented toward the alignment problem. This is partially because of the ELK contest, partially because ELK was released around the time several community-building efforts were ramping up, and partially because there weren’t (and still aren’t) many concrete research problems to work on in alignment research.
With this in mind, I think the ELK report could have done a better job communicating the “big-picture” for readers.
Note that after the report was released, some of these questions were addressed in comments by Paul (see How is ARC planning to use ELK? and On how various plans miss the hard bits of alignment). Even with these clarifications, I still think there could be clearer communication about how ELK fits into a broader alignment plan, the assumptions behind ELK, and the aspects of alignment that ELK does not address.
More justification for focusing on worst-case scenarios. The ELK report focuses on solving ELK in the worst case. If we can think of a single counterexample to a proposal, the proposal breaks. This seems strange to me. It feels much more natural to think about ELK proposals probabilistically, ranking proposals based on how likely they are to reduce the chance of misalignment. In other words, I broadly see the aim of alignment researchers as “come up with proposals that reduce the chance of AI x-risk as much as possible” as opposed to “come up with proposals that would definitely work.”
While there are a few justifications for this in the ELK report, I didn’t find them compelling, and I would’ve appreciated more discussion of what an alternative approach would look like. For example, I would’ve found it valuable for the authors to (a) discuss their justification for focusing on the worst case in more detail, (b) discuss what it might look like for people to think about ELK in “medium-difficulty scenarios”, (c) explain whether ARC thinks about ELK probabilistically (e.g., solution X seems to improve our chance of getting the direct translator by ~2%), and (d) identify factors that might push ARC away from working on worst-case ELK (e.g., if ARC believed AGI was arriving in 2 years and still didn’t have a solution to worst-case ELK, what would it do?).
Clearer writing. One of the most common complaints about ELK is that it’s long and dense. This is understandable; ELK conveys a lot of complicated ideas from a pre-paradigmatic field, and in doing so it introduces several novel vocabulary words and frames. Nonetheless, I would feel more excited about a version of ELK that was able to communicate concepts more clearly and succinctly. Some specific ideas include offering more real-world examples to illustrate concepts, defining terms/frames more frequently, including a glossary, and providing more labels/captions for figures.
Short anecdote
I’ll wrap up my review with a short anecdote. When I first began working on ELK (in Jan 2022), I reached out to Tamera (a friend from Penn EA) and asked her to come to Berkeley so we could work on ELK together. She came, started engaging with the AIS community, and ended up moving to Berkeley to skill up in technical AIS. She’s now a research resident at Anthropic who has been working on externalized reasoning oversight. It’s unclear if or when Tamera would’ve otherwise had the opportunity to come to Berkeley, but my best guess is that ELK was a major speed-up for her. I’m not sure how many other cases there were of people getting involved or sped up by ELK. But I think it’s a useful reminder that some of ELK’s impact (whether positive or negative) will be difficult to evaluate, especially given the number of people who engaged with it (I’d guess 100+, and quite plausibly 500+).