ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI's model of the world and a human's model. This is closely related to ontology identification (and other similar problem statements). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.
The report is available here as a Google document. If you're excited about this research, we're hiring!
Q&A
We’re particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
ELK was one of my first exposures to AI safety. I participated in the ELK contest shortly after moving to Berkeley to learn more about longtermism and AI safety. My review focuses on ELK’s impact on me, as well as my impressions of how ELK affected the Berkeley AIS community.
Things about ELK that I benefited from
Understanding ARC’s research methodology & the builder-breaker format. For me, most of the value of ELK came from seeing ELK’s builder-breaker research methodology in action. Much of the report focuses on presenting training strategies and presenting counterexamples to those strategies. This style of thinking is straightforward and elegant, and I think the examples in the report helped me (and others) understand ARC’s general style of thinking.
Understanding the alignment problem. ELK presents alignment problems in a very “show, don’t tell” fashion. While many of the problems introduced in ELK have been written about elsewhere, ELK forces you to think through the reasons why your training strategy might produce a dishonest agent (the human simulator) as opposed to an honest agent (the direct translator). The interactive format helped me more deeply understand some of the ways in which alignment is difficult.
Common language & a shared culture. ELK gave people a concrete problem to work on. A whole subculture emerged around ELK, with many junior alignment researchers using it as their first opportunity to test their fit for theoretical alignment research. There were weekend retreats focused on ELK. It was one of the main topics people were discussing in January and February 2022. People shared their training strategy ideas over lunch and dinner. It's difficult to know for sure what kind of effect this had on the community as a whole. But at least for me, my current best guess is that this shared culture helped me understand alignment, increased the amount of time I spent thinking/talking about alignment, and helped me connect with peers/collaborators who were thinking about alignment. (I'm sympathetic, however, to arguments that ELK may have reduced the amount of independent/uncorrelated thinking around alignment & may have produced several misunderstandings, some of which I'll point at in the next section.)
Ways I think ELK could be improved
Disclaimer: I think each of these improvements would have been (and still is) time-consuming, and I don't think it's crazy for ARC to say "yes, we could do this, but it isn't worth the time-cost."
More context. ELK felt like a paper without an introduction or a discussion section. I think it could've benefited from more context on why it's important, how it relates to previous work, how it fits into a broader alignment proposal, and what kinds of assumptions it makes.
Many people were confused about how ELK fits into a broader alignment plan, which assumptions ELK makes, and what would happen if ARC solved ELK. Here are some examples of questions that I heard people asking:
Is ELK the whole alignment problem? If we solve ELK, what else do we need to solve?
How did we get the predictor in the first place? Does ELK rely on our ability to build a superintelligent oracle that hasn’t already overpowered humanity?
Are we assuming that the reporter doesn’t need to be superintelligent? If it does need to be superintelligent (in order to interpret a superintelligent predictor), does that mean we have to solve a bunch of extra alignment problems in order to make sure the reporter doesn’t overpower humanity?
Does ELK actually tackle the "core parts" of the alignment problem? (This was discussed in this post (released 7 months after the ELK report) and this post (released 9 months after ELK) by Nate Soares. I think the discourse would have been faster, of higher quality, and would have invited people other than Nate if ARC had made some of its positions clearer in the original report.)
One could argue that it's not ARC's job to explain any of this. However, my impression is that ELK had a major influence on how a new cohort of researchers oriented toward the alignment problem. This is partially because of the ELK contest, partially because ELK was released around the same time that several community-building efforts were ramping up, and partially because there weren't (and still aren't) many concrete research problems to work on in alignment research.
With this in mind, I think the ELK report could have done a better job communicating the “big-picture” for readers.
Note that after the report was released, some of these questions were addressed in comments by Paul (see How is ARC planning to use ELK? and On how various plans miss the hard bits of alignment). Even with these clarifications, I still think there could be clearer communication about how ELK fits into a broader alignment plan, the assumptions behind ELK, and the aspects of alignment that ELK does not address.
More justification for focusing on worst-case scenarios. The ELK report focuses on solving ELK in the worst case. If we can think of a single counterexample to a proposal, the proposal breaks. This seems strange to me. It feels much more natural to think about ELK proposals probabilistically, ranking proposals based on how likely they are to reduce the chance of misalignment. In other words, I broadly see the aim of alignment researchers as “come up with proposals that reduce the chance of AI x-risk as much as possible” as opposed to “come up with proposals that would definitely work.”
While there are a few justifications for this in the ELK report, I didn't find them compelling, and I would've appreciated more discussion of what an alternative approach would look like. For example, I would've found it valuable for the authors to (a) discuss their justification for focusing on the worst case in more detail, (b) discuss what it might look like for people to think about ELK in "medium-difficulty scenarios", (c) explain whether ARC thinks about ELK probabilistically (e.g., X solution seems to improve our chance of getting the direct translator by ~2%), and (d) identify factors that might push ARC away from working on worst-case ELK (e.g., if ARC believed AGI was arriving in 2 years and still didn't have a solution to worst-case ELK, what would they do?).
Clearer writing. One of the most common complaints about ELK is that it’s long and dense. This is understandable; ELK conveys a lot of complicated ideas from a pre-paradigmatic field, and in doing so it introduces several novel vocabulary words and frames. Nonetheless, I would feel more excited about a version of ELK that was able to communicate concepts more clearly and succinctly. Some specific ideas include offering more real-world examples to illustrate concepts, defining terms/frames more frequently, including a glossary, and providing more labels/captions for figures.
Short anecdote
I'll wrap up my review with a short anecdote. When I first began working on ELK (in Jan 2022), I reached out to Tamera (a friend from Penn EA) and asked her to come to Berkeley so we could work on ELK together. She came, started engaging with the AIS community, and ended up moving to Berkeley to skill up in technical AIS. She's now a research resident at Anthropic who has been working on externalized reasoning oversight. It's unclear if or when Tamera would otherwise have had the opportunity to come to Berkeley, but my best guess is that this was a major speed-up for her. I'm not sure how many other cases there were of people getting involved in, or sped up by, ELK. But I think it's a useful reminder that some of the impact of ELK (whether positive or negative) will be difficult to evaluate, especially given the number of people who engaged with ELK (I'd guess at least 100, and quite plausibly 500+).
I’ve written a bunch elsewhere about object-level thoughts on ELK. For this review, I want to focus instead on meta-level points.
I think ELK was very well-made; I think it did a great job of explaining itself with lots of surface area, explaining a way to think about solutions (the builder-breaker cycle), bridging the gap between toy demonstrations and philosophical problems, and focusing lots of attention on the same thing at the same time. In terms of impact on the growth and development of the AI safety community, I think this is one of the most important posts from 2021 (even though the prize and much of the related work happened in 2022).
I don’t really need to ask for follow-on work; there’s already tons, as you can see from the ELK tag.
I think the broad audience may underappreciate how much this is an old problem, and I appreciate the appendix that gives credit to earlier thinking; that history doesn't erode any of the credit Paul, Mark, and Ajeya should get for the excellent packaging.
[To the best of my knowledge, ELK is still an open problem. One of the things I appreciated about the significant focus on ELK specifically was that it helped give people better models of how quickly progress happens in this space, and what that progress looks like (or doesn't look like).]
I’m leaving this review primarily because this post somehow doesn’t have one yet, and it’s way too important to get dropped out of the Review!
ELK had some of the most alignment community engagement of any technical content that I've seen. It is extremely thorough, well-crafted, and aims at a core problem in alignment. It serves as an exemplar of how to present concrete problems to induce more people to work on AI alignment.
That said, I personally bounced after reading the first few pages of the document. It was good as far as I got, but it was pretty effortful to get through, and (as mentioned above) already had tons of attention on it.
FWIW I think the Eliciting Latent Knowledge problem doesn't stand well on its own as an introduction, and thinking about this problem goes way better when you see the bigger picture that Paul is working through and used to generate it, laid out in his post My Research Methodology (my review here). That post walks through several of the major steps in Paul's reasoning that led to this problem being raised, rather than just dumping you in it, and is written in Paul's native voice. I'd rank that post as substantially more useful than this one.