How to do conceptual research: Case study interview with Caspar Oesterheld
Caspar Oesterheld came up with two of the most important concepts in my field of work: Evidential Cooperation in Large Worlds (ECL) and Safe Pareto Improvements (SPIs). He also came up with a potential implementation of evidential decision theory in boundedly rational agents called decision auctions, wrote a comprehensive review of anthropics and how it interacts with decision theory, which most of my anthropics discussions have built on, and independently decided to work on AI sometime in late 2009 or early 2010.
Needless to say, I have a lot of respect for Caspar’s work. I’ve often felt very confused about what to do in my attempts at conceptual research, so I decided to ask Caspar how he did his research. Below is my writeup from the resulting conversation.
How Caspar came up with surrogate goals
The process
Caspar had spent six months of full-time-equivalent (FTE) work, spread over two years, thinking about a specific bargaining problem between two factions with access to powerful AI.
A lot of the time was spent on specific, somewhat narrow research projects, e.g. modelling the impact of moral advocacy in China on which bargaining problems we’ll realistically encounter in the future. At the time, he thought those particular projects were important, although he may already have had a hunch that he wouldn’t think so anymore ten years down the line.
At the same time, he also spent some time on most days thinking about bargaining problems on a relatively high level, either in discussions or on walks. This made up a double-digit percentage of his time spent researching bargaining problems.
Caspar came up with the idea of surrogate goals during a conversation with Tobias Baumann. He describes the conversation leading up to the idea as “going down the usual loops of reasoning about bargaining”: you consider just building values into your AI that are strategically advantageous in bargaining, but then worry that this is just another form of aggressive bargaining. The key insight was to go “Wait, maybe there’s a way to make it not so bad for the other side.” Hence, counterpart-friendly utility function modifications were born, which later turned into surrogate goals.
Once he had the core idea of surrogate goals, he spent some time trying to figure out the general principle behind “this one weird trick” he had found. This led to the SPI paper, co-authored with Vincent Conitzer, and he continues trying to answer this question now.
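To make the “not so bad for the other side” property a bit more concrete, here is a minimal numeric sketch in Python. Everything in it (the probability that a threat gets carried out, the payoff numbers, the assumption that the threatener behaves the same in both cases) is made up for illustration; it is not the formal treatment from the SPI paper.

```python
# Toy sketch of the counterpart-friendly / surrogate-goal idea.
# All numbers are hypothetical; this only illustrates the "not so bad for the
# other side" property, not the formal Safe Pareto Improvement framework.

p_threat_executed = 0.1  # assumed probability that bargaining breaks down and the threat is carried out
concession_value = 3.0   # value the threatener extracts when the threat works (same in both cases by assumption)

def expected_outcomes(real_cost_if_executed):
    """Return (target's expected loss in its original values, threatener's expected gain)."""
    target_loss = p_threat_executed * real_cost_if_executed + (1 - p_threat_executed) * concession_value
    threatener_gain = (1 - p_threat_executed) * concession_value
    return target_loss, threatener_gain

# Without a surrogate goal, a carried-out threat destroys something the target
# really cares about (cost 10). With a surrogate goal, threats are aimed at a
# cheap stand-in instead (cost 1), while the target responds to them exactly as
# it would have before, so the threatener's expected gain is unchanged.
print("without surrogate goal:", expected_outcomes(real_cost_if_executed=10.0))
print("with surrogate goal:   ", expected_outcomes(real_cost_if_executed=1.0))
```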
Caspar’s reflections on what was important during the process
He thinks it was important to just have spent a ton of time, in his case six months FTE, on the research area. This helps with building useful heuristics.
It’s hard or impossible, and probably fruitless, to think about a research area only at an extremely high level. “You have to pass the time somehow.” His particular projects, for example researching moral advocacy in China, served as a way of “passing the time”, so to speak.
At the same time, he thinks it is both very motivationally hard and perhaps not very sensible to work on something that’s in roughly the right research area but where you really can’t see a direct case for impact. You can end up wasting a bunch of time grinding out technical questions that have nothing much to do with anything.
Relatedly, he thinks it was really important that he continued doing some high-level thinking about bargaining alongside his more narrow projects.
He describes a common dynamic in high-level thinking: often you get stuck on something that’s conceptually tricky and go through the same reasoning loops over and over again, spread over days, weeks, months, or years. You usually re-enter the loop because you think you have a new angle.
That seems normal and not necessarily bad to him: most of the time, you realise your new idea is wrong or doesn’t solve any new problems. But sometimes, you genuinely find an alternative branch to go down, such as the “maybe we can make this (a unilateral modification of our AI’s values) friendly to the counterpart” branch. He wouldn’t have discovered that branch if he had refused to entertain familiar arguments again.
How Caspar came up with ECL
The process
Before coming up with ECL, Caspar had read a bunch of decision theory and AI. However, he hadn’t deeply researched these or other ECL-related topics before coming up with the idea.
At some point, he had a call with two people from what was back then the Foundational Research Institute where they talked a bunch about acausal interactions but not about ECL.
He had the ECL idea within hours, or perhaps a day, of that call, after thinking a bunch about the things discussed in it.
To Caspar, ECL seems fairly obvious. He finds it hard to imagine a world where he never comes up with it upon learning about acausal dynamics.
There were some concepts that might have made it easier for him to make the connection:
Big worlds were quite salient in the s-risk sphere at this point in time with many discussions of other potential big-world-related s-risks.
Caspar was learning about AI, and ideas like the orthogonality thesis were very natural to him at this point, so it was easy for him to think of agents with different values correlating.
A lot of theoretical computer science/complexity theory is about analogous situations and how you can reduce one question to another question, so it was easy to think about agents often finding themselves in analogous situations to each other.
At the time of coming up with ECL, he hadn’t yet thought deeply about questions such as what acausal correlations are.
It took two years from the initial idea to the eventual paper on ECL. The process could have perhaps been sped up by half a year or so if he had prioritised speed of publication more.
The report was mostly created top down rather than bottom up. That is, Caspar decided fairly early on that he was going to write a big report. Some of the content that would eventually make it into the report was already written at the time. But a lot of the content also didn’t exist at all yet by the time there was an outline detailing which content should exist. Most of the research in the report was done specifically with the goal of making the report complete, as opposed to, for example, being curiosity projects.
In hindsight, he would have probably spent less time researching technical sub-questions for the report.
Caspar’s reflections on what was important during the process
Given that the idea seems so obvious to Caspar and there wasn’t much leading up to the idea, this one is harder to answer. One hypothesis is that this was just a case of “needing to have the exact right set of ideas salient in the same mind to combine them” and then judging them to be important.
How Caspar came up with decision auctions
The process
While Caspar came up with surrogate goals and ECL more or less by accident, he came up with decision auctions starting with a very concrete impact-oriented question: How do we build an EDT agent?
He then went through these (highly idealised) steps of reasoning:
The Garrabrant logical induction paper looks pretty good. It’s a good approach to logical uncertainty and bounded rationality, which seems closely related to building an EDT agent. But they specifically say that their approach doesn’t work well to build an EDT agent. But maybe I can extend it to change that?
Logical induction takes a prediction-market-style approach. I will look into the prediction markets literature. There’s also this paper that takes an economics-style perspective, which points out essentially the same issues as the Garrabrant paper above, so I seem to be on the right track with reading the literature on prediction markets.
There’s a pretty nice economics paper, “Eliciting Predictions and Recommendations for Decision Making” (Chen, Kash, Ruberry & Shnayder, 2014), but it involves randomisation, so it can’t be used to build an EDT agent. But it has a nice formalism and seems to offer a great formal framework to think about agents that are powered by something like a prediction market. It is much easier to think about something with a concrete formal structure than to just vaguely think “Hm, how do you build an agent that kind of behaves like this or that?” Maybe I can extend the method in the paper to make it EDT-friendly?
This, alongside the help of his co-author Vincent Conitzer, led to Caspar’s paper on decision scoring, which identifies the decision auctions mechanism.
He then went back to the agent foundations frame and, together with Abram Demski and with guidance from Vincent Conitzer, worked out how to connect it with decision auctions in a theory of bounded inductive rationality.
[editor’s note: I find it notable that all the linked papers are in CS venues rather than economics. That said, while Yiling Chen is a CS professor, she studied economics and has an economics PhD.]
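To convey the general flavour of “agents bidding for the right to decide and then being scored by realized utility”, here is a loose, Vickrey-style toy in Python. The payment rule, the Expert class, and all numbers are my own simplifications for illustration; the actual mechanisms in the decision scoring and bounded inductive rationality papers differ in their details.

```python
import random

# Loose, Vickrey-style toy: experts bid for the right to recommend an action,
# and the winner is paid according to the realized utility. This is an
# illustrative simplification, not the exact decision-auction mechanism from
# the papers discussed above.

class Expert:
    def __init__(self, name, beliefs):
        self.name = name
        self.beliefs = beliefs  # action -> subjective expected utility for the principal

    def bid(self):
        # Each expert bids the expected utility of its preferred action.
        action = max(self.beliefs, key=self.beliefs.get)
        return action, self.beliefs[action]

def run_round(experts, true_expected_utility):
    bids = sorted(((e, *e.bid()) for e in experts), key=lambda t: t[2], reverse=True)
    winner, action, _ = bids[0]
    runner_up_bid = bids[1][2]
    # The principal takes the winner's recommended action and observes its utility.
    realized_utility = random.gauss(true_expected_utility[action], 0.1)
    # Second-price flavour: the winner pays the runner-up's bid and receives the
    # realized utility, so bidding one's honest estimate is a reasonable strategy.
    winner_profit = realized_utility - runner_up_bid
    return winner.name, action, round(winner_profit, 3)

experts = [
    Expert("optimist", {"a": 0.9, "b": 0.4}),
    Expert("realist", {"a": 0.6, "b": 0.5}),
]
true_expected_utility = {"a": 0.6, "b": 0.5}
print(run_round(experts, true_expected_utility))
```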
How Caspar decided to work on superhuman AI in late 2009 or early 2010
My impression is that a few people in AI safety independently decided that AI was the most important lever over the future and then discovered LessWrong, Eliezer Yudkowsky, and the AI safety community. Caspar is one of those people. While this didn’t turn out to be unique or counterfactually impactful, I am including his story of deciding to work on superhuman AI. The story is from notes Caspar left in writing after the interview. I mostly copied them verbatim, with some light editing for clarity, and kept them in first person.
The process
“Much of this happened when I was very young, so there’s some naivete throughout:
When I was young I wanted to become a physicist, because physics is the most fundamental science.
Physics uses maths, so I first wanted to learn some maths. To that end I took a linear algebra course at the University of Hamburg, which for some reason started with the Zermelo-Fraenkel axiomatization of set theory. (Linear algebra courses don’t normally introduce those ideas.)
This led me to think about automated theorem proving: using the Zermelo-Fraenkel axiomatization, you can write down a program that finds all correct proofs/all provable theorems. You’d “just” have to figure out how to make this program fast/efficient. This seemed like a big deal to me at the time! Why be a mathematician and prove theorems yourself when it seems much more leveraged to figure out automated theorem proving and then prove theorems that way? This led me to think and read about AI a bunch, including outside of the automated theorem proving context.
Then at some point I sat down and thought about what the most impactful thing would be that I could do with my life. And then creating superhuman AI for the improvement of society was my best guess. (I don’t remember why I chose this over global warming, by the way; I’d guess it was neglectedness or comparative advantage, but I’m not sure. I had been thinking a bunch about global warming at the time.) [editor’s note: This was late 2009 or early 2010, when Caspar was 15.] So then I learned even more about AI and CS, deprioritized math, and when it came to choosing what BSc/undergrad to enroll in, I picked CS at a uni that had a lot of people working on AI. Within AI, I also focused on learning about the subfields that seemed to me most useful for AGI, i.e., RL, neural nets, and cognitive science, as opposed to, say, support vector machines or automated theorem proving.
Eventually (once I used the English-language Internet more) I found some articles by Yudkowsky on AGI, which then led me to Yudkowsky’s writing on AI safety, which convinced me to think more about safety and adjacent topics (ethics), and also caused me to engage with EA. (As you might know, Yudkowsky et al. also wanted to create AGI before they started working on safety. So to some extent my trajectory is similar, though I didn’t have to do the hard work to become convinced of safety as a priority, which to me seems like a more difficult step than figuring out that AI is important in some way.)”
Caspar’s reflections on what was important during the process
“I was often driven by “this seems like a big deal”-type intuitions that weren’t exactly correct, but that did track the truth to some extent. This caused me to work on and think about various “adjacent” ideas and this was very useful. For example, take “automated theorem proving is a more leveraged way to prove mathematical theorems”. Of course, there are lots of issues with this idea. (Why is proving mathematical theorems important in the first place? Is any of this counterfactual? Can you solve automated theorem proving without “solving AGI”?) But to some extent the argument contains some of the structure of the true arguments for the importance of AI. And thinking about automated theorem proving was good because it led me to think about AI a bunch. Maybe at the time I could have known that I was wrong or naive in various ways. But just acting on the views at the time was definitely better than discarding them altogether.
Consuming existing ideas (e.g., taking a linear algebra course, reading texts about the future of AI and the importance of AI safety) is important.
It was also important to at various points think explicitly about impact as opposed to just following curiosity.”
General notes on his approach to research
What does research concretely look like in his case?
Things he might do when he does research, in no particular order:
Reading
Some types of writing, e.g. whenever he thinks “is this actually true?” or “maybe this works?”
Talking with others
Talking to himself
Scribbling on a piece of paper
Taking walks and thinking
Thinking about stuff in the background, for example, sometimes he just has research ideas while watching a movie. (This includes very concrete research ideas, for example suddenly realising a mistake in a particular step in a proof sketch.)
Research immersion
Caspar is unsure how important research immersion (described below) is. He knows others who say it’s important for doing good research.
One way to think about it: It’s a different mode of thinking. Research is to some extent about generating ideas, so thinking about your research in diverse modes of thinking is good.
Description of what research immersion means in Caspar’s case:
Occasional life-distracting obsessive immersion: Sometimes, especially when he has a fairly well-defined technical question, he can’t let the question go from his mind for a day or several days. His whole mental life will revolve around this question, even when it isn’t that important. This makes it difficult to do other things, be it life or work. It also often feels bad if it doesn’t feel like he’s making progress.
Usual background immersion: Most days, he has his research questions in the back of his mind when he’s off work. If he’s not doing distracting activities, he will passively have some research on his mind during perhaps 25% of his free evening time. (A bit like a song that’s stuck in your head, although often very quietly.)
Goal orientation vs. curiosity orientation
Initially, when he was at CLR, he spent a fair amount of time doing random reading.
Sometimes he does projects where the path to impact is as vague as “oh, maybe it will help us understand things better and then we’ll make better decisions”.
Now, having shorter timelines, he cares more about shorter and more concrete paths to impact.
Given that he’s in academia, he also has to do a lot of things that are goal-oriented, but towards, e.g., the goal of publishing, while the research itself is perhaps only intellectually interesting rather than important.
Thanks for the interesting write-up.
Regarding Evidential Cooperation in Large Worlds: the identical twin one-shot prisoner’s dilemma makes sense to me because the entity giving the payout is connected to both worlds. What is the intuition for ECL, where my understanding is there isn’t any such connection?
For ECL, the “entity giving the payout” would in practice just be the world states you end up in; this requires you to care about the environment of the person you’re playing the prisoner’s dilemma with.
So, defecting might just be optimising my local environment for my own values, while cooperating would be optimising my local environment for some aggregate of my own values and the values of the person I’m playing with. It only works if there are positive-sum aggregates and if each player cares about what the other does to their local environment.
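A toy expected-value calculation might make this concrete. The payoffs and the strength of the evidential correlation below are made up; the point is only that if my choice is strong evidence about the other agent’s choice, and I care somewhat about what happens in their environment, cooperating can come out ahead even though defecting dominates causally.

```python
# Toy EDT-style calculation for a prisoner's-dilemma-like situation between two
# distant agents. Payoffs count what I value across BOTH local environments;
# all numbers (including the correlation strength) are made up for illustration.

payoff = {
    ("C", "C"): 3,  # both optimise for a positive-sum aggregate of our values
    ("C", "D"): 0,
    ("D", "C"): 4,
    ("D", "D"): 1,  # both just optimise their own local environment for their own values
}

# Evidential assumption: because we reason similarly, my cooperating is strong
# evidence that the other agent cooperates too.
p_coop_given_i_cooperate = 0.9
p_coop_given_i_defect = 0.1

def expected_utility(my_action, p_they_cooperate):
    return (p_they_cooperate * payoff[(my_action, "C")]
            + (1 - p_they_cooperate) * payoff[(my_action, "D")])

print("EU(cooperate):", expected_utility("C", p_coop_given_i_cooperate))  # 2.7
print("EU(defect):   ", expected_utility("D", p_coop_given_i_defect))     # 1.3
```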