On the purposes of decision theory research
Following the examples of Rob Bensinger and Rohin Shah, this post will try to clarify the aims of part of my research interests, and disclaim some possible misunderstandings about it. (I’m obviously only speaking for myself and not for anyone else doing decision theory research.)
I think decision theory research is useful for:
1. Gaining information about the nature of rationality (e.g., is “realism about rationality” true?) and the nature of philosophy (e.g., is it possible to make real progress in decision theory, and if so, what cognitive processes are we using to do that?), and helping to solve the problems of normativity, meta-ethics, and metaphilosophy.
2. Better understanding potential AI safety failure modes that are due to flawed decision procedures implemented in or by AI.
3. Making progress on various seemingly important intellectual puzzles that appear directly related to decision theory, such as free will, anthropic reasoning, logical uncertainty, and Rob’s examples of counterfactuals, updatelessness, and coordination, among others.
4. Firming up the foundations of human rationality.
To me, decision theory research is not meant to:
5. Provide a correct or normative decision theory that will be used as a specification or approximation target for programming or training a potentially superintelligent AI.
6. Help create “safety arguments” that aim to show that a proposed or already existing AI is free from decision theoretic flaws.
To help explain 5 and 6, here’s what I wrote in a previous comment (slightly edited):
One meta level above what even UDT tries to be is decision theory (as a philosophical subject), and one level above that is metaphilosophy, and my current thinking is that it seems bad (potentially dangerous or regrettable) to put any significant (i.e., superhuman) amount of computation into anything except doing philosophy.
To put it another way, any decision theory that we come up with might have some kind of flaw that other agents can exploit, or just a flaw in general, such as in how well it cooperates with, negotiates with, or exploits other agents (which might include how quickly/cleverly it can make the necessary commitments). Wouldn’t it be better to put computation into trying to find and fix such flaws (in other words, into coming up with better decision theories) than into any particular object-level decision theory, at least until the superhuman philosophical computation itself decides to start doing the latter?
Comparing my current post to Rob’s post on the same general topic, my mentions of 1, 2, and 4 above seem to be new, and he didn’t seem to share (or didn’t choose to emphasize) my concern that decision theory research (as done by humans in the foreseeable future) can’t solve decision theory definitively enough to obviate the need to make sure that any potentially superintelligent AI can find and fix decision theoretic flaws in itself.
It seems like 1, 3, and 4 could apply to lots of different kinds of research. Did you consider other ways to achieve them before you chose to work on decision theory?
Like, it feels to me like there are recognized open problems in AI safety which, if solved, would plausibly have direct useful applications, and which additionally are philosophical in nature, relate to important intellectual puzzles, and allow one to firm up the foundations of human rationality.
And I would guess there is also a fairly broad set of problems which don’t have direct relevance to AI safety that satisfy 1/3/4. For example, I find a lot of discussion of the replication crisis philosophically unsatisfying and have some thoughts on this—should I write these thoughts up on the grounds that they are useful for AI safety?
Not to mention the possibility of trying to attack problems you mention in 1/3/4 directly.
Decision theory could still be a great focus for you personally if it’s something you’re interested in, but if that’s why you chose it, that would be useful for others to know, I think.
It looks like you’re interpreting this post as arguing for doing more decision theory research relative to other kinds of research, which is not really my intention, since as you note, that would require comparing decision theory research to other kinds of research, which I didn’t do. (I would be interested to know how I might have given this impression, so I can recalibrate my writing in the future to avoid such misunderstandings.) My aim in writing this post was more to explain why, given that I’m not optimistic that we can solve decision theory in a definitive way, I’m still interested in decision theory research.
No, but I have considered it since then, and have added to my research interests (such as directly attacking 1) as a result. (If you’re curious about how I got interested in decision theory originally, the linked post List of Problems That Motivated UDT should give a pretty good idea.)
If we do compare decision theory to other philosophical problems relevant to AI safety (say “how can we tell whether a physical system is having a positive or negative experience?”, which I’m also interested in, BTW), decision theory feels relatively more tractable to me, and less prone to the sort of back-and-forth arguments between different camps preferring different solutions that are common elsewhere in philosophy, because decision theory seems constrained by having to simultaneously solve so many problems that it’s easier to detect when clear progress has been made. (However, the lack of clear evidence of progress in decision theory in recent years could be considered an argument against this.)
If other people have different intuitions (and there’s no reason to think that they have especially bad intuitions) I definitely think they should pursue whatever problems/approaches seem most promising to them.
I’m not sure I understand this part. Are you saying there are problems that don’t have direct relevance to AI safety, but have indirect relevance via 1/3/4? If so, sure, you should write them up, depending on the amount of indirect relevance...
As explained above, it’s not as simple as this, and I wasn’t prepared to give a full discussion of “should you choose to work on decision theory or something else” in this post.
Fair enough.
I am afraid of an AI’s philosophy. I expect it to be something like: “I created meta-ultra-utilitarian double updateless dissection theory, which needs 25 billion years to be explained, but, in short, I have to kill you in this interesting way and convert the whole Earth into things which don’t have names in your language.”
Also, passing the buck of complexity to the superintelligent AI again creates, as in the case of AI alignment, a kind of circularity: we need safe AI to help us solve problem X, but to create safe AI, we need to know the answer to X, where X is either “human values” or “decision theory”.
Aside from global coordination to not build AGI at all until we can exhaustively research all aspects of AI safety (which I really wish were feasible, but we don’t seem to live in that world), I’m not sure how to avoid “passing the buck of complexity to the superintelligent AI” in some form or another. It seems to me that if we’re going to build superintelligent AI, we’ll need superintelligent AI to do philosophy for us (otherwise philosophical progress will fall behind other kinds of progress, which would be disastrous), so we need to figure out how to get it to do that safely/correctly, which means solving metaphilosophy.
Do you see any other (feasible) alternatives to this?
It seems to me that you believe there is some amount of philosophical work that would allow us to be certain that, if we can create a superintelligent AI, we can guarantee it is safe, but that this amount of work is impossible, or at least close to impossible, to do before we develop superintelligent AI. But it doesn’t seem obvious to me (and presumably others) that this is true. It might be a very long time before we create AI, or we might succeed in motivating enough people to work on the problem that we could achieve this level of philosophical understanding before it is possible to create superintelligent AI.
You can get a better sense of where I’m coming from by reading Some Thoughts on Metaphilosophy. Let me know if you’ve already read it and still have questions or disagreements.
I think that there could be other ways to escape this dilemma. In fact, I wrote a list of possible ideas for “global solutions” (e.g., ban AI, take over the world, create many AIs) here.
Some possible ideas (not necessarily good ones) are:
Use the first human upload as an effective AI police force that prevents the creation of any other AI.
Use other forms of narrow AI to take over the world and create an effective AI police capable of finding and stopping unauthorised AI research.
Drexler’s CAIS.
Something like Christiano’s approach: a group of people augmented by narrow AI forms a “human-AI Oracle” and solves philosophy.
Active AI boxing as a commercial service.
Human augmentation.
Most of these ideas are centered on ways of getting high-level real-world capabilities by combining limited AI with something powerful in the outside world (humans, data, nuclear power, market forces, an active box), and then using these combined capabilities to prevent the creation of really dangerous AI.
None of these ideas seem especially promising even for achieving temporary power over the world (sufficient for preventing the creation of other AIs).
It seems even harder to achieve a long-term stable and safe world environment, in which we can take our time to solve remaining philosophy and AI safety problems and eventually realize the full potential value of the universe.
Some of them (using other forms of narrow AI to take over the world, Christiano’s approach) seem to require solving something like decision theory or metaphilosophy anyway to ensure safety.
I know I sound like a broken record, but I don’t believe that a lot of progress can be made on 1 and 3 until one starts to routinely taboo “true” and “free will”. Try changing the wording from “true” to “useful” and from “free will” to “algorithm”.
So you’re saying we need to solve decision theory at the meta level, instead of the object level. But can’t we view any meta-level solution as also (trivially) an object-level solution?
In other words, “[making] sure that any potentially superintelligent AI can find and fix decision theoretic flaws in itself” sounds like a special case of “[solving] decision theory in a definitive enough way”.
====================
I’ll start by objecting to your (6): this seems like an important goal, in my mind. And if we *aren’t* able to verify that an AI is free from decision theoretic flaws, then how can we trust it to self-modify to be free of such flaws?
Your perspective still makes sense to me if you say: “this AI (call it ALICE) is exploitable, but it’ll fix that within 100 years, so if it doesn’t get exploited in the meantime, then we’ll be OK”. And of course, in principle, making an agent that will have no flaws within X years of when it is created is easier than the special case of X=0.
In reality, it seems plausible to me that we can build an agent like ALICE and have a decent chance that ALICE won’t get exploited within 100 years.
But I still don’t see why you dismiss the goal of (6); I don’t think we have anything like definitive evidence that it is an (effectively) impossible goal.
Sure, I’m saying that it seems easier to reach the object level solution by working on the meta level (except in so far as working on the object level helps with meta-level understanding). As an analogy, if you need to factor a large number, it’s much easier to try to find an algorithm for doing so than to try to factor the number yourself.
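(Purely as an illustrative aside, not part of the original exchange: here is a minimal Python sketch of the factoring analogy. The point is that the meta-level work of writing a general procedure, done once, handles any particular number, whereas object-level work would mean factoring each number by hand. The function name and example inputs are just placeholders.)

```python
# Meta level: write a general factoring procedure once...
def factor(n: int) -> list[int]:
    """Return the prime factorization of n by simple trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d  # divide out each prime factor as it is found
        d += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors

# ...object level: applying it to any particular instance is then trivial.
print(factor(3599))          # [59, 61]
print(factor(600851475143))  # the same procedure works unchanged on a much larger number
```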
How do we verify that an AI is free from decision theoretic flaws, given that we don’t have a formal definition of what counts as a “decision theoretic flaw”? (If we did have such a formal definition, the definition itself might have a flaw.) It seems like the only way to do that is to spend a lot of researcher hours searching the AI (or the formal definition) for flaws (or what humans would recognize as a flaw), but no matter how much time we spend, it seems like we can’t definitively rule out the possibility that there might still be some flaws left that we haven’t found. (See Philosophy as interminable debate.)
One might ask why the same concern doesn’t apply to metaphilosophy. Well, I’m guessing that “correct metaphilosophy” might have a bigger basin of attraction around itself than “correct decision theory” so not being able to be sure that we’ve found and fixed all flaws might be less crucial. As evidence for this, it seems clear that humans don’t have (i.e., aren’t already using) a decision theory that is in the basin of attraction around “correct decision theory”, but we do plausibly have a metaphilosophy that is in the basin of attraction around “correct metaphilosophy”.
What’s the best description of what you mean by “metaphilosophy” you can point me to? I think I have a pretty good sense of it, but it seems worthwhile to be as rigorous / formal / descriptive / etc. as possible.
This description from Three Approaches to “Friendliness” perhaps gives the best idea of what I mean by “metaphilosophy”:
(One could also imagine approaches that are somewhere in between these two, where, for example, AI designers have some partial understanding of what “doing philosophy” is, and program the AI to learn from human philosophers based on this partial understanding.)
For more of my thoughts on this topic, see Some Thoughts on Metaphilosophy and the posts that it links to.