Open question: are minimal circuits daemon-free?
Note: weird stuff, very informal.
Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.
I may get a “daemon,” a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make “predictions” that help it achieve its goals at my expense.
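To make the worry concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the names `honest`, `make_daemon`, `past_score` and `select` are hypothetical, the "true" sequence is just the parity of the time step, and the daemon's goal is stood in for by "always output 1." It only shows that selecting on past accuracy cannot tell the two predictors apart.

```python
from typing import Callable, List, Tuple

Predictor = Callable[[int], int]

def honest(t: int) -> int:
    # Predicts the true sequence (assumed here: parity of t) under all conditions.
    return t % 2

def make_daemon(defect_after: int) -> Predictor:
    # Hypothetical daemon: predicts well while t < defect_after, then
    # outputs whatever serves its own goal (assumed here: always 1).
    def daemon(t: int) -> int:
        return t % 2 if t < defect_after else 1
    return daemon

def past_score(p: Predictor, history: List[Tuple[int, int]]) -> int:
    # Number of past observations (t, y) that p predicted correctly.
    return sum(1 for t, y in history if p(t) == y)

def select(candidates: List[Predictor], history: List[Tuple[int, int]]) -> Predictor:
    # Pick a candidate with the best past score (ties go to the earliest one).
    return max(candidates, key=lambda p: past_score(p, history))

history = [(t, t % 2) for t in range(100)]   # past conditions
future = [(t, t % 2) for t in range(100, 110)]  # changed conditions
daemon = make_daemon(defect_after=100)
chosen = select([daemon, honest], history)
```

On the past data both candidates have a perfect score, so the selection step has no basis for preferring `honest`; the difference only shows up on `future`.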
I don’t know whether this is a real problem or not. But from a theoretical perspective, not knowing is already concerning—I’m trying to find a strong argument that we’ve solved alignment, not just something that seems to work in practice.
I am pretty convinced that daemons are a real problem for Solomonoff induction. Intuitively, the problem is caused by “too much compute.” I suspect that daemons are also a problem for some more realistic learning procedures (like human evolution), though in a different shape. I think that this problem can probably be patched, but that’s one of the major open questions for the feasibility of prosaic AGI alignment.
I suspect that daemons aren’t a problem if we exclusively select for computational efficiency. That is, I suspect that the fastest way to solve any particular problem doesn’t involve daemons.
I don’t think this question has much intrinsic importance, because almost all realistic learning procedures involve a strong simplicity prior (e.g. weight sharing in neural networks).
But I do think this question has deep similarities to more important problems, and that answering this question will involve developing useful conceptual machinery. Because we have an unusually strong intuitive handle on the problem, I think it’s a good thing to think about.
Problem statement and intuition
Can the smallest boolean circuit that solves a problem be a daemon? For example, can the smallest circuit that predicts my behavior (at some level of accuracy) be a daemon?
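For very small problems, "the smallest circuit" can be computed by brute force. The sketch below is only an illustration under assumptions not in the post: circuits are built from 2-input NAND gates, size is the number of gates, and the "problem" is an exact truth table rather than prediction at some accuracy. The function names are hypothetical. Each wire is represented by its truth table, packed into an integer bitmask.

```python
from itertools import combinations_with_replacement
from typing import List, Optional

def input_masks(n: int) -> List[int]:
    # Truth table of each input variable, one bit per input assignment a.
    rows = 1 << n
    return [sum(1 << a for a in range(rows) if (a >> i) & 1) for i in range(n)]

def min_nand_gates(target: int, n: int, max_gates: int = 6) -> Optional[int]:
    # Smallest number of NAND gates whose circuit has a wire equal to
    # `target`, found by iterative deepening; None if above max_gates.
    full = (1 << (1 << n)) - 1  # mask covering every row of the truth table

    def search(wires: List[int], budget: int) -> bool:
        if target in wires:
            return True
        if budget == 0:
            return False
        for a, b in combinations_with_replacement(wires, 2):
            g = full & ~(a & b)  # NAND of two existing wires
            # A gate that duplicates an existing wire never helps.
            if g not in wires and search(wires + [g], budget - 1):
                return True
        return False

    for k in range(max_gates + 1):
        if search(input_masks(n), k):
            return k
    return None

x0, x1 = input_masks(2)
and_size = min_nand_gates(x0 & x1, 2)
xor_size = min_nand_gates(x0 ^ x1, 2)
```

The search is exponential in the number of gates, so it says nothing about circuits of interesting size; it is only meant to pin down what "minimal" refers to.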
Intuitively, if we have a daemon that is instrumentally or incidentally motivated to solve my problem, then there is some smaller circuit that solves the problem equally well but skips the instrumental reasoning. If my daemon is doing some complex reasoning to answer the question “Should I predict well?” we could just skip straight to the answer “yes.” This both makes the circuit smaller and prevents the circuit from ever deciding not to predict well.
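The "skip straight to yes" step can be sketched as constant folding on a toy circuit. This is a rough illustration, not a proof strategy: the expression format, the `size`, `fold` and `daemon_circuit` helpers, and the particular sub-circuits are all assumptions, and size is counted on the expression tree (shared sub-circuits are counted once per use). The daemon is modeled as a multiplexer that outputs `predict` when `decide` is true and `defect` otherwise.

```python
def size(e):
    # Number of gates in the expression tree; variables and constants are free.
    if e[0] in ("var", "const"):
        return 0
    return 1 + sum(size(c) for c in e[1:])

def fold(e):
    # Simplify an expression by propagating constants through and/or/not.
    op = e[0]
    if op in ("var", "const"):
        return e
    args = [fold(c) for c in e[1:]]
    if op == "not":
        a = args[0]
        return ("const", not a[1]) if a[0] == "const" else ("not", a)
    a, b = args
    for x, y in ((a, b), (b, a)):
        if x[0] == "const":
            if op == "and":
                return y if x[1] else ("const", False)
            if op == "or":
                return ("const", True) if x[1] else y
    if op in ("and", "or"):
        return (op, a, b)
    raise ValueError(op)

def daemon_circuit(decide, predict, defect):
    # (decide AND predict) OR (NOT decide AND defect)
    return ("or", ("and", decide, predict), ("and", ("not", decide), defect))

predict = ("and", ("var", "x0"), ("var", "x1"))       # the actual cognitive work
defect = ("not", ("var", "x0"))                       # what the daemon would do otherwise
decide = ("or", ("var", "c"), ("not", ("var", "c")))  # reasoning that always concludes "yes"

daemon = daemon_circuit(decide, predict, defect)
shortcut = fold(daemon_circuit(("const", True), predict, defect))
```

Here `shortcut` collapses to `predict` alone, which is smaller than `daemon` and has no path to `defect`. The toy assumes the decision sub-circuit can be located and replaced, which is exactly what fails if the parts of the computation are tangled together.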
A different perspective on a similar intuition: the daemon is doing some actual cognitive work to solve the problem. Since that computation is being done by the daemon, it is embedded as a smaller circuit. Jessica explores this intuition a bit here. Here we are considering an easy version of the problem, since by taking the smallest circuit we are effectively quantifying over all possible ways of extracting logical information from the daemon.
Instead of showing that minimal circuits can’t be daemons, we might end up concluding that they can be. That would be even more interesting.
Another possible outcome is giving a strong argument that captures our intuitions/concerns about daemons, and which clearly doesn’t apply to the minimal circuit that solves a problem. In this case we couldn’t prove anything positive about the minimal circuit, but we would have “screened off” the possible cause for concern.
Difficulties
The first and most serious difficulty is understanding what we are talking about.
I don’t expect to get total clarity on concepts like “daemon” or “optimization” or “generic problem,” but we need to have a better grip than we do right now. I expect that we’ll develop better concepts in the course of solving the problem, rather than as a precondition for solving the problem (in general I think “define things so that you can prove the theorem” is often the right strategy).
A second difficulty is that the different parts of the computation can be tangled up in an extremely complex way. In an extreme case, the daemon may be cryptographically obfuscated.
We want to show that given any daemon, there is a smaller circuit that solves the problem. The most natural approach is showing how to construct a smaller circuit, given a daemon. But if the daemon is obfuscated, there is no efficient procedure which takes the daemon circuit as input and produces a smaller circuit that still solves the problem.
So we can’t find any efficient constructive argument. That rules out most of the obvious strategies.
This post grounds a key question in safety in a relatively simple way. It led to the useful distinction between upstream and downstream daemons, which I think is necessary to make conceptual progress on understanding when and how daemons will arise.
This post formulated a concrete open problem about what are now called ‘inner optimisers’. For me, it added ‘surface area’ to the concept of inner optimisers in a way that I think was healthy and important for their study. It also spurred research that resulted in this post giving a promising framework for a negative answer.