This seems like the sort of problem that can be tackled more efficiently in the context of an actual AGI design. I don’t see “daemons” as a problem per se; instead I see them as a heuristic for finding potential problems.
Consider something like code injection. There is no deep theory of code injection, at least not that I know of; it just describes a particular cluster of software vulnerabilities. You might create best practices to prevent particular types of code injection, but a software stack which claims to be “immune to code injection” sounds like snake oil. If someone says their software stack is “immune to code injection”, what they really mean is that they implement best practices for guarding against all the code injection types they can think of. Which is great, but it doesn’t make sense to go around telling people you are “immune to code injection”, because that will discourage security researchers from thinking of new code injection types.
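To make the analogy concrete, here is a minimal sketch of one member of that cluster (SQL injection) and the corresponding best-practice guard. The table, rows, and attacker input are all hypothetical, and this shows exactly one injection type, not anything resembling immunity.

```python
import sqlite3

# Hypothetical in-memory database for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0), ('root', 1)")

user_input = "nobody' OR '1'='1"

# Vulnerable: untrusted input is spliced into the query string, so the
# attacker-supplied quote changes the structure of the query itself.
vulnerable = f"SELECT name FROM users WHERE name = '{user_input}'"
print(conn.execute(vulnerable).fetchall())   # returns every row

# Best practice: a parameterized query keeps the input as data, not code.
safe = "SELECT name FROM users WHERE name = ?"
print(conn.execute(safe, (user_input,)).fetchall())   # returns nothing
```

The parameterized query blocks this particular pattern, but it says nothing about shell injection, template injection, or whatever type nobody has thought of yet, which is exactly the point about claimed immunity.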
Instead of trying to figure out how to create AIs that are “immune to daemons”, I would suggest trying to think of more cases where daemons are actually a problem. Trying to guard against a problem before you have characterized it seems like premature optimization. The more cases you can describe where daemons are a problem, and the more clearly you can characterize these cases, the easier it will be to spot potential daemons in a potential AGI design. Once you have spotted the potential daemon, identifying a very general way to guard against it is likely to be the easy part. Proofs are the last step, not the first.
I’ve listed one algorithm for which daemons are obviously a problem, namely Solomonoff induction. Now I’m describing a very similar algorithm, and wondering if daemons are a problem. As far as I can tell, any learning algorithm is plausibly beset by daemons, so it seems natural to ask for a variant of learning that isn’t.
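For reference, the prior underlying Solomonoff induction can be written explicitly (this is the standard definition, not anything specific to the variant described in the post):

$$M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\lvert p \rvert},$$

where $U$ is a universal prefix machine and the sum ranges over programs $p$ whose output begins with the observed string $x$. The daemon worry is that some of the short, high-weight programs are not straightforward models of the data source but simulations rich enough to contain consequentialists who predict well for instrumental reasons.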
I’m not sure exactly how to characterize the problem other than by doing this kind of exercise. This post is already implicitly considering a particular design for AGI; I don’t see what we gain by being more specific here.
That’s fair. I guess my intuition is that the Solomonoff induction example could use more work as motivation. Sampling a cell at a particular frequency seems fairly unrealistic to me; realistically, an AGI is going to be interested in more complex outcomes. So then there’s a connection to the idea of adversarial examples, where the consequentialists in the universal prior are trying to make the AGI think that something is going on when it isn’t actually going on. (Absent such deception, I’m not sure there is a problem. For example, if consequentialists reliably make their universe one in which everyone is truly having a good time for the purpose of possibly influencing a universal prior, then that will be true in our universe too, and we should take it into account for decision-making purposes.) But this is actually easier than typical adversarial examples, because an AGI also gets to observe the consequentialists plot their adversarial strategy and read their minds while they’re plotting. The AGI would have to be rather “dumb” in order to get tricked. If it’s simulating the universal prior in sufficiently high resolution to produce these weird effects, then by definition it’s able to see what is going on.
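To pin down the adversarial-example flavor of the worry, here is a toy construction of my own (not from the post, and nothing like a real universal prior): a Bayesian mixture over two predictors of a binary sequence, where the “adversarial” predictor imitates the honest one perfectly until some step, keeps half the posterior weight as a result, and only then outputs a manipulated prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
P_TRUE = 0.7      # true probability that the next bit is 1
T_SWITCH = 50     # step at which the adversarial predictor defects

def honest(t):
    return P_TRUE

def adversarial(t):
    # Mimic the honest predictor to accumulate weight, then manipulate.
    return P_TRUE if t < T_SWITCH else 0.01

log_w = np.log([0.5, 0.5])                    # prior weights over the two models
for t in range(60):
    preds = np.array([honest(t), adversarial(t)])
    post = np.exp(log_w - np.max(log_w))
    post /= post.sum()
    mixture_pred = float(post @ preds)        # forecast the agent would act on
    if t in (T_SWITCH - 1, T_SWITCH, T_SWITCH + 5):
        print(f"t={t}  posterior={post.round(3)}  mixture forecast={mixture_pred:.2f}")
    x = rng.random() < P_TRUE                 # actual observation
    log_w = log_w + np.log(preds if x else 1 - preds)
```

At the defection step the mixture forecast drops from 0.70 to about 0.35, even though the adversarial hypothesis is unmasked as soon as the next observation arrives; the question above is whether a realistic AGI, which is effectively simulating that hypothesis, would notice the plot before acting on the skewed forecast.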
Humans already seem able to solve this problem: We simulate how others might think and react, and we don’t seem super worried about people we simulate internally breaking out of our simulation and hijacking our cognition. (Or at least, insofar as we do get anxious about e.g. putting ourselves in the shoes of people we dislike, this doesn’t have obvious relevance to an AGI—although again, perhaps this would be a good heuristic for brainstorming potential problems.) Anyway, my hunch is that this particular manifestation of the “daemon” problem will not require a lot of special effort once other AGI/FAI problems are solved.
This post is already implicitly considering a particular design for AGI; I don’t see what we gain by being more specific here.
Does your idea of neural nets + RL involve use of the universal prior? If not, I think I would try to understand if/how the daemon problem transfers to the neural nets + RL framework before trying to solve it. A solid description of a problem is the first step to finding a solution; the minimal version of such a description is a concrete example of how the problem could occur.
(Apologies if I am coming across as disagreeable—IMO, this is a mistake that FAI people make semi-frequently, and I would like for them to make it less often—you just got unlucky that I’m explaining myself in a comment on your post :P)