I’m having trouble thinking about what it would mean for a circuit to contain daemons such that we could hope for a proof. It would be nice if we could find a simple such definition, but it seems hard to make this intuition precise.
For example, we might say that a circuit contains daemons if it displays more optimization than necessary to solve a problem. Minimal circuits could have daemons under this definition though. Suppose that some function f describes the behaviour of some powerful agent, a function ~f is like f with noise added, and our problem is to predict the function ~f sufficiently well. Then the simplest circuit that does well won’t bother to memorize a bunch of noise, so it will pursue the goals of the agent described by f more efficiently than ~f does, and thus more efficiently than necessary.
I don’t know what the statement of the theorem would be. I don’t really think we’d have a clean definition of “contains daemons” and then have a proof that a particular circuit doesn’t contain daemons.
Also I expect we’re going to have to make some assumption that the problem is “generic” (or else be careful about what daemon means), ruling out problems with the consequentialism embedded in them.
(Also, see the comment thread with Wei Dai above; clearly the plausible version of this involves something more specific than daemons.)
Also I expect we’re going to have to make some assumption that the problem is “generic” (or else be careful about what daemon means), ruling out problems with the consequentialism embedded in them.
I agree. The following is an attempt to show that if we don’t rule out problems with the consequentialism embedded in them then the answer is trivially “no” (i.e. minimal circuits may contain consequentialists).
Let c be a minimal circuit that takes as input a string of length $10^{100}$ that encodes a Turing machine, and outputs a string that is the concatenation of the first $10^{100}$ configurations in the simulation of that Turing machine (each configuration is encoded as a string).
Now consider a string x′ that encodes a Turing machine that simulates some consequentialist (e.g. a human upload). For the input x′, the computation of the output of c simulates a consequentialist; and c is a minimal circuit.
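To make the input-output specification of c concrete, here is a minimal sketch in Python. It is an ordinary program, not a circuit, and certainly not a minimal one; the transition-table format, the configuration encoding, and the names `encode_config` and `run_configurations` are all assumptions made for illustration. The point it is meant to show is only that the specification forces the computation to follow whatever machine the input encodes.

```python
from typing import Dict, Tuple

# Hypothetical encoding: (state, symbol) -> (new_state, written_symbol, head_move)
Transitions = Dict[Tuple[str, str], Tuple[str, str, int]]


def encode_config(state: str, tape: Dict[int, str], head: int, blank: str) -> str:
    """Encode one configuration as 'state:head_offset:tape_contents'."""
    lo = min(min(tape, default=head), head)
    hi = max(max(tape, default=head), head)
    cells = "".join(tape.get(i, blank) for i in range(lo, hi + 1))
    return f"{state}:{head - lo}:{cells}"


def run_configurations(
    transitions: Transitions,
    tape_input: str,
    n_steps: int,
    start: str = "q0",
    blank: str = "_",
) -> str:
    """Concatenate the first n_steps configurations of the encoded machine."""
    tape = dict(enumerate(tape_input))
    state, head = start, 0
    configs = []
    for _ in range(n_steps):
        configs.append(encode_config(state, tape, head, blank))
        key = (state, tape.get(head, blank))
        if key not in transitions:
            # No applicable rule: treat the machine as halted and repeat
            # the same configuration for the remaining steps.
            continue
        state, symbol, move = transitions[key]
        tape[head] = symbol
        head += move
    return "|".join(configs)


# Toy machine (an assumption for the example): flip bits while moving right.
flip_bits: Transitions = {
    ("q0", "0"): ("q0", "1", 1),
    ("q0", "1"): ("q0", "0", 1),
}
trace = run_configurations(flip_bits, "01", n_steps=3)
```

Nothing in `run_configurations` is agent-like by itself; any consequentialism in the trace comes entirely from the machine supplied as input, which is the sense in which the problem has the consequentialism embedded in it.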
By “predict sufficiently well” do you mean “predict such that we can’t distinguish their output”?
Unless the noise is of a special form, can’t we distinguish $f$ and $\tilde{f}$ by how well they do on $f$’s goals? It seems like for this not to be the case, the noise would have to be of the form “occasionally do something weak which looks strong to weaker agents”. But then we could get this distribution by using a weak (or intermediate) agent directly, which would probably need less compute.
Suppose “predict well” means “guess the output with sufficiently high probability,” and the noise is just to replace the output with something random 5% of the time.
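A rough sketch of this noise model in Python may help. The particular `f`, the size of the output space, and the function names are assumptions chosen only for illustration. The sketch shows that a predictor which ignores the noise and simply outputs f(x) already guesses the noisy output about 95% of the time, so a small predictor has no reason to model the noise.

```python
import random


def f(x: int) -> int:
    # Hypothetical stand-in for the agent's behaviour; any deterministic
    # function with outputs in range(10) would do here.
    return (3 * x + 1) % 10


def noisy_f(x: int, rng: random.Random, noise_rate: float = 0.05,
            n_outputs: int = 10) -> int:
    # With probability noise_rate, replace the output with a uniformly
    # random one; otherwise return f(x) unchanged.
    if rng.random() < noise_rate:
        return rng.randrange(n_outputs)
    return f(x)


def accuracy_of_predicting_f(n_samples: int = 100_000, seed: int = 0) -> float:
    # Fraction of samples on which the noise-free prediction f(x)
    # matches the noisy output.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x = rng.randrange(1000)
        hits += f(x) == noisy_f(x, rng)
    return hits / n_samples


accuracy = accuracy_of_predicting_f()
```

Under these assumptions the expected accuracy is 0.95 + 0.05 × 0.1 = 0.955, since a random replacement still matches f(x) one time in ten.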
Yeah, I had something along the lines of what Paul said in mind. I wanted not to require that the circuit implement exactly a given function, so that we could see if daemons show up in the output. It seems easier to define daemons if we can just look at input-output behaviour.