It’s not a question of stopping it. Gödel is not giving it a stern look, saying: “you can’t alter your own code until you’ve done your homework”. It’s more that these considerations prevent the agent from being in a state where it will, in fact, alter its own code in certain ways. This claim can and should be proved mathematically, but I don’t have the resources to do that at the moment. In the meantime, I couldn’t fault you for disagreeing.
I’d like to understand what you’re saying here better. An agent instantiated as a binary program can do any of the following:
Rewrite its own source code with a random binary string.
Do things until it encounters a different agent, obtain its source code, and replace its own source code with that.
It seems to me that either of these would be enough to provide “complete control” over the agent’s source code in the sense that any possible program can be obtained as a result. So you must mean something different. What is it?
Rewrite its own source code with a random binary string
This is in a sense the electronic equivalent of setting oneself on fire—replacing oneself with maximum entropy. An artificial agent is extremely unlikely to “survive” this operation.
any possible program can be obtained as a result
Any possible program could be obtained, and the huge number of possible programs should hint that most are extremely unlikely to be obtained.
I assumed we were talking about an agent that is alive and kicking, with a non-negligible chance of continuing to survive. Such an agent must have a strongly non-uniform distribution over its next internal state (code included). This means that only a tiny fraction of possible programs will have any significant probability of being obtained. I believe one can give a formula for (at least an upper bound on) the expected size of this fraction (actually, the expected log size), but I also believe nobody has ever done that, so you may doubt this particular point until I prove it.
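To sketch the kind of bound I have in mind (a rough gesture only, with notation I am introducing here): if the agent’s code is $n$ bits long, there are $2^n$ candidate programs, and at most $2^k$ of them can each be reached with probability at least $2^{-k}$ (otherwise the probabilities would sum to more than 1). So

$$\frac{\#\{\text{programs obtained with probability} \ge 2^{-k}\}}{2^n} \;\le\; 2^{k-n},$$

which is astronomically small unless $k$ is close to $n$, i.e. unless the distribution over the next state is close to uniform, which for a functioning agent it is not. The entropy of the next-state distribution is one natural choice of $k$, which is why I spoke of the expected log size.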
I don’t think “surviving” is a well-defined term here. Every time you self-modify, you replace yourself with a different agent, so in that sense any agent that keeps surviving is one that does not self-modify.
Obviously, we really think that sufficiently similar agents are basically the same agent. But “sufficiently similar” is vague. Can I write a program that begins by computing the cluster of all agents similar to it, and switches to the next one (lexicographically) every 24 hours? If so, then it would eventually take on all states that are still “the same agent”.
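For concreteness, here is a rough sketch of the kind of program I have in mind (the names are mine, and the similar-cluster subroutine is deliberately left abstract; whether anything like it can be computed at all is a separate question):

```python
import time


def similar_cluster(own_source: str) -> list[str]:
    """Placeholder: return the source of every agent that counts as
    'sufficiently similar' to this one. Whether anything like this is
    computable at all is left open here."""
    raise NotImplementedError


def self_modify(new_source: str) -> None:
    """Placeholder for whatever machinery overwrites the running program
    with new_source and hands control to it."""
    raise NotImplementedError


def rotating_agent(own_source: str) -> None:
    # Compute the similarity cluster once, at startup, and fix an ordering.
    cluster = sorted(similar_cluster(own_source))
    index = cluster.index(own_source)
    while True:
        time.sleep(24 * 60 * 60)            # wait 24 hours
        index = (index + 1) % len(cluster)  # next agent, lexicographically
        self_modify(cluster[index])
```

Note that the cluster is computed once at startup and never recomputed; the loop only ever rotates through that frozen list.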
The natural objection is that there is one part of the agent’s state that is inviolate in this example: the 24-hour rotation period (if it ever self-modified to get rid of the rotation, then it would get stuck in that state forever, without “dying” in an information-theoretic sense). But I’m skeptical that this limitation can be encoded mathematically.
In addition to the rotation period, the “list of sufficiently similar agents” would become effectively non-modifiable in that case. If it ever recalculated the list, starting from a different baseline or with a different standard of “sufficiently similar”, it would not be rotating, but rather on a random walk through a much larger cluster of potential agent-types.
I don’t think “surviving” is a well-defined term here. Every time you self-modify, you replace yourself with a different agent, so in that sense any agent that keeps surviving is one that does not self-modify.
I placed “survive” in quotation marks to signal that I was aware of that, and that I meant “the other thing”. I didn’t realize that this was far from clear enough, sorry.
For lack of better shared terminology, what I meant by “surviving” is continuing to be executable. Self-modification is not suicide; you and I are doing it all the time.
Can I write a program that begins by computing the cluster of all agents similar to it, and switches to the next one (lexicographically) every 24 hours?
No, you cannot. This function is non-computable in the Turing sense: if “similar” refers to the agents’ behaviour rather than to their literal code, then deciding whether a given program belongs to the cluster is a non-trivial semantic property of programs, and by Rice’s theorem no such property is decidable.
A computable, limited version of it (whatever exactly that would be) might be possible. But this particular agent cannot modify itself “in any way it wants”, so it’s consistent with my proposition.
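One toy illustration of what a limited version could look like (my example only, not a serious proposal): take “similar” to mean “differs from my original binary in at most one byte”. That notion is purely syntactic, so it is trivially computable:

```python
def hamming_ball_radius_1(source: bytes) -> list[bytes]:
    """All byte strings that differ from `source` in at most one position:
    a crude, purely syntactic stand-in for 'sufficiently similar'."""
    cluster = [source]
    for i in range(len(source)):
        for b in range(256):
            if b != source[i]:
                cluster.append(source[:i] + bytes([b]) + source[i + 1:])
    return cluster
```

An agent rotating through this cluster can never be more than one byte away from where it started, so it clearly cannot modify itself in any way it wants.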
The natural objection is that there is one part of the agent’s state that is inviolate in this example: the 24-hour rotation period
This is a very weak limitation on the space of possible modifications. I meant a much stronger one.
But I’m skeptical that this limitation can be encoded mathematically.
This weak limitation is easy to formalize.
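For instance (one way to write it down, with notation I am introducing here): let $s_0$ be the agent’s initial state, $T$ its one-step transition map (self-modification included), $\mathrm{Reach}(s_0)$ the set of states reachable from $s_0$, and $\pi$ the projection that reads off the rotation period. The weak limitation is then just the invariant

$$\pi\big(T(s)\big) = \pi(s) \quad \text{for all } s \in \mathrm{Reach}(s_0).$$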
The stronger limitation I’m thinking of is challenging to formalize, but I’m pretty confident that it can be done.
Aha! I think this is the important bit. I’ll have to think about this, but it’s probably what the problem is.