I should probably have specified that building another agent doesn’t really count as self-modification if the other agent is identical to the original (or maybe it does count as self-modification, but in a very vacuous sense, the same way ‘do nothing’ is technically an algorithm). So if the other agent is a CDT, this is not a counter-example.
If the other agent is a more primitive approximation to a CDT then I would view constructing it not as self-modification, but simply as making a choice in an action-determined problem.
If the other agent is TDT or UDT or something then this may count as self-modification, but there is no need to make it this way.
Suppose we use the rigorous definition where an action-determined problem is just a list of choices, each of which leads to a probability distribution across possible outcomes, each of which has a utility assigned to it. In this case I think it is clear that “An ideal CDT agent that anticipates facing only action-determined problems will always choose not to self modify” is true while “An ideal CDT agent that anticipates facing only action-determined problems will always choose not to do anything” is false.
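A minimal sketch of that rigorous definition, assuming a simple Python representation (all names and numbers below are illustrative assumptions, not part of any established formalism): an action-determined problem is just a map from choices to probability distributions over outcomes with utilities, and an ideal CDT agent simply picks the choice with the highest expected utility.

```python
# Minimal sketch of the definition above. An action-determined problem is
# just a list of choices, each leading to a probability distribution over
# outcomes, each outcome carrying a utility. Names are illustrative.

# choice -> list of (probability, utility-of-outcome) pairs
ActionDeterminedProblem = dict[str, list[tuple[float, float]]]


def expected_utility(lottery: list[tuple[float, float]]) -> float:
    """Expected utility of the distribution attached to one choice."""
    return sum(p * u for p, u in lottery)


def ideal_cdt_choice(problem: ActionDeterminedProblem) -> str:
    """An ideal CDT agent facing an action-determined problem simply picks
    the choice with the highest expected utility."""
    return max(problem, key=lambda choice: expected_utility(problem[choice]))


# Illustrative example with made-up numbers.
problem = {
    "act directly": [(1.0, 10.0)],
    "do nothing": [(1.0, 0.0)],
}
assert ideal_cdt_choice(problem) == "act directly"
```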
Suppose we use the rigorous definition where an action-determined problem is just a list of choices, each of which leads to a probability distribution across possible outcomes, each of which has a utility assigned to it. In this case I think it is clear that “An ideal CDT agent that anticipates facing only action-determined problems will always choose not to self modify” is true
That’s plausible, but my counterexample still holds, apparently. I’m sure the desired theorem is true under the right hypotheses, but I can’t quite guess what they are right now.
In the cloning scenario, Tim-in-China would have to be a modified version of Tim-in-US. Tim-in-US is optimizing for a utility function U of the environment which perhaps can only be evaluated based on information available to Tim-in-US. Tim-in-China would be constructed to optimize for the best estimate of U it can make, given that it’s in China. This best estimate will be different from U. If everything important happens in China and needs quick responses, and Tim-in-US can’t move, it might even be worthwhile for Tim-in-US to sacrifice himself to create Tim-in-China.
Tim-in-China is clearly a self-modification, since the utility function is different, right?
In general, we can contrive the circumstances so the agent is paid to self-modify. If the agent is rational and it’s paid enough to self-modify, it will.
In the cloning scenario, Tim-in-China would have to be a modified version of Tim-in-US.
There’s no reason for this. A true CDT doesn’t need to see the results of its actions, it just needs to predict them. Since it’s an ideal Bayesian, it should be quite good at this. Tim-in-China might acquire new information that Tim-in-US didn’t know, causing it to revise its probability distribution, but it would not change its utility function. Nor would it cease to be a CDT, which means in practice it would not self-modify.
Also, strictly speaking, prior to the point where Tim-in-China is created the problems are not fully action-determined, since the outcome is affected by things other than random chance and the choices made by Tim-in-US.
Heat but no light this time around. I won’t reply more unless it gets better.
In the cloning scenario, Tim-in-China would have to be a modified version of Tim-in-US.
There’s no reason for this.
The world in which Tim-in-US lives determines what options are available when creating Tim-in-China, not any property of CDT, so if I’m creating the scenario I can fill in the details so there is reason for Tim-in-China to be lame in any way I choose. It could be very simple—Tim-in-US has a button to push that will both destroy Tim-in-US and set Tim-in-China into action, where Tim-in-China existed at the beginning of the scenario and is therefore whatever I want it to be. Tim-in-US cannot take any direct action other than pushing the button. Pushing the button is self-modification. If we can contrive for it to be rational for Tim-in-US to push the button, Tim-in-US will self-modify.
In a more realistic scenario, Tim-in-China might be imperfect because it is built of whatever materials are at hand, rather than the mathematically perfect substrate Tim-in-US’s mind runs on. If you want Tim-in-China to be an ideal CDT for it to qualify as self-modification, then fine: Tim-in-China is an ideal CDT, but the environment constrains things so that Tim-in-China’s utility function is not a particularly good approximation to that of Tim-in-US. If Tim-in-China’s utility function is good enough, and Tim-in-US’s ability to take direct action is impaired enough, then we can fill in the details so Tim-in-US will still benefit from self-modifying.
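To put rough numbers on “benefit from self-modifying”, here is a small sketch of the button scenario; the payoffs are invented purely for illustration, and the only point is that once Tim-in-US’s direct options are impaired enough, pressing the button maximizes expected utility under Tim-in-US’s own utility function U.

```python
# Invented payoffs for the button scenario, measured by Tim-in-US's own
# utility function U. The numbers are assumptions chosen only to show when
# pressing the button (self-modifying) is the rational choice.

options = {
    # Tim-in-US cannot take useful direct action, so staying put is worth
    # little under U.
    "do nothing": 0.0,
    # Tim-in-China acts effectively in China, but its utility function is
    # only an approximation to U, so this falls short of what a perfect
    # copy on the spot would achieve.
    "press button (create Tim-in-China, destroy Tim-in-US)": 40.0,
}

best = max(options, key=options.get)
print(best)  # the ideal CDT agent presses the button, i.e. self-modifies
```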
Also, strictly speaking, prior to the point where Tim-in-China is created the problems are not fully action-determined, since the outcome is affected by things other than random chance and the choices made by Tim-in-US.
I can’t make sense of this. Please tell me the influence on the outcome that wasn’t random chance and wasn’t a choice made by Tim-in-US. (We don’t need any randomness in this scenario.) You’ll also have to choose something that leads to it not being action-determined, and something that’s consistent with a definition of action-determined that doesn’t lead to “action-determined” referring to a useless or empty set of possibilities.
You might be referring to actions taken by Tim-in-China. Tim-in-US chose to create Tim-in-China, so all actions taken by Tim-in-China are a consequence of choices made by Tim-in-US.
You might be referring to actions taken by Tim-in-China. Tim-in-US chose to create Tim-in-China, so all actions taken by Tim-in-China are a consequence of choices made by Tim-in-US.
The thing is, there are two ways of looking at this problem. Either creating Tim-in-China is just one option available in an action-determined problem, and everything he does is just a consequence which Tim-in-US predicted; in this case it isn’t self-modification. Alternatively, he is an independent agent, in which case creating him is self-modification but the problem isn’t action-determined.
I think I’m beginning to see that you’re right: self-modification isn’t a strictly defined concept. On the other hand, very few things are strictly defined; ‘human’ and ‘AI’ certainly aren’t, but we wouldn’t be wise to ignore them when solving Friendliness.
It is possible to set up mathematical models in which self-modification is well defined (in the same way that atoms aren’t fundamental physical entities, but we can set up models in which they are, and those models are useful). The basic idea is that an agent is given a problem of some type, but prior to the problem we offer it the chance to have the problem faced by another agent instead of itself; if there is any other agent for which it would say yes, then it self-modifies on this problem.
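A minimal sketch of that test, assuming the same choices-to-lotteries representation as the earlier sketch (names are illustrative assumptions):

```python
# Sketch of the self-modification test described above: before the
# problem, the agent is offered the chance to have the problem faced by
# another agent instead of itself. If there is any other agent for which
# it would say yes, it self-modifies on this problem.

from typing import Callable

# choice -> list of (probability, utility) pairs, as in the earlier sketch
Problem = dict[str, list[tuple[float, float]]]
Agent = Callable[[Problem], str]  # an agent maps a problem to a choice


def expected_utility(lottery: list[tuple[float, float]]) -> float:
    return sum(p * u for p, u in lottery)


def would_self_modify(agent: Agent, problem: Problem,
                      other_agents: list[Agent]) -> bool:
    """True if the agent would rather hand the problem to some other
    agent, judged by the original agent's own utilities over outcomes."""
    own_value = expected_utility(problem[agent(problem)])
    return any(
        expected_utility(problem[other(problem)]) > own_value
        for other in other_agents
    )
```

On this definition, an agent that already picks the expected-utility-maximizing choice can never strictly prefer handing the same action-determined problem to another agent, which seems to be the content of the claim that an ideal CDT agent facing only action-determined problems never self-modifies.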
You’ll also have to choose something that leads to it not being action-determined, and something that’s consistent with a definition of action-determined that doesn’t lead to “action-determined” referring to a useless or empty set of possibilities.
The set of real-world, strictly action-determined problems is empty; the concept is similar to that of an ideal straight line, a useful approximation rather than a real category.
The strict definition of action-determined problem is something like this:
agent comes into existence, out of nowhere, in a way that is completely uncaused within the universe and could not have been predicted by its contents
agent is presented with list of options
agent chooses one option
agent disappears
I think the last part may not be strictly necessary, but I’m unsure. The first is necessary; it is what separates action-determined problems from broader categories like decision-determined problems and identity-determined problems.
The strict definition of action-determined problem is something like this:
agent comes into existence, out of nowhere, in a way that is completely uncaused within the universe and could not have been predicted by its contents
agent is presented with list of options
agent chooses one option
agent disappears
I think the last part may not be strictly necessary, but I’m unsure.
We seem to be agreed that it is possible to define mathematical situations in which self-modification has a well-defined meaning, and that it doesn’t have a well-defined meaning for an AI that exists in the real world and is planning actions in the real world. We don’t know how to generalize those mathematical situations so they are more relevant to the real world.
We differ in that I don’t want to generalize those mathematical situations to work with the real world. I’d rather discard them. You’d rather try to find a use for them.
I suppose clarifying all that is a useful outcome for the conversation.
Outside of ‘electron’, ‘quark’, ‘neutrino’, almost none of the words we use are well-defined in the real world. All non-fundamental concepts break if you push them hard enough.
I think such concepts are still useful, in that I have a pretty good idea of what I mean by ‘self-modification’ in the real world. For a simpler example, if I want to build a paperclipping AI, the sort of thing I’m looking to avoid is where for some reason my paperclipping AI starts making something pointless and stupid, like staples. I wish to study self-modification because I want to stop it from modifying itself into a staple-maker. I may not know exactly what counts as self-modification, but the correct response is not to ignore it and say ‘oh, I’m sure it will all work out fine either way’.
Yes, making it rigorous will be difficult. Yudkowsky himself has said he thinks that 95% of the work will be in figuring out which theorem to prove. The correct response to a difficult problem is not to run away.
Yes, making it rigorous will be difficult. Yudkowsky himself has said he thinks that 95% of the work will be in figuring out which theorem to prove. The correct response to a difficult problem is not to run away.
I’m not suggesting running away. I’m suggesting that the rigorous statement of the theorem will not include the notions of self-modification (my definition) or self-modification (your definition), since we don’t have rigorous definitions of those terms that apply outside of a counterfactual mathematical formalism.