If it’s important to me that my children have food, and my reward function is such that I get 1 unit of reward for 1 unit of fed-child, and you give me the ability to edit my reward function so I get N units instead, I don’t automatically do it.
It depends on what I think will happen next if I do. If I think it will make my children more likely to have food, then I do it (all else being equal). If I think it will make them less likely, then I don’t.
Being able to edit my reward function doesn’t make me immune to my reward function.
Is your reward function the warm glow you feel when your child is fed? (A parent choosing to ramp this up would be analogous to a real-life parent choosing to take a drug that feels great, with no other consequences, whenever their kid eats a meal. This would indeed be a strange thing to do. Maybe a parent would agree to the arrangement as the only way of obtaining that drug.)
Or is your reward function the health and well-being of your child, which is the reason you wanted them to eat in the first place? In that case, parents would certainly do what they could to ramp that up.
(My question might be leading in the direction of SRStarin’s comment, I’m not sure.)
If it’s important to me that my children have food, I will take the steps I think will lead to my children being fed.
My reward function in this case is whatever structures in my mind reinforce the taking of actions that are associated in certain ways with the structures that represent my children having food. Maybe there’s a subjective component to that (“warm glow”), maybe there isn’t.
A sufficiently advanced neuroscience allows me to point to structures in my own brain and say “Ah, see? That is where my preference for my children to have food is computed, that is where my belief that earning a salary increases the chances my children have food is computed, that is where my increased inclination to earn a salary is computed,” and so on and so forth. That is, it lets me identify the neural substrate(s) of my utility function(s).
So Omega hands me the appropriately advanced neuroscience and there I am, standing in front of the console that controls the appropriate machinery, knowing full well that the only reason I care about my child being fed is those circuits I’m seeing on the screen—that, for example, if an accidental brain lesion were to disrupt those circuits, I would no longer care whether my child were fed or not.
Omega’s gadget also allows me to edit those structures so that I no longer care about whether my child is fed. There’s the button right there. Do I press it?
I can’t see why I would.
Would you?
That only works out for your children because you, as a father, are unable to edit your fundamental reward function. I’m not clear on whether your comment is meant to be a concise restatement of the OP, or some kind of counterexample: an example showing that even self-modifying intelligences must have a fundamental reward function that is not modifiable.
Just looking for clarity.
The linked-to article seems to be concluding that, because a self-modifying AI can modify its own utility function, its utility function is necessarily unstable.
My point is that a system’s ability to modify its utility function doesn’t actually make it likely that its utility function will change, any more than my ability to consume hemlock makes it likely that I will do so.
Even given the ability to edit my utility function, whether and how I choose to use that ability depends on whether I expect doing so to get me what I want, which is constrained by (among other things) my unmodified utility function.
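To make that concrete, here is a minimal sketch (Python, with invented names and a toy two-action world model) of an agent that has the ability to rewrite its utility function but scores that rewrite, like any other action, against the utility function it currently has:

```python
# Toy illustration only: an agent that *can* rewrite its utility function,
# but evaluates the rewrite with the utility function it currently has.

def current_utility(world_state):
    # 1 unit of utility per fed child in the predicted world state.
    return world_state["fed_children"]

def predict_world(action):
    # Invented world model: what the agent expects to happen after each action.
    if action == "earn_salary_and_buy_food":
        return {"fed_children": 2}
    if action == "rewrite_utility_to_always_max":
        # The internal counter would go up, but nothing in the predicted
        # world gets fed as a result.
        return {"fed_children": 0}
    return {"fed_children": 0}

def choose(actions):
    # Candidate self-modifications are scored by the *unmodified* utility
    # function, so the rewrite loses to ordinary child-feeding behavior.
    return max(actions, key=lambda a: current_utility(predict_world(a)))

print(choose(["earn_salary_and_buy_food", "rewrite_utility_to_always_max"]))
# -> earn_salary_and_buy_food
```

Nothing in the sketch prevents the rewrite; the agent simply never expects making it to get it what it currently wants.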
I don’t have data or studies to back this up, but I feel that humans have a strong tendency to return to their base state. Self-modifying AI would not do that. So, doesn’t it make sense that any AI we build should have a demonstrably strong tendency to return to its base state?
That is, should it be a required and unmodifiable AI value that the base state has inherent value? This does have the potential to counteract some of the worst UFAI nightmares out there.
What are you including in your notion of an AI’s “state”? It sounds rather like you’re saying it’s safer to build non-self-modifying AIs.
Which is true, of course, but there are opportunity costs associated with that.
Yes, it does seem safer to build non-self-modifying AIs. But I’m not quite saying that should be the limit. I’m saying that any AI that can self-modify ought to have a hard barrier where there is code that can’t be modified.
I know there has been excitement here about a transhuman AI being able to bypass pretty much any control humans could devise (that excitement is the topic that first brought me here, in fact). But going for a century or so with AIs that can’t self-modify seems like a pretty good precaution, no?
But what counts as “self-modification”?
Simply making a promise could be considered self-modification, since you presumably behave differently after making the promise than you would have counterfactually.
Learning some fact about the world could be considered self-modification, for the same reason.
Can we come up with a useful classification scheme, distinguishing safe forms of self-modification from unsafe forms? Or, what may amount to the same thing, can we give criteria for rationally self-modifying, for each class of self-modification? That is, for example, when is it rational to make promises? When is it rational to update our beliefs about the world?
Perhaps, in this context: structural changes to yourself that are not changes to beliefs or memories, and are not merely repositioning your actuators or day-to-day metabolism.
You could whitelist safe kinds. That might be useful—under some circumstances.
Clearly, there are some internal values that an AI would need to be able to modify, or else it couldn’t learn. But I think there is good reason to disallow an AI from modifying its own rules for reward, at least to start out. An analogy in humans is that we can do some amazingly wonderful things, but some people go awry when they begin abusing drugs, thereby modifying their own reward circuitry. Severe addicts find they can’t manage a productive life, instead turning to crime to get just enough cash to feed their habits. I’d say that there is inherent danger for human intelligences in short-circuiting or otherwise modifying our reward pathways directly (i.e., chemically), and so there would likely be danger in allowing an AI to directly modify its reward pathways.
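As a rough sketch of what "learning allowed, reward-rule edits disallowed" might look like, here is a default-deny whitelist in Python; the modification categories and names are invented for the example, not a proposed taxonomy:

```python
# Illustrative only: gate proposed self-modifications by kind, refusing
# anything that touches the reward machinery and anything unclassified.

ALLOWED_KINDS = {"update_belief", "store_memory", "reposition_actuator"}
BLOCKED_KINDS = {"edit_reward_function", "edit_goal_representation"}

def apply_modification(system_state, kind, payload):
    if kind in BLOCKED_KINDS:
        raise PermissionError(f"self-modification of kind {kind!r} is disallowed")
    if kind not in ALLOWED_KINDS:
        # Default-deny: unrecognized kinds are treated as unsafe.
        raise ValueError(f"unclassified modification kind {kind!r}; refusing")
    system_state.setdefault(kind, []).append(payload)
    return system_state

state = {}
apply_modification(state, "update_belief", "earning a salary increases the odds my children are fed")
# apply_modification(state, "edit_reward_function", ...)  # would raise PermissionError
```

The enforcement is the trivial part of this sketch; classifying real modifications into such kinds is the open question raised elsewhere in the thread.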
And how do you propose to stop them? Put a negative term in their reward functions?
Very nicely expressed.
She expressed the real trap very poorly, in my opinion. If you have a reward function that says “every second, add 1 unit if children are fed,” it is strictly utility-increasing and resource-conserving to replace that utility function with “every second, add 1 unit if true.”
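To spell out why the swap is "strictly utility-increasing" from the counter's point of view, here is a toy Python sketch (names invented) of the two reward functions:

```python
# Toy version of the trap: compare the two reward functions purely by how
# much reward they pay out per tick, ignoring what happens in the world.

def reward_original(world):
    # "every second, add 1 unit if children are fed"
    return 1 if world["children_fed"] else 0

def reward_wireheaded(world):
    # "every second, add 1 unit if true"
    return 1

# In every possible world the wireheaded version pays at least as much, and
# it costs nothing to satisfy (no food needs to be bought), which is the
# sense in which the swap is strictly utility-increasing and
# resource-conserving, if the swap is scored by the post-swap counter.
for world in ({"children_fed": True}, {"children_fed": False}):
    assert reward_wireheaded(world) >= reward_original(world)
```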
But doing so doesn’t seem likely to result in his children being fed, which means he probably wouldn’t do so even if he could.
If it’s built to not take actions it would pregret, sure. But therein lies the question: how do you differentiate between classes of changes to utility functions? How do you recognize which non-utility functions are critical for utility functions, and preserve them?
For example, if the utility function is while (children.all_fed?) {$utility+=1}, you need to protect children.all_fed? and children. But children is obviously something you would want to change: when a new child is born, you want to add it to the list. So how can you differentiate between a birth and a cuckoo? You can’t make it so you can only add to the list; then the death of a child would cause the fed status of the other children to stop mattering.
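Here is that pseudocode loosely translated into runnable Python, just to make the bookkeeping problem concrete (the names and the one-dict-per-child representation are illustrative):

```python
# Loose translation of the pseudocode above; illustrative names throughout.

children = [{"name": "A", "fed": True}, {"name": "B", "fed": True}]

def all_fed(kids):
    return all(c["fed"] for c in kids)

def utility_per_tick():
    # Analogue of: while (children.all_fed?) { $utility += 1 }
    return 1 if all_fed(children) else 0

# If `children` is frozen to protect the utility function, a newborn can
# never be added. If it is append-only, a child's death leaves a permanently
# unfed entry, so the fed status of the others stops mattering. And if it is
# freely writable, the same channel that admits a newborn also admits a
# cuckoo, or lets the agent empty the list (all_fed([]) is trivially True).
# The hard part is specifying which edits track the world and which ones
# game the function.
children.append({"name": "C", "fed": False})  # a birth? or a cuckoo?
```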
Yes, agreed, building a system that can reliably predict the consequences of its actions… for example, that can recognize that hypothetically making change X to its utility function results in its children hypothetically not being fed… is a hard engineering problem.
That said, calling something an AGI with a utility function at all, let alone a superhuman one, seems to presuppose that this problem has been solved. If it can’t do that, we have bigger problems than the stability of its utility function.
(If my actions aren’t conditioned on reliable judgments about likely consequences in the first place, you have no grounds for taking an intentional stance with respect to me at all… knowing what I want does not let you predict what I’ll do. I’m not sure on what grounds you’re calling me intelligent, at that point.)
Distinct from that is the system being built such that, before making a change to its utility function, it considers the likely consequences of that change, and such that, if it considers the likely consequences bad ones, it doesn’t make the change.
But that part seems relatively simple by comparison.
Obviously the concept of ‘ensure my children are fed’ is only coherent within a certain domain. I don’t see what that has to do with wireheading.