Being able to edit my reward function doesn’t make me immune to my reward function.
She expressed the real trap very poorly, in my opinion. If you have a reward function that says “every second, add 1 unit if children are fed,” it is strictly utility-increasing and resource-conserving to replace that utility function with “every second, add 1 unit if true.”
If it’s built to not take actions it would pregret, sure. But therein lies the question: how do you differentiate between classes of changes to utility functions? How do you recognize which non-utility functions are critical for utility functions, and preserve them?
For example, if the utility function is while (children.all_fed?) {$utility+=1}, you need to protect children.all_fed? and children. But children is obviously something you would want to change- when you birth a new child, you want to add it to the list. So how can you differentiate between birth and a cuckoo? You can’t make it so you only add to the list- then the death of a child will cause the fed status of the other children to not matter.
Yes, agreed, building a system that can reliably predict the consequences of its actions… for example, that can recognize that hypothetically making X change to its utility function results in its children hypothetically not being fed… is a hard engineering problem.
That said, calling something an AGI with a utility function at all, let alone a superhuman one, seems to presuppose that this problem has been solved. If it can’t do that, we have bigger problems than the stability of its utility function.
(If my actions aren’t conditioned on reliable judgments about likely consequences in the first place, you have no grounds for taking an intentional stance with respect to me at all… knowing what I want does not let you predict what I’ll do. I’m not sure on what grounds you’re calling me intelligent, at that point.)
Distinct from that is the system being built such that, before making a change to its utility function, it considers the likely consequences of that change, and such that, if it considers the likely consequences bad ones, it doesn’t make the change.
But that part seems relatively simple by comparison.
She expressed the real trap very poorly, in my opinion. If you have a reward function that says “every second, add 1 unit if children are fed,” it is strictly utility-increasing and resource-conserving to replace that utility function with “every second, add 1 unit if true.”
But doing so doesn’t seem likely to result in his children being fed, which means he probably wouldn’t do so even if he could.
If it’s built to not take actions it would pregret, sure. But therein lies the question: how do you differentiate between classes of changes to utility functions? How do you recognize which non-utility functions are critical for utility functions, and preserve them?
For example, if the utility function is while (children.all_fed?) {$utility+=1}, you need to protect children.all_fed? and children. But children is obviously something you would want to change- when you birth a new child, you want to add it to the list. So how can you differentiate between birth and a cuckoo? You can’t make it so you only add to the list- then the death of a child will cause the fed status of the other children to not matter.
Yes, agreed, building a system that can reliably predict the consequences of its actions… for example, that can recognize that hypothetically making X change to its utility function results in its children hypothetically not being fed… is a hard engineering problem.
That said, calling something an AGI with a utility function at all, let alone a superhuman one, seems to presuppose that this problem has been solved. If it can’t do that, we have bigger problems than the stability of its utility function.
(If my actions aren’t conditioned on reliable judgments about likely consequences in the first place, you have no grounds for taking an intentional stance with respect to me at all… knowing what I want does not let you predict what I’ll do. I’m not sure on what grounds you’re calling me intelligent, at that point.)
Distinct from that is the system being built such that, before making a change to its utility function, it considers the likely consequences of that change, and such that, if it considers the likely consequences bad ones, it doesn’t make the change.
But that part seems relatively simple by comparison.
Obviously the concept of ‘ensure my children are fed’ is only coherent within a certain domain. I don’t see what that has to do with wire heading.