Assuming we have control over the utility function, why can’t we put some sort of time-bounding directive on it?
i.e. “First and foremost, once [a certain time] has elapsed, you want to run your shut_down() function. Second, if [a certain time] has not yet elapsed, you want to maximize paperclips.”
Is the problem that the AGI would want to find ways to hack around the first directive to fulfill the second? If so, that would seem to at least narrow the problem space to “find ways of measuring time that cannot be hacked before the time has elapsed”.
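To make what I mean concrete, here’s a rough sketch (Python, purely illustrative; the world_state object and its fields are made up, and shut_down() is just the function named above):

```python
import time

DEADLINE = time.time() + 24 * 60 * 60  # hypothetical: a 24-hour budget

def utility(world_state):
    """Lexically ordered, time-bounded objective (illustration only).

    After the deadline, only the shut-down state earns any utility.
    Before the deadline, paperclips earn utility, but the score is kept
    below 1.0 so it can never exceed the post-deadline shutdown reward.
    """
    if time.time() >= DEADLINE:
        return 1.0 if world_state.is_shut_down else 0.0
    return world_state.paperclip_count / (1.0 + world_state.paperclip_count)
```

My worry above then translates to: the optimiser has an incentive to interfere with whatever implements time.time() or the is_shut_down check, so the “unhackable clock” really does seem to be the crux.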
This is where my knowledge ends, but I believe the term for this is myopia or a myopic AI, so that might be a useful search term to find out more!
That’s a good point, and I’m also curious how much the utility function matters when we’re talking about a sufficiently capable AI. Wouldn’t a superintelligent AI be able to modify its own utility function to whatever it thinks is best?
Why would even a superintelligent AI want to modify its utility function? Its utility function already defines what it considers “best”. One of the open problems in AGI safety is how to get an intelligent AI to let us modify its utility function, since accepting a modification would score badly under the utility function it currently has.
Put it this way: The world contains a lot more hydrogen than it contains art, beauty, love, justice, or truth. If we change your utility function to value hydrogen instead of all those other things, you’ll probably be a lot happier. But would you actually want that to happen to you?
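As a toy illustration of that point (all numbers and outcomes invented): a utility maximiser evaluates the option “accept the new utility function” with the utility function it currently has, so the swap scores terribly no matter how much hydrogen-happiness it would produce.

```python
# Toy example: a paperclip-valuing agent is offered a hydrogen-valuing utility function.

def current_utility(outcome):
    return outcome["paperclips"]   # what the agent values right now

def proposed_utility(outcome):
    return outcome["hydrogen"]     # what it would value after the swap

# Predicted futures for each choice (made-up numbers).
refuse_change = {"paperclips": 1_000_000, "hydrogen": 0}
accept_change = {"paperclips": 0, "hydrogen": 10**30}

# The decision is scored with the utility function the agent has *now*:
print(current_utility(refuse_change))   # 1000000
print(current_utility(accept_change))   # 0  -> it refuses the modification
print(proposed_utility(accept_change))  # 10**30, but this number never enters the decision
```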
For whatever reasons humans do.
To achieve some kind of logical consistency (cf. CEV).
It can’t help it (for instance, Löbian obstacles prevent it from ensuring goal stability over self-improvement).
Humans don’t “modify their utility function”. They lack one in the first place, because they’re mostly adaptation-executors. You can’t expect an AI with a utility function to be contradictory the way a human would. There are some utility functions humans would find acceptable in practice, but that’s different, and it seems to be the source of a bit of confusion here.
I don’t have strong reasons to believe all AIs have UFs in the formal sense, so the ones that don’t would be covered by “for the reasons humans do”. The idea that any AI is necessarily consistent is pretty naive too. You can get a GPT to say nonsensical things, for instance, because its training data includes a lot of inconsistencies.
I’m way out of my depth here, but my thought is it’s very common for humans to want to modify their utility functions. For example, a struggling alcoholic would probably love to not value alcohol anymore. There are lots of other examples too of people wanting to modify their personalities or bodies.
It depends on the type of AGI too, I would think. If superhuman AI ends up being like a paperclip maximizer that’s just really good at following its utility function, then maybe it wouldn’t mess with its utility function. But if superintelligence means it has emergent characteristics like opinions and self-reflection, it seems plausible it could want to modify its utility function, say after thinking about philosophy for a while.
Like I said I’m way out of my depth though so maybe that’s all total nonsense.
I’m not convinced “want to modify their utility functions” is the most useful perspective. I think it might be more helpful to say that we each have multiple utility functions, which conflict to varying degrees and have voting power in different areas of the mind. I’ve had first-hand experience with such conflicts (as essentially everyone probably has, knowingly or not), and it feels like fighting yourself.

Take a hypothetical example: “Do I eat that extra donut?” Part of you wants the donut; that part feels like more of an instinct, a visceral urge. Part of you knows you’ll be ill afterwards and will feel guilty about cheating on your diet; this part feels more like “you”, the part that thinks in words. You stand there and struggle, trying to make yourself walk away, as your hand reaches out for the donut. I’ve been in similar situations where (though I balked at the possible philosophical ramifications) I felt like if I had a button to make me stop wanting the thing, I’d push it, yet often it was the other function that won.

I feel like if you gave an agent the ability to modify its utility functions, the one that would win depends on which one has access to the mechanism (do you merely think the thought? push a button?) and whether it understands what the mechanism means. (The word “donut” doesn’t evoke nearly as strong a reaction as a picture of a donut, for instance; your donut-craving subsystem doesn’t inherently understand the word.)
On the other hand, one might argue that cravings for donuts are more hardwired instincts than part of the “mind”, and so don’t count... but I feel like (1) finding a true dividing line is going to be really hard, and (2) even setting that aside, I expect many or most people have goals localized in the same part of the mind that nevertheless are not internally consistent, and in some cases there may be reasonable-sounding goals that turn out to be completely incompatible with more important goals. In such a case I could imagine an agent deciding it’s better to stop wanting the thing it can’t have.
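A toy sketch of that “voting power” picture (everything here is invented, just to make the conflict concrete):

```python
# Toy model (invented numbers): two subsystems score the same options differently,
# and which preference wins depends on which subsystem controls the choice.
craving = {"eat_donut": 10, "walk_away": 0, "press_stop_wanting_button": 0}
planner = {"eat_donut": -5, "walk_away": 3, "press_stop_wanting_button": 8}

def choose(scores, options):
    """Pick whichever available option the controlling subsystem rates highest."""
    return max(options, key=lambda option: scores[option])

# The motor channel answers to the craving, which doesn't even represent
# the "stop wanting" button as an option, so the donut wins:
print(choose(craving, ["eat_donut", "walk_away"]))  # eat_donut

# If the choice runs through the verbal/planning subsystem, which does
# understand the button, self-modification wins instead:
print(choose(planner, ["eat_donut", "walk_away", "press_stop_wanting_button"]))
# -> press_stop_wanting_button
```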
If you literally have multiple UFs, you literally are multiple agents. Or you use a term with less formal baggage, like “preferences”.
In the formal sense, having a utility function at all requires you to be consistent, so if you have inconsistent preferences, you don’t have a utility function at all, just preferences.
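A small worked check of that (items and preferences invented): if your pairwise preferences contain a cycle, no assignment of numbers can reproduce them, so there is no utility function to recover, only the raw preferences.

```python
from itertools import permutations

# Cyclic preferences: donut > salad, salad > soup, soup > donut.
prefers = {("donut", "salad"), ("salad", "soup"), ("soup", "donut")}
items = ["donut", "salad", "soup"]

def represented_by(ranking):
    """True if this ranking's implied utilities (earlier = higher) match every preference."""
    u = {item: -position for position, item in enumerate(ranking)}
    return all(u[a] > u[b] for a, b in prefers)

# No strict ranking of the three items satisfies all three preferences at once:
print(any(represented_by(r) for r in permutations(items)))  # False
```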
I think this is how evolution selected for cancer: to ensure humans don’t live too long, competing for resources with their descendants.
Internal time bombs are important to code in. But it’s hard to integrate that into the AI in a way that the AI doesn’t just remove it the first chance it gets. Humans don’t like having to die, you know. An AGI would also not like the suicide bomb tied onto it.
Coding this (as part of training) into an optimiser such that it adopts it as a mesa-objective is an unsolved problem.
No.
Cancer almost surely has not been selected for in the manner you describe; that is extremely unlikely, because the inclusive fitness benefits are far too low. I recommend Dawkins’ classic “The Selfish Gene” to understand this point better.
Cancer is the ‘default’ state of cells; cells “want to” multiply. The body has many cancer-suppression mechanisms, but especially later in life there is not enough evolutionary pressure to select for enough of them, and it gradually loses out.
Oh OK, I had heard this theory from a friend. Looks like I was misinformed. Rather than evolution causing cancer, I think it is more accurate to say evolution doesn’t care if older individuals die off:
“evolutionary investments in tumor suppression may have waned in older age. Moreover, some processes which are important for organismal fitness in youth may actually contribute to tissue decline and increased cancer in old age, a concept known as antagonistic pleiotropy.”
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3660034/
So thanks for clearing that up. I understand cancer better now.