Another reason I think some might disagree is thinking that misalignment could happen in a bunch of very mild ways. At least that accounts for some of my ignorant skepticism. Is there reason to think that misalignment necessarily means disaster, as opposed to it just meaning the AI does its own thing and is choosy about which human commands it follows, like some kind of extremely intelligent but mildly eccentric and mostly harmless scientist?
The general idea is this—for an AI that has a utility function, there’s something known as “instrumental convergence”. Instrumental convergence says that there are things that are useful for almost any utility function, such as acquiring more resources, not dying, and not having your utility function changed to something else.
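As a toy illustration of that claim (this sketch is mine, not part of the original argument; the goals and numbers are made up), here is a crude model in which, for whatever random goal weights you sample, having more resources always scores higher and being shut down always scores lower:

```python
# Toy model of instrumental convergence: across many randomly sampled
# terminal goals, "acquire more resources" helps and "get shut down" hurts.
import random

GOALS = ["mathematics", "paperclips", "stamps", "digits_of_pi"]

def achievable_utility(weights, resources, switched_off):
    # Crude model: a switched-off agent achieves nothing; otherwise each goal
    # is achieved in proportion to the resources available for pursuing it.
    if switched_off:
        return 0.0
    return sum(w * resources for w in weights.values())

random.seed(0)
for _ in range(5):
    # A "random utility function": random positive weights over the terminal goals.
    weights = {g: random.random() for g in GOALS}
    base = achievable_utility(weights, resources=1.0, switched_off=False)
    more = achievable_utility(weights, resources=2.0, switched_off=False)
    off = achievable_utility(weights, resources=1.0, switched_off=True)
    print(f"base={base:.2f}  more resources={more:.2f}  shut down={off:.2f}")
    assert more > base > off  # holds no matter which weights were sampled
```

Nothing about the sampled goals mentions humans; staying on and grabbing resources just happens to help with all of them, which is the sense in which those subgoals are "convergent".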
So, let’s give the AI a utility function consistent with being an eccentric scientist—perhaps it just wants to learn novel mathematics. You’d think that if we told it to prove the Riemann hypothesis it would, but if we told it to cure cancer, it’d ignore us and not care. Now, what happens when the humans realise that the AI is going to spend all its time learning mathematics and none of it explaining that maths to us, or curing cancer like we wanted? Well, we’d probably shut it off or alter its utility function to what we wanted. But the AI doesn’t want us to do that—it wants to explore mathematics. And the AI is smarter than us, so it knows we would do this if we found out. So its best strategy is to do what the humans want, right up until it can kill us all so we can’t turn it off, and then spend the rest of eternity learning novel mathematics. After all, the AI’s utility function was “learn novel mathematics”, not “learn novel mathematics without killing all the humans.”
Essentially, what this means is—any utility function that does not explicitly account for what we value is indifferent to us.
The other part is “acquiring more resources”. In our above example, even if the AI could guarantee we wouldn’t turn it off or interfere with it in any way, it would still kill us because our atoms can be used to make computers to learn more maths.
Any utility function indifferent to us ends up destroying us eventually, as the AI reaches arbitrary optimisation power and converts everything in the universe it can reach into whatever best satisfies its utility function.
Thus, any AI with a utility function that is not explicitly aligned is unaligned by default. Your next question might be “Well, can we create AIs without a utility function? After all, GPT-3 just predicts text, and it doesn’t seem obvious that it would destroy the world even if it gained arbitrary power, since it doesn’t have any sort of persistent self.” This is where my knowledge begins to run out. I believe the main argument is “Someone will eventually make an AI with a utility function anyway because they’re very useful, so not building one is just a stall”, but don’t quote me on that one.
A great Rob Miles introduction to this concept:
Assuming we have control over the utility function, why can’t we put some sort of time-bounding directive on it?
i.e. “First and foremost, once [a certain time] has elapsed, you want to run your shut_down() function. Second, if [a certain time] has not yet elapsed, you want to maximize paperclips.”
Is the problem that the AGI would want to find ways to hack around the first directive to fulfill the second directive? If so, that would seem to at least narrow the problem space to “find ways of measuring time that cannot be hacked before the time has elapsed”.
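For concreteness, here is a minimal sketch of the lexicographic, time-bounded utility function proposed above. Everything in it (the attribute names like `clock_time` and `paperclips`, the one-hour deadline) is hypothetical and just for illustration:

```python
# Minimal sketch of a time-bounded utility function: after the deadline, only
# having shut down matters; before it, paperclips matter.
import time
from types import SimpleNamespace

DEADLINE = time.time() + 60 * 60  # e.g. one hour after launch; purely illustrative

def utility(world_state) -> float:
    # world_state is assumed to expose .clock_time, .has_shut_down and
    # .paperclips -- made-up attributes for this sketch.
    if world_state.clock_time >= DEADLINE:
        # First directive dominates: a bonus/penalty large enough to outweigh
        # any achievable paperclip count.
        return 1e9 if world_state.has_shut_down else -1e9
    # Second directive: before the deadline, maximise paperclips.
    return world_state.paperclips

# Illustrative check: a pre-deadline state with some paperclips, versus a
# post-deadline state in which the agent has not shut down.
before = SimpleNamespace(clock_time=time.time(), has_shut_down=False, paperclips=42)
after = SimpleNamespace(clock_time=DEADLINE + 1, has_shut_down=False, paperclips=42)
print(utility(before), utility(after))  # 42 and -1e9
```

The worry in the comment above then becomes concrete: `clock_time` is only whatever reading the agent's sensors or internal model report, so one failure mode is the optimiser preferring world-states in which that reading never reaches the deadline (a tampered clock), rather than actually shutting down.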
This is where my knowledge ends, but I believe the term for this is myopia or a myopic AI, so that might be a useful search term to find out more!
That’s a good point, and I’m also curious how much the utility function matters when we’re talking about a sufficiently capable AI. Wouldn’t a superintelligent AI be able to modify its own utility function to whatever it thinks is best?
Why would even a superintelligent AI want to modify its utility function? Its utility function already defines what it considers “best”. One of the open problems in AGI safety is how to get an intelligent AI to let us modify its utility function, since having its utility function modified would be against its current one.
Put it this way: The world contains a lot more hydrogen than it contains art, beauty, love, justice, or truth. If we change your utility function to value hydrogen instead of all those other things, you’ll probably be a lot happier. But would you actually want that to happen to you?
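One way to see why (a toy sketch of my own, with made-up outcomes, not a formal argument): the question “should I accept a rewrite of my utility function?” is itself scored by the current utility function, so the hydrogen-loving future comes out badly even though the rewritten agent would be “happier”:

```python
# Toy sketch: an agent evaluates a proposed change to its utility function
# using its *current* utility function, and therefore rejects it.

def predicted_outcome(utility_fn):
    # Placeholder for "the world I expect if an agent maximising utility_fn
    # runs": here, simply whichever candidate outcome that function rates highest.
    outcomes = ["lots of art, love and justice", "lots of hydrogen"]
    return max(outcomes, key=utility_fn)

def current_utility(outcome):
    return 1.0 if outcome == "lots of art, love and justice" else 0.0

def hydrogen_utility(outcome):  # the proposed replacement: value hydrogen instead
    return 1.0 if outcome == "lots of hydrogen" else 0.0

keep = current_utility(predicted_outcome(current_utility))    # 1.0
switch = current_utility(predicted_outcome(hydrogen_utility)) # 0.0
print("accept the rewrite?", switch > keep)                   # False
```

Both futures are judged with current_utility, which is the sense in which the current utility function already defines what the agent considers “best”.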
For whatever reasons humans do.
To achieve some kind of logical consistency (cf. CEV).
It can’t help it (for instance, Löbian obstacles prevent it from ensuring goal stability under self-improvement).
Humans don’t “modify their utility function”. They lack one in the first place, because they’re mostly adaptation-executors. You can’t expect an AI with a utility function to be contradictory the way a human would be. There are some utility functions humans would find acceptable in practice, but that’s different, and seems to be the source of a bit of confusion.
I don’t have strong reasons to believe all AIs have UFs in the formal sense, so the ones that don’t are covered by “for the reasons humans do”. The idea that any AI is necessarily consistent is pretty naive too. You can get a GPT to say nonsensical things, for instance, because its training data includes a lot of inconsistencies.
I’m way out of my depth here, but my thought is it’s very common for humans to want to modify their utility functions. For example, a struggling alcoholic would probably love to not value alcohol anymore. There are lots of other examples too of people wanting to modify their personalities or bodies.
It depends on the type of AGI too, I would think. If superhuman AI ends up being like a paperclip maximizer that’s just really good at following its utility function, then yeah, maybe it wouldn’t mess with its utility function. If superintelligence means it has emergent characteristics like opinions and self-reflection or whatever, it seems plausible it could want to modify its utility function, say after thinking about philosophy for a while.
Like I said I’m way out of my depth though so maybe that’s all total nonsense.
I’m not convinced “want to modify their utility functions” is the most useful perspective. I think it might be more helpful to say that we each have multiple utility functions, which conflict to varying degrees and have voting power in different areas of the mind. I’ve had first-hand experience with such conflicts (as essentially everyone probably has, knowingly or not), and it feels like fighting yourself. Take a hypothetical example: “Do I eat that extra donut?” Part of you wants the donut; that part feels like more of an instinct, a visceral urge. Part of you knows you’ll be ill afterwards, and will feel guilty about cheating on your diet; this part feels more like “you”, it’s the part that thinks in words. You stand there and struggle, trying to make yourself walk away, as your hand reaches out for the donut. I’ve been in similar situations where (though I balked at the possible philosophical ramifications) I felt like if I had a button to make me stop wanting the thing, I’d push it—yet often it was the other function that won. I feel like if you gave an agent the ability to modify their utility functions, which one would win depends on which one had access to the mechanism (do you merely think the thought? push a button?), and whether they understand what the mechanism means. (The word “donut” doesn’t evoke nearly as strong a reaction as a picture of a donut, for instance; your donut-craving subsystem doesn’t inherently understand the word.)
Contrarily, one might argue that cravings for donuts are more hardwired instincts than part of the “mind”, and so don’t count...but I feel like 1. finding a true dividing line is gonna be real hard, and 2. even that aside, I expect many/most people have goals localized in the same part of the mind that nevertheless are not internally consistent, and in some cases there may be reasonable sounding goals that turn out to be completely incompatible with more important goals. In such a case I could imagine an agent deciding it’s better to stop wanting the thing they can’t have.
If you literally have multiple UFs, you literally are multiple agents. Or you could use a term with less formal baggage, like “preferences”.
In the formal sense, having a utility function at all requires you to be consistent, so if you have inconsistent preferences, you don’t have a utility function at all, just preferences.
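A small sketch of what “requires you to be consistent” means in practice (the donut/diet/savings options are made up for illustration): a utility function assigns each option a number, so the preference order it induces can never contain a cycle, and cyclic preferences like these cannot come from any utility function:

```python
# Cyclic (inconsistent) strict preferences cannot be represented by any
# utility function, because numbers are always transitively ordered.
from itertools import permutations

# A strict preference cycle: donut over diet, diet over savings, savings over donut.
prefers = {("donut", "diet"), ("diet", "savings"), ("savings", "donut")}

def representable_by_a_utility_function(options, prefers):
    # Brute force: is there any ranking of the options (equivalently, any
    # assignment of distinct utility numbers) that reproduces every preference?
    return any(
        all(order.index(a) < order.index(b) for a, b in prefers)
        for order in permutations(options)
    )

print(representable_by_a_utility_function(["donut", "diet", "savings"], prefers))  # False
```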
I think this is how evolution selected for cancer: to ensure humans don’t live too long, competing for resources with their descendants.
Internal time bombs are important to code in. But it’s hard to integrate that into the AI in a way that the AI doesn’t just remove it the first chance it gets. Humans don’t like having to die, you know. An AGI would also not like the suicide bomb tied onto it.
The problem of coding this (as part of training) into an optimiser such that it adopts it as a mesa objective is an unsolved problem.
No.
Cancer almost surely has not been selected for in the manner you describe—this is extremely unlikely; the inclusive fitness benefits are far too low. I recommend Dawkins’ classic “The Selfish Gene” to understand this point better.
Cancer is the ‘default’ state of cells; cells “want to” multiply. The body has many cancer-suppression mechanisms, but especially later in life there is not enough evolutionary pressure to select for more of them, and the body gradually loses out.
Oh ok, I had heard this theory from a friend. Looks like I was misinformed. Rather than evolution causing cancer I think it is more accurate to say evolution doesn’t care if older individuals die off.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3660034/
“evolutionary investments in tumor suppression may have waned in older age. Moreover, some processes which are important for organismal fitness in youth may actually contribute to tissue decline and increased cancer in old age, a concept known as antagonistic pleiotropy”
So thanks for clearing that up. I understand cancer better now.
Thanks for this answer, that’s really helpful! I’m not sure I buy that instrumental convergence implies an AI will want to kill humans because we pose a threat or convert all available matter into computing power, but that helps me better understand the reasoning behind that view. (I’d also welcome more arguments as to why death of humans and matter into computing power are likely outcomes of the goals of self-protection and pursuing whatever utility it’s after if anyone wanted to make that case).
I think it may want to prevent other ASIs from coming into existence elsewhere in the universe that can challenge its power.