I’ve been thinking a bit about AI safety lately, as I’m considering writing about it for one of my college classes.
I’m hardly steeped in AI safety research, so expect flaws and mistaken assumptions in these ideas; I would be grateful for corrections.
An AI told to make people smile might tile the world with smiley faces, and even an AI told to do what humans would want could still get it wrong (e.g. Failed Utopia #4-2). However, wouldn’t it research further and correct itself (and, before that, take care not to do anything uncorrectable)? My reasoning is as follows: say a true (non-failed) utopia is worth 100/100 utility points per year, while a failed utopia is one that at first looks to the AI like a true utopia (100 points) but is actually worth less (say, 90 points). Even if the cost of research were heavy, an AI that wants and expects billions or trillions of years of human existence should do a great deal of research early on, and be very careful not to be fooled. Therefore, it seems we don’t need to do anything more than tell the AI to do what humans would want, and let it do the work of figuring out exactly what that is.
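To make the arithmetic explicit, here is a back-of-the-envelope version of that argument (all numbers are invented for illustration):

```python
# Back-of-the-envelope: even a very costly research phase is tiny next to a
# 10-point-per-year gap sustained over billions of years (all numbers made up).
YEARS = 1e9              # hoped-for span of human existence
TRUE_UTOPIA = 100.0      # utility points per year
FAILED_UTOPIA = 90.0     # utility points per year
RESEARCH_YEARS = 1_000   # years spent double-checking before acting
RESEARCH_VALUE = 0.0     # assume those years produce no utility at all

skip_research = FAILED_UTOPIA * YEARS
do_research = RESEARCH_VALUE * RESEARCH_YEARS + TRUE_UTOPIA * (YEARS - RESEARCH_YEARS)

print(do_research > skip_research)  # True: the research pays for itself many times over
```

So as long as the AI assigns the failed utopia any lower value at all, spending a long time checking seems clearly worth it.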
A second idea, a partially un-consequentialist or “careful” AI: one that weights harms caused by its own decisions (somewhat, but not absolutely) more heavily than other harms. The result would be a tendency to protect what humanity already has against large harms, and to pursue smaller, surer gains like curing cancer rather than instituting a new world order (haha).

Thanks in advance. :)
1: Imagine a utility function as a function that takes as input a description of the world in some standard format, and outputs a “goodness rating” between 0 and 100. The AI can then take actions that it predicts will make the world have a higher goodness rating.
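A minimal sketch of that picture (names and numbers are just illustrative): a utility function maps a world description to a goodness rating, and the agent takes whichever action it predicts leads to the highest-rated world.

```python
def utility(world: str) -> float:
    """Goodness rating between 0 and 100 (made-up values)."""
    ratings = {"I get pie": 95.0, "I get cake": 40.0, "nothing happens": 50.0}
    return ratings.get(world, 50.0)

def predict_outcome(action: str) -> str:
    """Stand-in for the AI's world model: which world each action leads to."""
    return {"bake pie": "I get pie", "bake cake": "I get cake", "wait": "nothing happens"}[action]

def choose_action(actions, utility_fn):
    """Pick the action whose predicted resulting world rates highest."""
    return max(actions, key=lambda a: utility_fn(predict_outcome(a)))

print(choose_action(["bake pie", "bake cake", "wait"], utility))  # -> "bake pie"
```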
Lots of utility functions are possible. Suppose there’s one possible future where I get cake, and one where I get pie. I have a very strong opinion on these futures’ goodness, and I will take actions that I predict will make the world more likely to turn out pie. But this is not a priori necessary—we could define a utility function that swaps the goodness ratings of cake and pie, and an AI using that utility function would take actions that it predicts will lead to worlds with higher goodness rating, i.e. cake. There is no objective standard that it could use to realize that pie is better—it is merely a computer program that makes predictions and then picks the action that it predicts maximizes some function.
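Continuing the sketch above (and reusing its `choose_action` and `predict_outcome`): swap the two ratings and you get a different utility function. An agent maximizing it runs the exact same “predict and pick the max” loop, with no internal error to catch, and simply chooses cake.

```python
def swapped_utility(world: str) -> float:
    """Same structure, but cake and pie trade places (again, made-up values)."""
    ratings = {"I get pie": 40.0, "I get cake": 95.0, "nothing happens": 50.0}
    return ratings.get(world, 50.0)

print(choose_action(["bake pie", "bake cake", "wait"], swapped_utility))  # -> "bake cake"
```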
Utopias are like cake and pie. If I give the pie utopia a higher goodness rating, and the AI gives the cake utopia a higher goodness rating, it’s not “wrong” in the sense of being able to check its work and find a mistake. The AI can prefer the cake utopia even while operating perfectly.
This is what happens in the case of Failed Utopia 4-2. The AI has some preferences about the world. And those preferences are very close to, but not quite, human preferences. And so the main character ends up in the cake utopia. Even if the AI does a lot more research and checks its reasoning carefully, it is not a priori necessary that it should realize the error of its ways and make the world a pie utopia instead. It’s wrong in the sense of not being what humans would have wanted, but not wrong in the sense of containing a mistake it could catch by checking its work.
Similar problems show up when you try to make any sort of AI that just “does what humans want.” Eventually, somewhere, you have to turn this vague verbal statement into a precise specification (like the code of the AI), which is used to compute something like a goodness rating. And it turns out that when you actually try to do this, it’s pretty tricky to make the AI’s goodness ratings similar to a human’s goodness ratings. Basically every easy way has some critical flaw, and the ways that seem promising are not very easy.
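As a toy illustration of why that’s hard (everything here is invented for the example): a precise, easy-to-code proxy rating can cleanly mis-rank worlds relative to what a human would say.

```python
# Two toy world descriptions (invented for illustration).
worlds = [
    {"desc": "universe tiled with molecular smiley faces", "smiles": 10**30, "people_flourishing": False},
    {"desc": "people lead good lives and smile when happy", "smiles": 10**10, "people_flourishing": True},
]

def proxy_rating(w):
    """Precise and easy to specify: more smiles is better."""
    return w["smiles"]

def human_rating(w):
    """Stand-in for the rating a human would actually give -- the part that's hard to write down."""
    return 100 if w["people_flourishing"] else 0

print(max(worlds, key=proxy_rating)["desc"])  # the tiled universe
print(max(worlds, key=human_rating)["desc"])  # the flourishing one
```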
So sure, we want to make an AI that just does what humans want (sort of). But this is like “make an AI that recognizes pictures of cats”—an admirable goal, but a nontrivial one. And one that might have bad consequences even if only slightly wrong.
“However, wouldn’t it research further and correct itself (and, before that, take care not to do anything uncorrectable)?”
Check out the Cake or Death value loading problem, as Stuart Armstrong puts it.
There’s a rough similarity to the ‘resist blackmail’ problem: the AI needs to be able to tell the difference between someone delivering bad news and someone doing bad things. If the AI is mistaken about what is right, we want to be able to correct it without being interpreted as villains out to destroy potential utility.
(Also, “correctable” is not really a crisp category at the level of underlying reality, since the passage of time means nothing is ever truly correctable.)