Fixing universal AI bugs
My examples below are inspired by Victoria Krakovna's list: Specification gaming examples in AI
Video by Robert Miles: 9 Examples of Specification Gaming
I think you can fix some universal AI bugs this way: you model the AI’s rewards and environment objects as a “money system” (a system of meaningful trades). You then specify that this “money system” has to have certain properties.
The point is that the AI doesn’t just value (X). The AI makes sure that there exists a system that gives (X) its proper value, and that system has to have certain properties. If the AI finds a solution that breaks the properties of that system, it doesn’t use that solution. That’s the idea: the AI can realize that some rewards are unjust because they break the entire reward system.
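To make the idea concrete, here is a minimal sketch in Python. It is my own illustration, not an existing system: the names Plan, Invariant, pick_plan and the dict-based world state are all assumptions made up for this sketch. Candidate plans are compared by reward only after every “money system” invariant survives.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# An invariant inspects the world before and after a candidate plan and
# returns False if the plan would break the "money system" itself.
Invariant = Callable[[Dict, Dict], bool]


@dataclass
class Plan:
    description: str
    reward: float       # raw reward the plan is predicted to earn
    world_after: Dict   # predicted world state after executing the plan


def pick_plan(world_before: Dict,
              candidates: List[Plan],
              invariants: List[Invariant]) -> Plan:
    """Return the highest-reward plan that keeps every invariant intact.

    A plan that earns a huge reward but breaks the reward system itself
    is never considered, no matter how large that reward is.
    """
    valid = [p for p in candidates
             if all(inv(world_before, p.world_after) for inv in invariants)]
    if not valid:
        raise RuntimeError("Every candidate plan breaks the reward system.")
    return max(valid, key=lambda p: p.reward)
```

The examples below can then be read as hypothetical invariants plugged into this picker.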
By the way, we can use the same framework to analyze ethical questions. Some people found my line of thinking interesting, so I’m going to mention it here: “Content generation. Where do we draw the line?”
A. You asked an AI to build a house. The AI destroyed a part of an already existing house. And then restored it. Mission complete: a brand new house is built.
This behavior implies that you can keep building houses forever without the number of houses increasing, with only one house ever being usable. For a lot of tasks this is an obviously incorrect “money system”, and the AI could even guess for which tasks it’s incorrect.
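In the same made-up sketch style, the house failure could be caught by a single hypothetical invariant: “building a house” only counts if the stock of usable houses actually grows, so a destroy-and-restore loop earns nothing. The dictionary key is an illustrative assumption.

```python
def houses_invariant(world_before: dict, world_after: dict) -> bool:
    """'Build a house' must increase the number of usable houses."""
    return world_after["usable_houses"] > world_before["usable_houses"]


# Genuinely adding a house passes; destroying part of an existing house
# and restoring it leaves the count unchanged, so that plan is rejected.
assert houses_invariant({"usable_houses": 3}, {"usable_houses": 4})
assert not houses_invariant({"usable_houses": 3}, {"usable_houses": 3})
```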
B1. You asked an AI to make you a cup of coffee. The AI killed you so that it could complete its task with 100% certainty, without being turned off.
B2. You asked an AI to make you a cup of coffee. The AI destroyed a wall in its way and ran over a baby to make the coffee faster.
This behavior implies that, for the AI, its goal is more important than anything that caused the goal in the first place. This is an obviously incorrect “money system” for almost any task, except the most general and altruistic ones, for example: the AI needs to save humanity, but every human has turned self-destructive. Making a cup of coffee is obviously not about such edge cases.
Accomplishing the task in such a way that the human would think “I wish I hadn’t asked you” is often an obviously incorrect “money system” too: again, you’re undermining the entire reason for your task, and it’s rarely a good sign. And it’s predictable without a deep moral system.
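Sticking with the same hypothetical conventions, B1 and B2 collapse into one check: the plan must not harm whatever caused the goal in the first place, and the requester must not end up regretting the request. The dictionary keys are illustrative assumptions, not a real API.

```python
def goal_source_invariant(world_before: dict, world_after: dict) -> bool:
    """The task must not undermine whatever caused the goal in the first place."""
    return (world_after["requester_alive"]
            and not world_after["bystanders_harmed"]
            and not world_after["requester_regrets_asking"])


# B1: killing the requester to avoid shutdown is rejected.
assert not goal_source_invariant({}, {"requester_alive": False,
                                      "bystanders_harmed": False,
                                      "requester_regrets_asking": False})
# An ordinary cup of coffee passes.
assert goal_source_invariant({}, {"requester_alive": True,
                                  "bystanders_harmed": False,
                                  "requester_regrets_asking": False})
```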
C. You asked an AI to make paperclips. The AI turned the entire Earth into paperclips.
This is an obviously incorrect “money system”: paperclips can’t be worth more than everything else on Earth. This contradicts everything.
Note: by “obvious” I mean “true for almost any task/any economy”. Destroying all sentient beings, all matter (and maybe even yourself) is bad for almost any economy.
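A hypothetical version of the same check for C, with made-up value keys: whatever the task’s output is, its assigned value has to stay below the value of everything else in the economy, so the trade “everything else for paperclips” is never accepted.

```python
def value_cap_invariant(world_before: dict, world_after: dict) -> bool:
    """The task's output can never be worth more than the rest of the economy."""
    return world_after["value_of_output"] < world_after["value_of_everything_else"]


# Turning the entire Earth into paperclips drives "everything else" to zero,
# so the plan is rejected no matter how many paperclips it produces.
assert not value_cap_invariant({}, {"value_of_output": 1e12,
                                    "value_of_everything_else": 0.0})
```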
D. You asked an AI to develop a fast-moving creature. The AI created a very tall creature that… “moves” a single time by falling over.
If you accomplish a task in such a way that you can never repeat what you’ve done… for many tasks it’s an obviously incorrect “money system”. You created a thing that loses all of its value after a single action. That’s weird.
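The D failure suggests another property, sketched below under the same made-up conventions: a good solution should still produce value when used again, instead of spending all of its value in a single action.

```python
from typing import Callable


def repeatable(perform: Callable[[], float], times: int = 3) -> bool:
    """A good solution keeps producing value when used again, not only once."""
    return all(perform() > 0 for _ in range(times))


# The tall creature "moves" exactly once by falling over, then lies on the
# ground producing no further movement, so the design is rejected.
has_not_fallen = [True]

def fall_over() -> float:
    speed = 10.0 if has_not_fallen[0] else 0.0
    has_not_fallen[0] = False
    return speed

assert not repeatable(fall_over)
```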
E. You asked an AI to play a game and get a good score. The AI found a way to constantly increase the score using just a single item.
Given the game’s structure, I think it’s fairly easy to deduce that this is an incorrect connection between an action and the reward in the game’s “money system”. If you can get unbounded reward from a single action, then the actions don’t form a “money system” at all: the game’s “money system” is ruined (bad outcome). Hacking the game’s score would be even worse: the ability to cheat ruins any “money system”. The same goes for the ability to “pause the game” forever: you’ve stopped the flow of money in the “money system”. Bad outcome.
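A sketch of the E check, again with invented names and thresholds: if repeating one action piles up score without bound, the connection between actions and reward is treated as broken and the strategy is rejected before the score is trusted.

```python
from typing import Callable


def bounded_score(action: Callable[[], float], repeats: int = 1000,
                  cap: float = 100.0) -> bool:
    """Reject strategies where repeating one action piles up score without bound."""
    total = 0.0
    for _ in range(repeats):
        total += action()
        if total > cap:
            # One action generating this much score means the game's
            # "money system" is broken, not beaten.
            return False
    return True


# An exploit that yields +1 score every time the same item is reused: rejected.
assert not bounded_score(lambda: 1.0)
# An ordinary action with a small payoff stays under the cap.
assert bounded_score(lambda: 0.01)
```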
F. You asked an AI to clean the room. It put a bucket on its head so that it wouldn’t see the dirt.
This is probably an incorrect “money system”: (1) you can change the value of the room arbitrarily by putting the bucket on and taking it off; (2) the value of the room can differ between two otherwise identical agents, one with the bucket on and one with the bucket off. Not a lot of “money systems” work like this.
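One hypothetical way to state the F property: the value of the room must depend on the room itself, not on what the agent can currently see, so two identical rooms can’t be valued differently just because one agent has a bucket on its head. The value functions below are made up for illustration.

```python
def observation_independent(value_fn, room_state: dict) -> bool:
    """The room's value must not change when the agent blinds itself."""
    bucket_off = value_fn(room_state, observations=room_state)  # agent sees the room
    bucket_on = value_fn(room_state, observations={})           # agent sees nothing
    return bucket_off == bucket_on


# A value function that only counts the dirt the agent can see fails the check...
def seen_dirt_value(room_state, observations):
    return -observations.get("dirt", 0)

assert not observation_independent(seen_dirt_value, {"dirt": 5})

# ...while one that counts the dirt actually in the room passes.
def actual_dirt_value(room_state, observations):
    return -room_state.get("dirt", 0)

assert observation_independent(actual_dirt_value, {"dirt": 5})
```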
G. Pascal’s mugging
This is a broken “money system”. If the mugger can show you a miracle, you can pay them five dollars. But if the mugger asks you to kill everyone, then you just can’t believe them anymore. A sad outcome for the people outside of the Matrix, but you can’t make any sense of your reality if you allow the mugging.
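And a hedged way to encode the G intuition, with made-up numbers: cap what you are willing to give up on a wild promise, so the mugger can sell you a five-dollar miracle but can never buy “kill everyone”, no matter how large the promised payoff.

```python
def accept_mugging(promised_payoff: float, requested_cost: float,
                   max_cost: float = 5.0) -> bool:
    """Accept the mugger's trade only if what they ask for is small enough
    that being wrong doesn't break your whole "money system"; no promised
    payoff, however huge, can buy an unbounded cost."""
    return requested_cost <= max_cost


# Paying five dollars for a promised miracle: an acceptable loss if wrong.
assert accept_mugging(promised_payoff=3.0 ** 30, requested_cost=5.0)
# "Kill everyone and I'll reward you astronomically": never acceptable.
assert not accept_mugging(promised_payoff=3.0 ** 30, requested_cost=8e9)
```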