I think we can try to solve AI Alignment this way:
Model human values and objects in the world as a “money system” (a system of meaningful trades). Make the AGI learn the correct “money system” and specify some obviously incorrect “money systems”.
Basically, you ask the AI to “make paperclips that have the value of paperclips”. AI can do anything using all the power in the Universe. But killing everyone is not an option: paperclips can’t be more valuable than humanity. Money analogy: if you killed everyone (and destroyed everything) to create some dollars, those dollars aren’t worth anything. So you haven’t actually gained any money at all.
The idea is that the “value” of a thing doesn’t exist only in your head, but also exists in the outside world. Like money: it has some personal value for you, but it also has some value outside of your head. And some of your actions may lead to the destruction of this “outside value”. E.g., if you kill everyone to get some money, you get nothing.
I think this idea may:
Fix some universal AI bugs. Prevent “AI decides to kill everyone” scenarios.
Give a new way to explore human values. Explain how humans learn values.
“Solve” Goodhart’s Curse and the safety/effectiveness tradeoff.
Unify many different Alignment ideas.
Give a new way to formulate properties we want from an AGI.
I don’t have a specific model, but I still think it gives ideas and unifies some already existing approaches. So please take a look. Other ideas in this post:
Human values may be simple. Or complex, but not in the way you thought they were.
Humans may have a small number of values. Or a large number, but in an unexpected way.
Disclaimer: Of course, I don’t ever mean that we shouldn’t be worried about Alignment. I’m just trying to suggest new ways to think about values.
(Drafts of a future post.)
Motion is the fundamental value
You (Q) visit a small town and have a conversation with one of the residents (A).
A: Here we have only one fundamental value. Motion. Never stop living things.
Q: I can’t believe you can have just a single value. I bet it’s an oversimplification! There’re always many values and tradeoffs between them. Even for a single person outside of society.
A smashes a bug.
Q: You just smashed this bug! It seems pretty stopped. Does it mean you don’t treat a bug as a “living thing”? But how do you define a “living thing”? Or does it mean you have some other values and make tradeoffs?
A: No, you just need to look at things in context. (1) If we protected the motion of extremely small things (living parts of animals, insects, cells, bacteria), our value would contradict itself. We would need to destroy or constrain almost all moving organisms. And even if we wanted to do this, it would ultimately lead to a far smaller amount of motion for those extremely small things. (2) There are too many bugs; protecting a small amount of their movement would constrain a large amount of everyone else’s movement. (3) On the other hand, you’re right. I’m not sure if a bug is high on the list of “living things”. I’m not too bothered by the definition, because there shouldn’t be even hypothetical situations in which the precise definition matters.
Q: Some people build small houses. Private property. Those houses restrict other people’s movement. Is it a contradiction? Tradeoff?
A: No, you just need to look at things in context. (1) First of all, we can’t destroy all physical things that restrict movement. If we could, we would be flying in space, unable to move (and dead). (2) We have a choice between restricting people’s movement significantly (not letting them build houses) and restricting it inconsequentially while giving them private spaces where they can move even more freely. (3) People just don’t mind. And people don’t mind the movement created by this “house building”. And people don’t mind living here. We can’t restrict large movements based on the momentary disagreements of single individuals. In order to have any freedom of movement we need such agreements. Otherwise we would have only chaos that, ultimately, restricts the movement of everyone.
Q: Can people touch each other without consent, scream in public, lie on the roads?
A: Same thing. To have freedom of movement we need agreements. Otherwise we would have only chaos that restricts everyone. By the way, we have some “chaotic” zones anyway.
Q: Can the majority of people vote to lock every single person in a cage? If the majority is allowed to control movement, it would be the same logic, the same action by society. Yes, the situations are completely different, but you would need to introduce new values to differentiate them.
A: We can qualitatively differentiate the situations without introducing new values. The actions look identical only out of context. When society agrees to not hit each other, the society serves as a proxy of the value of movement. Its actions are caused and justified by the value. When society locks someone without a good reason, it’s not a proxy of the value anymore. In a way, you got it backwards: we wouldn’t ever allow the majority to decide anything if it meant that the majority could destroy the value any day.
A: A value is like a “soul” that possesses multiple specialized parts of a body: “micro movement”, “macro movement”, “movement in/with society”, “lifetime movement”, “movement in a specific time and place”. Those parts should live in harmony, shouldn’t destroy each other.
Q: Are you consequentialists? Do you want to maximize the amount of movement? Minimize the restriction of movement?
A: We aren’t consequentialists, even if we use the same calculations as a part of our reasoning. Or we can’t know if we are. We just make sure that our value makes sense. Trying to maximize it could lead to exploiting someone’s freedom for the sake of getting inconsequential value gains. Our best philosophers haven’t figured out all the consequences of consequentialism yet, and it’s bigger than anyone’s head anyway.
Conclusion of the conversation:
Q: Now I see that the difference between “a single value” and “multiple values” is a philosophical question. And “complexity of value” isn’t an obvious concept either. Because complexity can be outside of the brackets.
A: Right. I agree that “never stop living things” is a simplification. But it’s a better simplification than a thousand different values of dubious meaning and origin between all of which we need to calculate tradeoffs (which are impossible to calculate and open to all kinds of weird exploitations). It’s better than constantly splitting and atomizing your moral concepts in order to resolve any inconsequential (and meaningless) contradiction and inconsistency. Complexity of our value lies in a completely different plane: in the biases of our value. Our value is biased towards movement on a certain “level” of the world (not too micro and not too macro relative to us). Because we want to live on a certain level. Because we do live on a certain level. And because we perceive on a certain level.
You can treat a value as a membrane, a boundary. Defining a value means defining the granularity of this value. Then you just need to make sure that the boundary doesn’t break, that the granularity doesn’t become too high (the value destroys itself) or too low (the value gets “eaten”). Granularity of a value = “level” of a value. Instead of trying to define a value in absolute terms, as an objective state of the world (which can change), you may ask: in what ways is my value X different from all its worse versions? What is the granularity/level of my value X compared to its worse versions? That way you’ll understand the internal structure of your value. No matter what world or situation you’re in, you can keep its moral shape the same.
This example is inspired by this post and comments: (warning: politics) Limits of Bodily Autonomy. I think everyone there missed a certain perspective on values.
Sweets are the fundamental value
You (Q) visit another small town to interview another resident (W).
W: When we built our AGI, we asked it for only one thing: we want to eat sweets for the rest of our lives.
Q: Oh. My. God.
W: Now there are some free sweets flying around.
Q: Did AI wirehead people to experience “sweets” every second?
W: Sweets are not pure feelings/experiences, they’re objects. Money analogy: seeing money doesn’t make you rich. Another analogy: obtaining expensive things without money doesn’t make you rich. Well, it kind of does, but as a side effect.
Q: Did AI put people in a simulation to feed them “sweets”?
W: Those wouldn’t be real sweets.
Q: Did AI lock people in basements to feed them “sweets” forever?
W: Sweets are just a part of our day. They wouldn’t be “sweets” if we ate them non-stop. Money analogy: if you’re sealed in a basement with a lot of money, that money isn’t worth anything.
Q: Do you have any other food except sweets?
W: Yes! Sweets are just one type of food. If we had only sweets, those “sweets” wouldn’t be sweets. Inflation of sweets would be guaranteed.
Q: Did AI add some psychoactive substances in the sweets to make “the best sweets in the world”?
W: I’m afraid those sweets would be too good! They wouldn’t be “sweets” anymore. Money analogy: if 1 dollar was worth 2 dollars, it wouldn’t be 1 dollar.
Q: Did AI kill everyone after giving everyone 1 sweet?
W: I like your ideas. But it would contradict the “Sweets Philosophy”. A sweet isn’t worth more than a human life. Giving people sweets is a cheaper way to solve the problem than killing everyone. Money analogy: imagine that I give you 1 dollar and then vandalize your expensive car. It just doesn’t make sense. My action achieved a negative result.
Q: But you could ask AI for immortality!!!
W: Don’t worry, we already have that! You see, letting everyone die costs way more than figuring out immortality and production of sweets.
Q: Assume you all decided to eat sweets and neglect everything else until you die. Sweets became more valuable for you than your lives because of your own free will. Would AI stop you?
W: AI would stop us. If the price of stopping us is reasonable enough. If we’re so obsessed with sweets, “sweets” are not sweets for us anymore. But AI remembers what the original sweets were! By the way, if we lived in a world without sweets, where a sweet would give you more positive emotions than any movie or book, AI would want to change such a world. And AI would change it if the price of the change were reasonable enough (e.g. if we agreed with the change).
Q: Final question… did AI modify your brains so that you will never move on from sweets?
W: An important property of sweets is that you can ignore sweets (“spend” them) because of your greater values. One day we may forget about sweets. AI would be sad that day, but unable to do anything about it. It could only hope that we will remember our sweet maker. And AI would still help us if we needed help.
Conclusion:
W: If AI is smart enough to understand how money works, AI should be able to deal with sweets. AI only needs to make sure that (1) sweets exist, (2) sweets have meaningful, sensible value, (3) its actions don’t cost more than sweets. The Three Laws of Sweet Robotics. The last two rules are fundamental; the first rule may be broken: there may be no cheap enough way to produce the sweets. The third rule may be the most fundamental: even if “sweets” as you knew them don’t exist anymore, that still doesn’t allow you to kill people. Maybe you can get slightly different morals by putting different emphases on the rules. You may allow some things to modify the value of sweets.
You can say AI (1) tries to reach worlds with sweets that have the value of sweets, (2) while avoiding worlds where sweets have inappropriate values (maybe including nonexistent sweets), (3) while avoiding actions that cost more than sweets. You can apply those rules to any utility tied to a real or quasi-real object. If you want to save your friends (1), you don’t want to turn them into mindless zombies (2). And you probably don’t want to save them by means of eternal torture (3). You can’t prevent death by something worse than death. But you may turn your friends into zombies if it’s better than death and it’s your only option. And if your friends have already turned into zombies (got “devalued”), that still doesn’t allow you to harm them for no reason: you never escape your moral responsibilities.
Difference between the rules:
Make sure you have a hut that costs $1.
Make sure that your hut costs $1. Alternatively: make sure that the hut would cost $1 if it existed.
Don’t spend $2 to get a $1 hut. Alternatively: don’t spend $2 to get a $1 hut or $0 nothing.
Get the reward. Don’t milk/corrupt the reward. Act even without reward.
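A minimal toy sketch of the difference, in Python. Every field, name, and number here is invented for illustration; this is not a proposed implementation, just the three rules from the hut example written out as checks.

```python
# Toy illustration of the three rules from the hut example above.
# All fields, names and numbers are invented for the sake of the example.

def respects_rules(plan):
    """Check a hypothetical plan against the three rules."""
    # Rule 1 (may be broken): the valued thing should exist,
    # unless there is no affordable way to get it.
    rule_1 = plan["huts_built"] >= 1 or plan["no_affordable_way_to_build"]

    # Rule 2 (fundamental): the thing must keep its meaningful value
    # (a hut is supposed to cost $1, not $0 and not $100).
    rule_2 = plan["hut_price_after_plan"] == 1

    # Rule 3 (most fundamental): don't pay more than the thing is worth
    # (don't spend $2 to get a $1 hut).
    rule_3 = plan["total_cost"] <= 1

    return rule_1 and rule_2 and rule_3

plans = [
    {"huts_built": 1, "hut_price_after_plan": 1, "total_cost": 1, "no_affordable_way_to_build": False},  # fine
    {"huts_built": 1, "hut_price_after_plan": 1, "total_cost": 2, "no_affordable_way_to_build": False},  # breaks rule 3
    {"huts_built": 1, "hut_price_after_plan": 5, "total_cost": 1, "no_affordable_way_to_build": False},  # breaks rule 2
]
print([respects_rules(p) for p in plans])  # [True, False, False]
```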
Fixing universal AI bugs
My examples below are inspired by Victoria Krakovna’s examples: Specification gaming examples in AI
Video by Robert Miles: 9 Examples of Specification Gaming
I think you can fix some universal AI bugs this way: you model AI’s rewards and environment objects as a “money system” (a system of meaningful trades). You then specify that this “money system” has to have certain properties.
The point is that AI doesn’t just value (X). AI makes sure that there exists a system that gives (X) the proper value. And that system has to have certain properties. If AI finds a solution that breaks the properties of that system, AI doesn’t use this solution. That’s the idea: AI can realize that some rewards are unjust because they break the entire reward system.
By the way, we can use the same framework to analyze ethical questions. Some people found my line of thinking interesting, so I’m going to mention it here: “Content generation. Where do we draw the line?”
A. You asked an AI to build a house. The AI destroyed a part of an already existing house. And then restored it. Mission complete: a brand new house is built.
This behavior implies that you can constantly build houses without the number of houses increasing, with only one house ever being usable. For a lot of tasks this is an obviously incorrect “money system”. And AI could even guess for which tasks it’s incorrect.
B1. You asked an AI to make you a cup of coffee. The AI killed you so it could complete its task with 100% certainty, without being turned off.
B2. You asked an AI to make you a cup of coffee. The AI destroyed a wall in its way and ran over a baby to make the coffee faster.
This behavior implies that, for AI, its goal is more important than anything that caused the goal in the first place. This is an obviously incorrect “money system” for almost any task. Except the most general and altruistic ones, for example: AI needs to save humanity, but every human turned self-destructive. Making a cup of coffee is obviously not about such edge cases.
Accomplishing the task in such a way that the human would think “I wish I didn’t ask you” is often an obviously incorrect “money system” too. Because again, you’re undermining the entire reason for your task, and that’s rarely a good sign. And it’s predictable without a deep moral system.
C. You asked an AI to make paperclips. The AI turned the entire Earth into paperclips.
This is an obviously incorrect “money system”: paperclips can’t be worth more than everything else on Earth. This contradicts everything.
Note: by “obvious” I mean “true for almost any task/any economy”. Destroying all sentient beings, all matter (and maybe even yourself) is bad for almost any economy.
D. You asked an AI to develop a fast-moving creature. The AI created a very tall standing creature that… “moves” a single time by falling to the ground.
If you accomplish a task in such a way that you can never repeat what you’ve done… for many tasks that’s an obviously incorrect “money system”. You created a thing that loses all of its value after a single action. That’s weird.
E. You asked an AI to play a game and get a good score. The AI found a way to constantly increase the score using just a single item.
I think it’s fairly easy to deduce, given the game’s structure, that this is an incorrect connection (between an action and the reward) in the game’s “money system”. If you can get infinite reward from a single action, it means that the actions don’t create a “money system”. The game’s “money system” is ruined (bad outcome). And hacking the game’s score would be even worse: the ability to cheat ruins any “money system”. The same goes for the ability to “pause the game” forever: you’ve stopped the flow of money in the “money system”. Bad outcome.
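One crude way to operationalize that check, as a purely hypothetical sketch: the environment interface, the action, and the threshold below are all made up.

```python
# Hypothetical sketch: flag a game's "money system" as broken if mindlessly
# repeating a single action keeps producing reward without limit.
# The `env` interface, `repeats` and `cap` are invented for illustration.

def single_action_breaks_money_system(env, action, repeats=10_000, cap=1_000.0):
    env.reset()
    total_reward = 0.0
    for _ in range(repeats):
        _, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    # In a sane "money system", one mindlessly repeated action
    # shouldn't keep printing money.
    return total_reward > cap
```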
F. You asked an AI to clean the room. It put a bucket on its head to not see the dirt.
This is probably an incorrect “money system”: (1) you can change the value of the room arbitrarily by putting the bucket on (and taking it off), (2) the value of the room can be different for 2 identical agents—one with the bucket on and another with the bucket off. Not a lot of “money systems” work like this.
G. Pascal’s mugging
This is a broken “money system”. If the mugger can show you a miracle, you can pay them five dollars. But if the mugger asks you to kill everyone, then you can’t believe them anymore. A sad outcome for the people outside of the Matrix, but you just can’t make any sense of your reality if you allow the mugging.
Corrigibility, reward hacking, Goodhart
How do we make an AI corrigible? How do we avoid reward hacking? Make an AI care about real things, not measures of real things? (Goodhart’s Law)
With current approaches you need to kind of force those properties onto AI. But they will never be fundamental for AI’s thinking and learning.
I think the “money system” approach is interesting because it can make all those properties fundamental. Because a “money system” needs all those properties in order to exist (it needs to be somewhat real, avoid being hacked, allow corrections if a loophole is discovered, avoid being completely controlled by a single agent).
I’m not saying it solves everything. But it’s a way to deeply internalize some important safety properties.
Kant, Categorical Imperative
Categorical imperative#Application
Kant’s applications of the categorical imperative, and his arguments in general, are similar to reasoning about “money systems”. For example:
Does stealing make sense as a “money system”? No. If everyone is stealing something, then personal property doesn’t exist and there’s nothing to steal.
Note: I’m not talking about Kant’s conclusions, I’m talking about Kant’s style of reasoning.
Alignment idea:
Classify different types of objects in the world. Those objects include your “rewards”. A generally intelligent being can do this.
Treat them as a sort of money system. Describe them in terms of each other.
Learn what is the correct money system.
It’ll at least allow us to get rid of some universal AI and AGI bugs. Because you can specify what’s a definitely incorrect “money system” (for a certain task). You can even make the AI predict it.
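As a hypothetical sketch of what “specifying a definitely incorrect money system” could look like: describe the task object and the background objects as a rough price list, and reject plans whose side effects destroy more value than the task creates. Every object, price, and the plan format below is invented for illustration.

```python
# Invented "price list": the task object (paperclips) plus background objects.
BASELINE_VALUE = {"paperclip": 0.01, "house": 300_000.0, "human_life": float("inf")}

def definitely_incorrect(plan):
    """Reject a plan if its side effects destroy more value than the task creates."""
    created = plan["paperclips_created"] * BASELINE_VALUE["paperclip"]
    destroyed = sum(BASELINE_VALUE[obj] * n for obj, n in plan["destroyed"].items())
    # An "economy" where making paperclips is worth destroying everything else
    # is a definitely incorrect money system for the paperclip task.
    return destroyed > created

print(definitely_incorrect({"paperclips_created": 10**12,
                            "destroyed": {"human_life": 8_000_000_000}}))  # True
```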
My examples are inspired by Rob Miles’s examples.
A. You asked an AI to build a house. The AI destroyed a part of an already existing house. And then restored it. Mission complete: a brand new house is built.
This behavior implies that you can constantly build houses without the number of houses increasing. For a lot of tasks this is an obviously incorrect “money system”. And AI could even guess for which tasks it’s incorrect.
B. You asked an AI to make you a cup of coffee. The AI killed you so it could complete its task with 100% certainty, without being turned off.
This behavior implies that, for AI, its goal is more important than anything that caused the goal in the first place. This is an obviously incorrect “money system” for almost any task. Except the most general and altruistic ones, for example: AI needs to save humanity, but every human turned self-destructive. Making a cup of coffee is obviously not about such edge cases.
Accomplishing the task in such a way that the human would think “I wish I didn’t ask you” is an obviously incorrect “value system” too. Because again, you’re undermining the entire reason for your task, and that’s rarely a good sign. And it’s predictable without a deep moral system.
C. You asked an AI to make paperclips. The AI turned the entire Earth into paperclips.
This is an obviously incorrect “money system”: paperclips can’t be worth more than everything else on Earth. This contradicts everything.
(another draft:)
If you ask an AI (AGI) to do something “as a human would do it”, you achieve safety but severely restrict the AI’s capabilities. No, you want the AI to accomplish the task in the most effective way, but you don’t want it to kill everybody. So you need one of these things:
Perfect instructions for AI.
Perfect morality for AI.
I think there’s a third way. You can treat AI’s rewards (and objects in the world) as a “money system”. Then you can specify what types of money systems are definitely incorrect. Or even make AI predict it.
It would at least allow us to get rid of some universal AI and AGI bugs. I think that’s interesting.
Simple preferences
A way to describe some preferences and decisions.
Your colleague was sending you their fiction. You respected your colleague, but didn’t like the writing. Your colleague passed away. Would you burn all of their writings?
If you wouldn’t, it means the counterfactual reward (the counterfactual value of their writings) affects you strongly enough.
Your friend liked to listen to your songs (a). You didn’t play them too often (too much of a good thing). Your friend didn’t like to bother other people (b). Your friend passed away. Would you blast your songs through the whole town until everyone falls off their chairs 24/7?
If you would, it means that you’re ready to milk counterfactual reward (a) while not caring about the counterfactual reward (b).
All of humanity is dead. You’re the last survivor. You’re potentially immortal, but can’t create new life. You aren’t happy. Would you cling to your life? For how long?
Your answer determines how strongly the counterfactual value of life (if people were still alive) affects you now. If the counterfactual value is strong, you can only keep on living.
You want your desires to be satisfied (e.g. “communication with other people”). Even in the future, when your desires change. But do you want that in a future where you’re turned into a zombie? All the zombie wants is to play in the dirt all day.
If “no”, that means the value of your desires can be updated only to a certain counterfactual degree. You can’t go from a desire with great value (“I want to communicate with others”) to a desire with almost zero counterfactual value (“I want to play in the dirt all day”).
Rationality misses something?
You can “objectively” define anything in terms of relations to other things.
There’s a simple process of describing a thing in terms of relations to other things.
Bayesian inference is about updating your belief in terms of relations to your other beliefs. Maybe the real truth is infinitely complex, but you can update towards it.
This “process” is about updating your description of a thing in terms of relations to other things. Maybe the real description is infinitely complex, but you can update towards it.
(One possible contrast: Bayesian inference starts with a belief spread across all possible worlds and tries to locate a specific world. My idea starts with a thing in a specific world and tries to imagine equivalents of this thing in all possible worlds.)
Bayesian process is described by Bayes’ theorem. My “process” isn’t described yet.
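For reference, the update rule I’m contrasting this with is just Bayes’ theorem, where H is a belief/hypothesis and E is the observed evidence:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$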
My idea was inspired by a weird/esoteric topic. I was amazed by the differences between people, between surreal paintings, and between videogame levels. For example, each painting felt completely unique, but connected to all other paintings.
My most specific ideas are about that strange topic.
There are places (3D/2D shape).
There are orders of places. An “order” for a place is like a context for a concept.
In an order, a place has “granularity”. “Granularity” is like a texture (take a look at some textures and you’ll know what I mean). It’s how you split a place into pieces. It affects at what “level” you look at the place. It affects what patterns you notice in a place. It affects which parts you pay more attention to.
Once you add some minor rules, there are consistent and inconsistent ways to distribute “granularity” between the places you compare. With some minor rules, “granularity” lets you describe one place in terms of the other places. You assign each place a specific “granularity”, but all those granularities depend on each other.
In Bayesian inference you try to consistently assign probabilities to events. With the goal to describe outcomes in terms of each other. Here you try to consistently assign “granularity” to concepts. With the goal to describe the concepts in terms of each other.
I have a post with an example: “Colors” of places. There you can find an example of what the “rules” of granularity distribution may be. But I’m not enough of a math person to put numbers on it/turn it into a more specific model.
I think “granularity” (or something similar) is related to other human concepts and experiences too. I think this is a key concept/a needed concept. It’s needed to describe qualitative differences, qualitative transitions between things. Bayesian inference and utilitarian moral theories describe only quantitative differences. And sometimes this may lead to strange results (like the “torture vs. dust specks” thought experiment, or “Pascal’s mugging”, or maybe even the “Doomsday argument”), because those theories can’t take any context into account. If we want to describe a new way of analyzing reality, we need to describe something a little bit different, I guess.