I often think that the idea that “humans have values” is wrong. Humans don’t “have” values. They are boxes where different values appear, reach their goals, and dissolve.
I have had countless different values during my life, and they often contradict each other. There is something like a democracy of values in the human mind, where different values affect my behaviour according to some form of interaction between them. Sometimes it is a dictatorship.
But if we look at a human as a box for values, that still creates some preferred set of values. One is the need to preserve the box, that is, survival (and life extension). Another, less obvious one, is preventing the dictatorship of any single value.
This is a set of meta-values which help the different values to thrive and interact: values which come from the social environment, from the books I read, from biological drives, and from personal choices.
This is correct. In fact, it is common on LW to use the word “agent” to mean something that rigidly pursues a single goal as though it were infinitely important. The title of this post uses it this way. But no agents exist, in this sense, and no agents should exist. We are not agents and should not want to be, in that way.
On the other hand, this is a bad way to use the word “agent”, since it is better to just use it for humans as they are.
That’s why I used the “(idealised) agent” description (but titles need to be punchier).
Though I think “simple goal” is incorrect. The goal can be extremely complex, much more complex than human preferences. There’s no limit to the subtleties you can pack into a utility function. There is a utility function that will perfectly fit every decision you make in your entire life, for example.
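A quick way to see the “a utility function fits any behaviour” point is the trivial construction sketched below: given any finite record of (situation, choice) pairs, define a function that scores each observed choice above every alternative, so the whole record comes out “perfectly rational” under that function. This is only an illustrative sketch of the general argument; the names (fitted_utility, decision_record) and the toy example are mine, not from the discussion.

```python
# Hedged sketch: rationalise any finite decision history after the fact.
def fitted_utility(decision_record):
    """decision_record: list of (situation, chosen_action) pairs."""
    chosen = set(decision_record)

    def utility(situation, action):
        # Score exactly the actions that were actually taken above everything else.
        return 1.0 if (situation, action) in chosen else 0.0

    return utility

# Toy example: two past decisions, trivially "explained" by the fitted function.
history = [("monday", "tea"), ("tuesday", "coffee")]
u = fitted_utility(history)
assert u("monday", "tea") > u("monday", "coffee")
```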
The reason to look for an idealised agent, though, is that a utility function is stable in a way that humans are not. If there is some stable utility function that encompasses human preferences (it might be something like “this is the range of human preferences” or similar) then, if given to an AI, the AI will not seek to transform humans into something else in order to satisfy our “preferences”.
The AI has to be something of an agent, so its model of human preferences has to be an agent-ish model.
“There is a utility function that will perfectly fit every decision you make in your entire life, for example.”
Sure, but I don’t care about that. If two years from now a random glitch causes me to do something a bit different, which means that my full set of actions matches some slightly different utility function, I will not care at all.
Is that really the standard definition of agent though? Most textbooks I’ve seen talk of agents working towards the achievement of a goal, but say nothing about the permanence of that goal system. I would expect an “idealized agent” to always take actions that maximize the likelihood of achieving its goals, but that is orthogonal to whether the system of goals changes.
Then take my definition of agent in this post as “expected utility maximiser with a clear and distinct utility that is, in practice, Cartesianly separated from the rest of the universe”, and I’ll try and be clearer in subsequent posts.
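For concreteness, here is a minimal sketch of what that definition cashes out to: one fixed, self-contained utility function, and a choice rule that always picks the action with the highest expected utility. Everything in it (the function names, the toy outcomes and probabilities) is an illustrative assumption, not anything from the post.

```python
# Hedged sketch of an "expected utility maximiser" in the sense used above.
def choose_action(actions, outcome_probs, utility):
    """
    actions: list of candidate actions
    outcome_probs: dict mapping action -> {outcome: probability}
    utility: dict mapping outcome -> utility (fixed, separate from the world model)
    """
    def expected_utility(action):
        return sum(p * utility[o] for o, p in outcome_probs[action].items())

    # Always pick the action with the highest probability-weighted utility.
    return max(actions, key=expected_utility)

# Toy example: one fixed utility function drives every choice.
outcome_probs = {
    "act_a": {"good": 0.6, "bad": 0.4},
    "act_b": {"good": 0.3, "bad": 0.7},
}
utility = {"good": 1.0, "bad": 0.0}
print(choose_action(["act_a", "act_b"], outcome_probs, utility))  # -> "act_a"
```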
I think that any agent with a single, short goal is dangerous; such people are called “maniacs”. Addicts also have only one goal.
One way to try to create a “safe agent” is to give it a very long list of goals. A human being comes with a complex set of biological drives, and culture provides a complex set of values. This large set of values creates a context for any single value or action.
So replace the paperclip-tiling AI with the yak-shaving AI? :-D
Not all complex values are safe. For example, the negation of human values is exactly as complex as human values but is the most dangerous set of values possible.
This is true, as long as you do not allow any consistent way of aggregating the list (and humans do not have a way to do that, which prevents them from being dangerous).
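To make the aggregation point concrete: as soon as you fix any consistent rule for combining the list (say, a weighted sum), the “very long list of goals” collapses back into one ordinary utility function, and you are back to a single-goal maximiser. A hypothetical sketch, with made-up goal names and weights:

```python
# Hedged sketch: a "long list of goals", each scoring a world-state, plus one
# consistent aggregation rule (a weighted sum). The aggregate is itself just a
# single utility function, which is the point being made above.
# Goal names and weights are invented for illustration only.
goals = {
    "preserve_the_box": lambda state: state.get("alive", 0.0),
    "avoid_dictatorship": lambda state: -max(state.get("value_weights", [0.0])),
    "curiosity": lambda state: state.get("novelty", 0.0),
}
weights = {"preserve_the_box": 0.5, "avoid_dictatorship": 0.3, "curiosity": 0.2}

def aggregate_utility(state):
    # One consistent aggregation rule turns the whole list into a single number.
    return sum(weights[name] * goal(state) for name, goal in goals.items())

print(aggregate_utility({"alive": 1.0, "value_weights": [0.4, 0.6], "novelty": 0.2}))
```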