Character alignment
When I introspect on why I do not kill more people, I find it has very little to do with my goals. Mostly, my goals would not be furthered by murder, and some goals, like staying out of prison, would be endangered. But if I took the option seriously, it wouldn’t surprise me if murder occasionally resulted in very nice payoffs.
I also don’t think it has much to do with my values, because frankly, I am not sure I have any. Or maybe I do, but I certainly never think about them. I have moral intuitions, and I solve moral quandaries mostly by finding analogies that trigger a clear-cut moral intuition about what to do.
Sure, if somebody asked me whether my values include “freedom”, “happiness” and “chocolate ice cream”, I’d say yes, but it doesn’t seem very helpful to frame my reasons for doing the things I do in terms of values—except maybe for propaganda purposes.
I think the best causal explanation for why I don’t tend to kill people lies in my character. I am just not a particularly violent person.
But maybe that’s all the same? Maybe not being a violent person is equivalent to having the value of “peacefulness”, and the goal of peaceful behavior.
I don’t think that is the case. I can easily imagine having the same goals and values and being a much more violent man. Being a more violent man would probably result in more violent acts and much begging for forgiveness afterwards.
So what is a character trait, if it is separate from goals and values and can occasionally dictate behavior that violates both? To me, many character traits make sense as priors over strategic behavior.
Neuroticism is the prior that the world is a dangerous place. Laziness is the prior that conserving energy is of great importance. Openness corresponds to the exploration-exploitation tradeoff, while conscientiousness is the bet that care and attention to detail will pay off.
This is interesting because the underlying concepts are simple and relevant. Danger, speed, energy conservation, diligence, conflict, cooperation, etc.: these are fundamental concepts in resource-constrained multi-agent environments. Any sufficiently advanced agent trained in such environments will have an internal representation of its best guess about the current level of danger, for instance.
If you encounter a sufficiently new and unexplored environment, it can take quite some time to find out exactly how dangerous it is. In the meantime, an agent will fall back on a prior on “dangerousness” shaped by the environments it has previously encountered, and will therefore exhibit the character trait “neuroticism” to a greater or lesser degree.
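To make the “prior” framing concrete, here is a minimal toy sketch (my own illustration with made-up numbers, not anything from the literature): two agents track the probability that the current environment is dangerous with a simple Beta-Bernoulli model, and the one whose prior was shaped by harsh past environments keeps acting “neurotically” long after it has entered a perfectly safe one.

```python
# Toy model: "neuroticism" as a strong prior that the world is dangerous.
# The agent tracks P(a step is dangerous) with a Beta distribution.

class DangerBelief:
    def __init__(self, prior_danger: float, prior_strength: float):
        # prior_danger: prior mean P(danger); prior_strength: pseudo-count weight of the prior
        self.alpha = prior_danger * prior_strength        # pseudo-count of dangerous events
        self.beta = (1 - prior_danger) * prior_strength   # pseudo-count of safe events

    def update(self, observed_danger: bool):
        # Standard Beta-Bernoulli update from a single observation.
        if observed_danger:
            self.alpha += 1
        else:
            self.beta += 1

    def p_danger(self) -> float:
        return self.alpha / (self.alpha + self.beta)


neurotic = DangerBelief(prior_danger=0.8, prior_strength=50)  # shaped by harsh past environments
relaxed = DangerBelief(prior_danger=0.1, prior_strength=5)    # shaped by benign past environments

# Both agents now spend 20 steps in a perfectly safe new environment.
for _ in range(20):
    neurotic.update(observed_danger=False)
    relaxed.update(observed_danger=False)

print(f"neurotic agent still estimates P(danger) = {neurotic.p_danger():.2f}")  # ~0.57
print(f"relaxed agent estimates P(danger) = {relaxed.p_danger():.2f}")          # ~0.02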
Given that these concepts are simple and universal (they mostly drop directly out of the mathematics of game theory or optimization), they are probably very robustly learned. That makes it likely, to my mind, that it will be much easier to target “character traits” and push them in a particular direction than to do the same for values or goals.
Concretely, it might be much easier to push an agent to max out on cooperation over conflict, for example by directly manipulating one particular dimension of one particular internal representation, than to have it learn “human values” from human feedback or to point it at a very specific goal.
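For illustration only, here is a minimal sketch of what that kind of intervention could look like, in the spirit of activation steering. Everything in it is hypothetical: the model, the layer, and above all the assumption that a single linear direction in some hidden state tracks “cooperation vs. conflict”.

```python
# Hedged sketch: shifting a hidden representation along an assumed
# "cooperation" direction, without touching goals or reward.

import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 512

# Assumed: a unit vector found beforehand (e.g. by contrasting hidden states on
# cooperative vs. conflictual behavior) that represents the cooperation direction.
cooperation_direction = rng.normal(size=HIDDEN_DIM)
cooperation_direction /= np.linalg.norm(cooperation_direction)


def steer_hidden_state(hidden_state: np.ndarray, strength: float) -> np.ndarray:
    """Shift the hidden state along the cooperation direction.

    strength > 0 pushes the representation toward 'cooperation',
    strength < 0 toward 'conflict'; everything else is left untouched.
    """
    return hidden_state + strength * cooperation_direction


# Toy usage: a stand-in for one layer's activations during a forward pass.
hidden_state = rng.normal(size=HIDDEN_DIM)
steered = steer_hidden_state(hidden_state, strength=4.0)

before = hidden_state @ cooperation_direction
after = steered @ cooperation_direction
print(f"projection onto cooperation direction: {before:.2f} -> {after:.2f}")
```

The point of the sketch is only that the intervention is one scalar knob on one learned representation, which is a far smaller target than an entire value function or goal specification.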