Hmm. When you say “human terminal values don’t actually exist at the initial time,” what do you mean by “exist”? IMO, they exist in the sense that they are implicit in the algorithm the human brain is executing. They are causally prior to behavior, in the sense that the algorithm is causally prior to the output of the algorithm.
That is, they are implicit rather than explicit because, indeed, we can in principle interpret the same algorithm as a consequentialist in different, mutually inconsistent, ways. However, not all interpretations are born equal: some will be more natural, some more contrived. I expect that some sort of Occam’s razor should select the interpretations that we would accept as “correct”: otherwise, why is the concept of values meaningful at all?
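To make the Occam's razor point concrete, here is a toy sketch (mine, not from the discussion) of selecting among candidate interpretations of the same observed behavior by an MDL-style criterion: each interpretation decomposes the agent into a planner plus values, and simpler decompositions that still fit the behavior win. All names and the complexity proxy are illustrative assumptions.

```python
# Toy sketch: Occam-style selection among candidate (planner, values)
# interpretations of one observed policy. The scoring rule is an
# assumption for illustration, not a claim about the "correct" prior.
import math
from dataclasses import dataclass

@dataclass
class Interpretation:
    name: str
    planner_bits: float    # description length of the assumed planner
    values_bits: float     # description length of the assumed values
    log_likelihood: float  # how well it predicts the observed behavior

def score(i: Interpretation) -> float:
    # MDL-style score: total description length minus fit to behavior;
    # lower is better, so contrived interpretations are penalized.
    return (i.planner_bits + i.values_bits) * math.log(2) - i.log_likelihood

candidates = [
    Interpretation("noisy-rational agent, 'natural' values", 40, 60, -100.0),
    Interpretation("anti-rational agent, inverted values", 45, 60, -100.0),
    Interpretation("lookup table, degenerate values", 500, 5, -90.0),
]
best = min(candidates, key=score)
print(f"Selected interpretation: {best.name}")
```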
Indeed, if these values only appear at the end of some long reflection process, then why should I care about the outcome of this process? Unless I already possess the value of caring about this outcome, in which case we again conclude that the values already effectively exist at present.
(This feels at least partially like an argument about definitions but clarifying the definitions would probably be useful)
I think I was previously confusing terminal values with ambitious values, and am now not confusing them.
Ambitious values are about things like how the universe should be in the long run, and are coherent (e.g. they’re a utility function over physical universe states). Narrow values are about things like whether you’re currently having a nice time and being in control of your AI systems, and are not coherent. Ambitious and narrow values can be instrumental or terminal.
The human cognitive algorithm is causally prior to behavior. It is also causally prior to human ambitious values. But human ambitious values are not causally prior to human behavior. Making human preferences coherent can only be done through a reflection process, so ambitious values come at the end of this process and can’t go backwards in logical time to influence behavior.
I.e. algorithm → behavior, algorithm → ambitious values.
IRL (inverse reinforcement learning) assumes values → behavior, which is wrong in the case of ambitious values.
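As a minimal sketch of what the "values → behavior" assumption looks like in practice (my illustration, with hypothetical rewards and a made-up rationality parameter), standard IRL models behavior as softly optimal for some values and then infers whichever values best explain the observed actions:

```python
# Minimal sketch of IRL's values → behavior assumption: behavior is
# modeled as Boltzmann-rational in some values, and the values that
# best explain observed actions are inferred. All numbers are illustrative.
import numpy as np

def boltzmann_action_probs(q_values: np.ndarray, beta: float = 2.0) -> np.ndarray:
    # P(action | values) ∝ exp(beta * Q(action)): behavior is assumed
    # to be generated from the values -- the arrow IRL relies on.
    z = np.exp(beta * (q_values - q_values.max()))
    return z / z.sum()

def log_likelihood(q_values: np.ndarray, observed_actions: list[int]) -> float:
    probs = boltzmann_action_probs(q_values)
    return float(np.sum(np.log(probs[observed_actions])))

# Two hypothetical value hypotheses, expressed as action values.
value_hypotheses = {
    "values A": np.array([1.0, 0.2, 0.1]),
    "values B": np.array([0.1, 0.2, 1.0]),
}
observed = [0, 0, 1, 0]  # mostly action 0
best = max(value_hypotheses,
           key=lambda k: log_likelihood(value_hypotheses[k], observed))
print(f"IRL-style inference picks: {best}")
```

The point of contention is the direction of that arrow: the sketch treats values as the generator of behavior, whereas the claim above is that, for ambitious values, both behavior and values are downstream of the cognitive algorithm.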
Indeed, if these values only appear at the end of some long reflection process, then why should I care about the outcome of this process? Unless I already possess the value of caring about this outcome, in which case we again conclude that the values already effectively exist at present.
Caring about this reflection process seems like a narrow value.
See my comment here about why narrow value learning is hard.