I think I was previously confusing terminal values with ambitious values, and am no longer confusing them.
Ambitious values are about things like how the universe should be in the long run, and are coherent (e.g. they’re a utility function over physical universe states). Narrow values are about things like whether you’re currently having a nice time and being in control of your AI systems, and are not coherent. Ambitious and narrow values can be instrumental or terminal.
The human cognitive algorithm is causally prior to behavior. It is also causally prior to human ambitious values. But human ambitious values are not causally prior to human behavior. Making human preferences coherent can only be done through a reflection process, so ambitious values come at the end of this process and can’t go backwards in logical time to influence behavior.
I.e. algorithm → behavior, algorithm → ambitious values.
IRL (inverse reinforcement learning) assumes values → behavior, which is wrong in the case of ambitious values.
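To make the "values → behavior" assumption concrete: standard IRL posits a reward function as the cause of behavior (often via a Boltzmann-rational choice model) and then inverts that model to recover the reward. Here is a minimal sketch of that generative direction, in Python with hypothetical toy names; it's meant only to illustrate the assumed causal arrow, not any particular IRL implementation.

```python
# Toy sketch of the generative model IRL assumes: fixed "values" (a reward
# function) produce behavior via approximately-rational choice, and inference
# runs the arrow backwards from observed behavior to values.
import numpy as np

def boltzmann_policy(q_values: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """values -> behavior: softmax over action values in one state."""
    z = beta * (q_values - q_values.max())
    p = np.exp(z)
    return p / p.sum()

def irl_log_likelihood(reward_weights: np.ndarray,
                       state_action_features: np.ndarray,
                       observed_actions: np.ndarray,
                       beta: float = 1.0) -> float:
    """Score candidate 'values' by how well they predict observed behavior.

    state_action_features: (num_states, num_actions, num_features)
    observed_actions:      (num_states,) index of the action actually taken
    """
    # Toy, myopic stand-in: action values are linear in features.
    q = state_action_features @ reward_weights  # (num_states, num_actions)
    log_lik = 0.0
    for s, a in enumerate(observed_actions):
        log_lik += np.log(boltzmann_policy(q[s], beta)[a])
    return log_lik
```

The only point of the sketch is that the model's arrow runs reward → policy → observed actions: the values are treated as already existing and causally prior to behavior, which is exactly what the argument above denies for ambitious values.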
Indeed, if these values only appear at the end of some long reflection process, then why should I care about the outcome of this process? Unless I already possess the value of caring about this outcome, in which case we again conclude that the values already effectively exist at present.
Caring about this reflection process seems like a narrow value.
See my comment here about why narrow value learning is hard.