Finally, the human expresses a judgement about the states of M, mentally categorising one set of states as better than another. This is an anti-symmetric partial function J:S×S→R that is non-trivial on at least one pair of inputs.
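To keep the object of discussion concrete, here is a minimal Python sketch of such a judgement J over a toy state space. The state names and the particular values are illustrative assumptions of mine, not anything specified above; the point is only the shape of the object: a partial, anti-symmetric, real-valued function that is non-trivial somewhere.

```python
from typing import Optional

# A minimal sketch of the judgement J: S x S -> R as an anti-symmetric partial
# function, non-trivial on at least one pair. State names and values below are
# illustrative assumptions only.

_judgements = {
    ("eat_apple", "eat_orange"): 1.0,      # first state judged better than the second
    ("stay_home", "go_for_a_walk"): -0.5,  # negative: the second state judged better
}

def J(s1: str, s2: str) -> Optional[float]:
    """Return the judgement on (s1, s2), or None where J is undefined (it is partial)."""
    if (s1, s2) in _judgements:
        return _judgements[(s1, s2)]
    if (s2, s1) in _judgements:
        return -_judgements[(s2, s1)]  # anti-symmetry: J(s1, s2) = -J(s2, s1)
    return None

# Non-triviality: J is non-zero on at least one pair of inputs.
assert any(v != 0.0 for v in _judgements.values())
```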
I continue to be unsure whether we can even claim anti-symmetry of the preference relation. For example, let SA be the state “I eat an apple” and SO the state “I eat an orange”, and suppose that today J(SA,SO) > 0 but tomorrow J(SO,SA) > 0, seemingly violating anti-symmetry. Now of course maybe I misdescribed my own understanding of SA and SO, and they actually include a hidden-to-my-awareness property conditioning them on time or something else, such that anti-symmetry is not violated. But the fact that there may be some property of the states that I didn’t think of at first that salvages anti-symmetry makes me worry that this model is confused in this and other ways, because it was so easy to construct something that seemingly violated the property and yet on further reflection seems not to.
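To spell the worry out, here is a small sketch assuming toy encodings of SA and SO (the encodings are mine, for illustration only): if the states carry no time index, asserting both judgements contradicts anti-symmetry, but folding a hidden time property into the states makes them distinct states and the contradiction disappears.

```python
# Illustrative sketch of the apple/orange example; state encodings are assumptions.

# Without a time index, the two reported judgements contradict anti-symmetry:
#   today:    J(SA, SO) > 0
#   tomorrow: J(SO, SA) > 0, i.e. J(SA, SO) < 0 -- the same pair, two signs.

# With a hidden time property folded into each state, there is no contradiction,
# because ("eat_apple", "today") and ("eat_apple", "tomorrow") are distinct states.
judgements = {
    (("eat_apple", "today"), ("eat_orange", "today")): 1.0,
    (("eat_orange", "tomorrow"), ("eat_apple", "tomorrow")): 1.0,
}

def J(s1, s2):
    if (s1, s2) in judgements:
        return judgements[(s1, s2)]
    if (s2, s1) in judgements:
        return -judgements[(s2, s1)]
    return None

# Anti-symmetry now holds on every pair where J is defined.
for (a, b), v in judgements.items():
    assert J(b, a) == -v
```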
That’s not a slam-dunk argument against this formalization; it’s more me sharing some of my reservations about using this type of model. If we can so easily fail to notice something relevant about how we formalize some simple preferences, what else may we be failing to notice? And what happens if we build an AI based in part on this formalization? Will it also fail to account for relevant aspects of how human preferences are calculated because they are not easily visible to us in the model, or is that a failure of humans to understand themselves rather than of the model? These are the things I’m wrestling with lately.
I also have some reservations about whether we can really model humans as having discrete preferences that we can reason about in this way without getting ourselves into trouble and confusion. That’s not to say I doubt that this model often works, only that I worry it’s missing some important details that are relevant for alignment, and that without accounting for them we will fail to produce aligned AI. I worry about this because there doesn’t seem to be anything in the human mind that actually is a preference; preferences are more like reifications of a pattern of action that appears in humans. Getting closer to understanding the mechanism that produces the pattern we interpret as preferences seems valuable to me in this work, because I worry we’re missing crucial details when we reason about preferences at the level of detail you pursue here.
I see the orange-apple preference reversal as another example of conditional preferences.
I agree that viewing preferences as conditioned on the environment, up to and including the entire history of the observable universe, is a sensible improvement over many more simplistic models, and that it eliminates many of the clear violations of preference normativity those models produce. My concern is that this is not so obvious as to be the normal way of thinking about preferences in all fields, and was non-obvious enough that you had to write a post about the point, which makes me cautious about concluding that the value abstraction you currently use is sufficient for the purposes of AI alignment. I basically view the conditionality of preferences as neutral evidence about the explanatory power of the theory (for the purpose of AI alignment).
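For concreteness, one way to read “conditioned on the environment” is to give the judgement an extra history argument, something like J:H×S×S→R, so that the apple/orange reversal becomes two evaluations of one fixed function rather than one function changing over time. The sketch below assumes my own toy encoding of histories and states; it is meant only to illustrate that reading, not to claim it resolves my concern.

```python
from typing import Optional, Tuple

# Sketch of a judgement conditioned on environment history, J: H x S x S -> R.
# The history and state encodings here are illustrative assumptions.

def J_conditional(history: Tuple[str, ...], s1: str, s2: str) -> Optional[float]:
    """One fixed function; the apparent preference reversal lives in the history argument."""
    if {s1, s2} == {"eat_apple", "eat_orange"}:
        # e.g. whether I recently ate an apple is part of the observable history
        sign = -1.0 if "ate_apple" in history else 1.0
        return sign if s1 == "eat_apple" else -sign
    return None  # still a partial function elsewhere

# Today (no recent apple) the apple is preferred; tomorrow (apple now in the
# history) the orange is, without J_conditional itself changing.
assert J_conditional((), "eat_apple", "eat_orange") == 1.0
assert J_conditional(("ate_apple",), "eat_orange", "eat_apple") == 1.0
```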
Valid point, though conditional meta-preferences are things I’ve already written about, and the issue of being wrong now about what your own preferences would be in the future is also something I’ve addressed multiple times in different forms. Your example is particularly crisp, though.