Meaning “simple utility function” by the phrase “utility function” might be a conceptual trap. It makes a big difference whether you consider a function with hundreds of terms, or billions of terms, or even things that cannot be expressed as a sum.
As a “tricky utility function”, “human utility function” is mostly fine. Simple utility functions are relevant to today’s programming, but I don’t know whether honing your concepts to apply better to AGI is served by making a cleanly cut concept that is limited to only that domain.
Some hidden assumptions might be things like “If humans have a utility function, it can be written down” or “Figuring out a human’s utility function is a practical epistemological stance for a single agent encountering new humans”.
If you take stuff like that out, the “mere” existence of a function is not that weighty a point.
As you may already know, humans are made of atoms. Collections of atoms don’t have utility functions glued to them
Whole theories of physics can be formulated as a single action that is then extremised. Taking different theories as different answers to a question like “what happens next?”, a single theory’s formula is its “choice”. Thus it seems a lot like physical systems could be understood in terms of utility functions. An electron knows how an electron behaves; it does have a behaviour glued into it. If you just add a lot of electrons or protons (and other stuff that has similar laws), it is not as if aggregating the microbehaviours makes the function fail to be a function as a macrobehaviour.
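For concreteness, the “single action that is then extremised” can be written in its standard textbook form. This is only a sketch of the classical-mechanics case; the symbols $S$ (action), $L$ (Lagrangian), and $q$ (generalised coordinate) are the usual conventions and are not taken from this thread.

```latex
S[q] = \int_{t_0}^{t_1} L\big(q(t), \dot q(t), t\big)\,dt,
\qquad
\delta S = 0
\;\Longrightarrow\;
\frac{d}{dt}\frac{\partial L}{\partial \dot q} - \frac{\partial L}{\partial q} = 0 .
```

The trajectory the system actually follows is one that makes $S$ stationary. Stationary is weaker than minimal, so “extremised” is the safer word than “minimised”.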
I’ll reiterate that a problem with this is lack of uniqueness. There is not a thing that is the human utility function, even if you allow arbitrarily messy utility functions. If you assume that there is one, it turns out that this is a weighty meta-level commitment even if your class of utility functions is so broad as to be useless on the object level.
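One small, well-understood piece of this non-uniqueness can be shown in a toy sketch: choice behaviour alone does not pin down a utility function, because any strictly increasing transform of it picks the same options. The options, the numbers, and the helper names (`u1`, `u2`, `choose`, `all_menus`) are all made up for illustration, and the sketch does not try to capture the whole meta-level point.

```python
# Hypothetical toy example: options and utility values are invented for illustration.
from itertools import combinations

options = ["apple", "banana", "cherry"]


def u1(x):
    """One candidate utility function over the options."""
    return {"apple": 1.0, "banana": 2.0, "cherry": 3.0}[x]


def u2(x):
    """A strictly increasing transform of u1 (cube plus an offset)."""
    return u1(x) ** 3 + 10.0


def choose(utility, menu):
    """Pick the option with the highest utility from a menu."""
    return max(menu, key=utility)


def all_menus(items):
    """Every non-empty subset of the items, as a list of lists."""
    return [list(c) for r in range(1, len(items) + 1) for c in combinations(items, r)]


menus = all_menus(options)

# The two functions assign different numbers but agree on every choice.
same_choices = all(choose(u1, m) == choose(u2, m) for m in menus)
print(same_choices)
```

Observed choices over these menus cannot tell `u1` from `u2`, so “the” utility function is at best an equivalence class unless some further commitment picks a representative.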
I think reflection could help a lot with this, deciding how to proceed in formulating preference based on currently available proxies for preference (with some updatelessness taking care of undue path sensitivity). At some point, preference mostly develops itself, without looking at external data.
If you can agree that a system of two electrons can still be predicted by minimizing an action, then you should agree that how a system of two humans plays out can still, in principle, be accounted for in the same way. Iterate a little bit and you have a predictable 6 billion human system.
So what operation are we doing where this particular object level is relevant?
I don’t understand what you mean, particularly the last question.
Yes, electrons and humans can be predicted by the laws of physics. The laws of physics are not uniquely specified by our observations, but they are significantly narrowed down by Occam’s razor. But how are you thinking this applies to alignment? We don’t want an AI to learn “humans are collections of atoms and what they really want is to follow the laws of physics.”
Questions like “what would this human do in a situation where there is a cat in a room” have a unique answer that reflects reality: if that kind of situation were run, then something would have to happen.
Sure, if we start from highly abstract values and then try to make them more concrete, we might lose the way. If we can turn philosophies into feelings but do not know how to turn feelings into chemistry, then there is a level of representation that might not be sufficient. But we know there is one level that is sufficient to describe action, and that all the levels are somehow (maybe in an unknown way) connected (mostly stacked on top of each other). So this incompatibility of representation cannot be fundamental, because if it were, there would be a gap between the levels and the thing would not be connected anymore.
So there is no question of the form “presented with this stimulus, how would the human react?” that would be in principle unanswerable. If preferences are expressed as responses to choice situations, this is a subcategory of reaction. Even if preferences are expressed as responses to philosophy prompts, they would be a subcategory.
One could say that it is not super clarifying that if a two-human system is presented with the philosophical stimulus “Is candy worth $4?”, you get one human that says “yes” and another that says “no”. But this is just a squiggle in the function. The function is really inconvenient when you can’t use an approximation where you think of just one “average human” that all humans reflect very closely. But we are not promised that the function is a function of time of day, or of verbal short-term memory, or of television broadcast data.
Maybe you are saying something like “genetic fitness doesn’t exist” because some animals are fit when they are small and some animals are fit when they are large, so there is no consistent account of whether smallness is good or not. Then “human utility function doesn’t exist” because human A over here dares to have different opinions and strategies than human B over there, and they do not end up mimicking each other. But just as an animal lives or dies, a human will zig or zag. And it cannot be that the zigging would fail to be a function of the world state (with some QM assumed away as non-significant, and even then maybe not). What it can do is fail to be a function of the world state as we understand it, or as our computer system models it, or as can be captured in the variables we are using. But then the question is whether we can make do with just these variables, not whether there is anything to model.
In this language it could be rephrased:
If you think you have a good wide set of variables to come up with any needed solution function, you don’t. You have too few variables.
But the “function” in this sense is how the computer system models reality (or something like the attitudinal modes it can take towards reality). Part of how we know that the setup is inadequate is that there is an entity outside of the system that is not reflected in it. That is, this system can only zig or zag when we needed zog, which it cannot do. The thing that will keep on being missed is the way that reality actually dances. Maybe in some small bubbles we can actually have totally capturing representations in the senses that we care about. But there is a fact of the matter to the inquiry. For any sense we might care about, there is a slice of the whole thing that is sufficient for it. To express zog you need these features; to express zeg you need these other ones.
Human will is quite complex so we can reasonably expect to be spending quite a lot of time in undermodelling. But that is a very different thing from being unmodellable.
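The difference between undermodelling and being unmodellable can be sketched with a toy model. Everything here is an assumption made for illustration: the three state variables, the behaviour rule, and the names `behaviour`, `is_function_of`, `full`, and `only_cat` are hypothetical. The sketch checks whether behaviour is a function of a chosen set of variables by testing whether states that look identical under that view ever behave differently.

```python
# Hypothetical toy model: the state variables and the behaviour rule are invented.
from collections import defaultdict


def behaviour(state):
    """What the 'human' does, as a function of the FULL state."""
    hungry, cat_in_room, slept_well = state
    if cat_in_room and slept_well:
        return "zig"
    if cat_in_room:
        return "zag"
    return "zog" if hungry else "zig"


def is_function_of(projection, states):
    """True if states that look identical under `projection` always behave the same."""
    seen = defaultdict(set)
    for s in states:
        seen[projection(s)].add(behaviour(s))
    return all(len(actions) == 1 for actions in seen.values())


# Enumerate every combination of the three boolean state variables.
all_states = [(h, c, w)
              for h in (False, True)
              for c in (False, True)
              for w in (False, True)]


def full(s):
    """The model sees the whole state."""
    return s


def only_cat(s):
    """The model only tracks whether a cat is in the room."""
    return (s[1],)


print(is_function_of(full, all_states))      # checked against the full state
print(is_function_of(only_cat, all_states))  # checked against the reduced variable set
```

In this toy case the behaviour is a function of the full state but not of the cat variable alone. The failure in the second check says the variable set is too small, not that there is nothing to model.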
Questions like “what would this human do in a situation where there is a cat in a room” have a unique answer that reflects reality: if that kind of situation were run, then something would have to happen.
It’s not about what the human would do in a given situation. It’s about values – not everything we do reflects our values. Eating meat when you’d rather be vegetarian, smoking when you’d rather not, etc. How do you distinguish biases from fundamental intuitions? How do you infer values from mere observations of behavior? There are a bunch of problems described in this sequence. Not to mention stuff I discuss here about how values may remain under-defined even if we specify a suitable reflection procedure and have people undergo that procedure.
Ineffective values do not need to be considered for a utility function, as they do not affect what gets striven for. If you say “I will choose B” and still choose A, you are still choosing A. You are not required to be aware of your utility function.
That is a lot of material to go through en masse, so I will need some sharper pointers to what is relevant in order to actually engage.
Ineffective values do not need to be considered for a utility function, as they do not affect what gets striven for. If you say “I will choose B” and still choose A, you are still choosing A. You are not required to be aware of your utility function.
Uff, a future where humans get more of what they’re striving for but without adjusting for biases and ineffectual values? Why would you care about saving our species, then?
It sounds like people are using “utility function” in different ways in this thread.
I do think that there is a lot of confusion, and definitional groundwork would probably bear fruit.
If one is trying to “save” some fictitious homo economicus that differs significantly from actual humans, that is not really saving humans.
A world view where humans-as-is are too broken to bother salvaging is rather bleak. I see that the transition away from biases can be modelled as having a utility function with biases, then describing a utility function “without biases” (the “how the behaviour should be”), and arguing what kind of tweaks we need to make to the gears so that we get from the first white box to the target white box. Part of this is getting the “broken state of humans” modelled accurately. If we can get a computer to follow that, we would hit aligned exactly-medium-AI. Then we can ramp up the virtuosity of the behaviour (by providing a more laudable utility function).
There seems to be an approach where we just describe the “ideal behaviour utility function” and try to get the computers to do that, without any of the humans having the capability to know or to follow such a utility function. First make it laudable and then make it reminiscent of humans (hopefully making it human-approvable).
The exactly-medium-AI function is not problematically ambiguous. “Ideal reasoning behaviour” is subject to significant and hard-to-reconcile differences of opinion. “Human utility function” refers to exactly-medium-AI, only run on carbon.
I would benefit from, and appreciate, anyone bothering to fish out conflicting or inconsistent uses of the concept.