Re the human flourishing example—it seems to me that a better choice of thought assessor / ultimate value is “Does this tend to increase the total subjective utility (weighted by amount of consciousness) of all sentient beings?” It’s simple, relies on probably-natural abstractions (utility, consciousness, sentient, agents), does not rely on arbitrary things that are hard to define like what exactly a “human” is, and I think most human morals (at least of the second order want-to-want kind) fall straight out of it.
Defining the utility function of an arbitrary agent is an issue, of course, but if an entity does not have coherent desires, its subagents could perhaps be factored into the calculation, with each subagent's moral relevance equal to that of the whole being multiplied by the "proportion" of the mind that subagent controls. But perhaps this is just CEV again. Actually, given that animals don't particularly care about (or know about) the concept of uplifting and yet I consider it a moral imperative, I must actually want CEV after all. Heh.
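To make the weighting explicit (the notation below is just my own shorthand, nothing standard), I mean something like

$$U \;=\; \sum_i c_i \, u_i,$$

where $c_i$ is the amount of consciousness of sentient being $i$ and $u_i$ is its subjective utility; and for an entity without coherent desires, replace $u_i$ with $\sum_j p_{ij} \, u_{ij}$, where $u_{ij}$ is subagent $j$'s utility and $p_{ij}$ is the "proportion" of the mind it controls (so $\sum_j p_{ij} = 1$).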
There are some potential failure modes here, of course. For instance, the AGI may come to believe in agents that do not really exist (humans do this all the time, and factor them into their moral calculations: spirits, for instance!). Well, of course, spirits do exist, but only as self-replicating [via proselytization etc.] subagents in human brains, not as external entities with consciousness of their own, and they have minimal moral relevance. But it's probably possible to constrain the AGI to consider only those entities which have a known, bounded physical location (allowing for such notions of "bounding" as would be needed to locate a highly dispersed digital entity in space...), or some such thing.
Ultimately though, this is just a special case of the social instincts thing. I would just want it to be hardwired to feel things like lovingkindness, compassion, and sympathetic joy for all sentient beings, not just humans. A bodhisattva, in other words. :)
I agree that if we can make an AGI motivated by an arbitrary English-language sentence, “maximize human flourishing” is probably not the optimal choice. I was using that as an example / placeholder. As mentioned, I’m more interested in the other question, i.e. how do we make an AGI motivated by an arbitrary English-language sentence?
Also, my hunch is that the more complicated the sentence, and the harder it is to find salient concrete examples of it, the harder and more fraught it would be to make an AGI motivated by it. In that respect, "maximize human flourishing" would probably have an edge over "Does this tend to increase the total subjective utility (weighted by amount of consciousness) of all sentient beings?", or CEV, etc.
Hmm. I'm not sure if I believe that. But I get what you mean. To me, English-language sentences seem to rely for their meaning on the life experience of English speakers, and to have far more complexity than they appear to have. Example: try to rigorously define "woman" in a way every English speaker would agree on. It's very hard, if not impossible.
As a result, I prefer trying to think of utility functions that at least in principle can be made mathematically rigorous. I think my example is actually far simpler than “maximize human flourishing”, in other words. And I really don’t want a difference in interpretation of words to lead to misalignment. But perhaps I misunderstand you and you have some notion that there’s a way around that problem?
To a first approximation, human motivations involve having a learned world-model, and then some things in the world-model get painted with positive valence (a.k.a. help push the value function higher). For example, if I’m in debt, I can kinda imagine myself being out of debt, and that mental image has a very positive valence (it’s an appealing thought!), and that positive valence in turn helps motivate me to make plans and take actions to bring that about. See Post #7 for a more fleshed-out example.
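Here's a cartoonish toy sketch of that picture, with all the numbers and concept names made up purely for illustration (not a claim about how anything is actually implemented):

```python
# Toy illustration: a learned world-model in which some concepts carry valence,
# and plans are scored by the valence of the outcome they call to mind.

# Hypothetical learned valence assignments over world-model concepts
valence = {
    "in_debt": -0.8,
    "out_of_debt": +0.9,
}

# Hypothetical plans, each tagged with the imagined outcome of executing it
plans = {
    "pick_up_extra_shifts": "out_of_debt",
    "ignore_the_bills": "in_debt",
}

def appeal(plan: str) -> float:
    """Score a plan by the valence of the outcome it brings to mind."""
    return valence.get(plans[plan], 0.0)

# The most appealing thought wins out and drives planning / action.
best_plan = max(plans, key=appeal)
print(best_plan)  # -> pick_up_extra_shifts
```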
Nowhere in this picture has anything been made mathematically rigorous. Nowhere in this picture has anyone defined a utility function. Yet, humans are obviously capable of doing very impressive things. I assume that (by default) future programmers will make AGI motivations that work in similar ways.
If we could figure out how to make and implement a rigorously-defined utility function such that the AGI does the things we want it to do, that would be ridiculously awesome. But I don't know how. That is the topic of Section 14.5.
The problem is that the steering subsystem does not have a world model and can't directly refer to anything in a learned world model. Insofar as we want to design the steering subsystem to serve a particular goal, then, we have to design it in such a way that it can recognize which behaviors move the AGI towards versus away from that goal without needing any particular learned world model at all.
Example: “am I eating sugar? if so, reward!” is a good steering mechanism, as a presumably simple algorithm in the brainstem is capable of recognizing whether sugar is being eaten or not, and correcting thought assessors appropriately. But, “is this increasing human flourishing? if so, reward!” is not, as I have no idea how to pick out what in the learned world model of the AGI corresponds to “human flourishing”.
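To put that contrast in toy pseudo-code (hypothetical sensor names, purely illustrative):

```python
# Toy contrast: a reward check the steering subsystem can do with hardwired
# sensors alone, versus one that would have to reach inside a learned world
# model whose contents we don't control.

def sugar_reward(sweetness_sensor: float) -> float:
    """'Am I eating sugar?' Answerable from a fixed, genome-specified sensor."""
    return 1.0 if sweetness_sensor > 0.5 else 0.0

def flourishing_reward(world_model) -> float:
    """'Is this increasing human flourishing?' No fixed sensor to consult;
    the relevant concept lives somewhere inside the learned world model."""
    raise NotImplementedError("no known grounding in hardwired sensors")
```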
But if we can mathematically define agency, consciousness, etc., then it might be possible to make a cascade of steering mechanisms in the "brain stem" that will make the AGI tend to pay attention to things that might be agents, tend to try to determine how conscious they are, tend to try to determine what they want, and tend to take actions that give them what they want. It could then learn in real time how best to do any of those things, and we wouldn't have to worry what its world model actually looks like, as it would never contradict the definitions of those important concepts that we hardcoded for it. Does that make sense, and if so am I missing anything important?
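The overall shape I have in mind is something like this, with every hard part left as an unimplemented stub (all function names hypothetical):

```python
# Sketch of the cascade: hardcoded definitions at each stage, learned
# competence at satisfying them. The stubs are exactly the parts we'd
# need the mathematical definitions for.

def candidate_agents(observations):
    """Flag things that might be agents, per a hardcoded definition of agency."""
    raise NotImplementedError

def consciousness_weight(agent) -> float:
    """Estimate 'amount of consciousness', per a hardcoded definition."""
    raise NotImplementedError

def inferred_utility(agent, action) -> float:
    """Estimate how much this action gives the agent what it wants."""
    raise NotImplementedError

def steering_reward(observations, action) -> float:
    """Reward actions in proportion to consciousness-weighted preference satisfaction."""
    return sum(
        consciousness_weight(a) * inferred_utility(a, action)
        for a in candidate_agents(observations)
    )
```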
I have no idea how to pick out what in the learned world model of the AGI corresponds to “human flourishing”.
Here's a lousy way, but one which has more than zero chance of working, given a good deal more thought and if we can get past the various problems in Sections 14.3-14.4. The AGI watches lots of YouTube videos. Humans label the videos, second-by-second, when there are good examples of human flourishing, and/or when someone literally speaks the words "human flourishing". These labels are used as supervisory signals that update a "human flourishing" thought assessor. That thought assessor would presumably wind up most strongly linked to the "human flourishing" world-model concept if any (and also somewhat linked to related concepts like happiness and love and wisdom and whatnot). Then we deploy the AGI, giving it reward in proportion to how strongly each thought it thinks activates the "human flourishing" thought assessor.
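In pseudocode, the loop I'm imagining is roughly the following, where the "assessor" object and its methods are hypothetical stand-ins for whatever learned model plays that role:

```python
# Rough sketch of the scheme: supervised training of a "human flourishing"
# thought assessor from second-by-second video labels, then reward in
# proportion to how strongly each thought activates it.

def train_assessor(assessor, labeled_videos):
    """Phase 1: update the assessor toward the human-provided labels."""
    for video in labeled_videos:
        for internal_state, label in video:  # one (state, label) pair per second
            assessor.update(internal_state, target=label)
    return assessor

def reward(assessor, current_thought) -> float:
    """Phase 2 (deployment): reward = the assessor's activation on the current thought."""
    return assessor.predict(current_thought)
```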
It might be possible to make a cascade of steering mechanisms in the "brain stem" that will make the AGI tend to pay attention to things that might be agents, tend to try to determine how conscious they are, tend to try to determine what they want, and tend to take actions that give them what they want. It could then learn in real time how best to do any of those things, and we wouldn't have to worry what its world model actually looks like, as it would never contradict the definitions of those important concepts that we hardcoded for it. Does that make sense, and if so am I missing anything important?
That sounds lovely, but I have no idea how one would write code for any of the things you mention. You should figure it out and then tell me :-P
Your human flourishing example sounds like it wouldn't generalize well. As the AI's capabilities grow, it would start taking more and more work for humans to analyze its plans and determine how much flourishing is in them, and if it grows more intelligent after we deploy it, we will have no way to determine if its thought assessor generalizes wrongly. This is, I would think, a rather basic and obvious flaw in relying on any part of the world model directly.
As for how to code that stuff, well, I’ll figure out how to do that after we’ve all figured out how to mathematically specify those things. :P
it would start taking more and more work for humans to analyze its plans and determine how much flourishing is in them
I'm not sure where you're getting that. The thing I described in my last comment did not include humans analyzing the AI's plans; it only involved humans labeling YouTube videos.
It would be lovely if humans could reliably analyze the AI’s plans. But I fear that our interpretability techniques will not be up to that challenge.
we will have no way to determine if its thought assessor generalizes wrongly
I agree, see §14.4.
Ah, sorry, I misunderstood you.