To a first approximation, human motivations involve having a learned world-model, and then some things in the world-model get painted with positive valence (a.k.a. help push the value function higher). For example, if I’m in debt, I can kinda imagine myself being out of debt, and that mental image has a very positive valence (it’s an appealing thought!), and that positive valence in turn helps motivate me to make plans and take actions to bring that about. See Post #7 for a more fleshed-out example.
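To make that informal picture slightly more concrete, here is a toy cartoon in Python. Everything in it is made up for illustration (the concepts, the valence numbers, the plan names); it is a sketch of the intuition above, not a rigorous model of motivation.

```python
# Toy cartoon of the informal picture above. All concepts, numbers, and
# plan names are made up; "planning" here is just picking the plan whose
# imagined outcome has the highest valence.

# A tiny "learned world model": concepts painted with (learned) valence.
valence = {
    "in_debt": -5.0,
    "out_of_debt": +8.0,
    "watching_tv": +1.0,
}

# Candidate plans and the outcome each one brings to mind.
plans = {
    "pick_up_extra_shifts": "out_of_debt",
    "do_nothing": "in_debt",
    "binge_netflix": "watching_tv",
}

def imagined_value(plan: str) -> float:
    """Value of a plan = valence of the outcome it makes you imagine."""
    return valence[plans[plan]]

# The appealing thought ("being out of debt") wins, and motivates action.
best_plan = max(plans, key=imagined_value)
print(best_plan)  # -> pick_up_extra_shifts
```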
Nowhere in this picture has anything been made mathematically rigorous. Nowhere in this picture has anyone defined a utility function. Yet, humans are obviously capable of doing very impressive things. I assume that (by default) future programmers will make AGI motivations that work in similar ways.
If we could figure out how to make and implement a rigorously-defined utility function such that the AGI does the things we want it to do, that would be ridiculously awesome. But I don’t know how. That is the topic of Section 14.5.
The problem is that the steering subsystem does not have a world model and can’t directly refer to anything in a learned world model. Insofar as we want to design the steering subsystem to serve a particular goal, then, we have to design it so that it can recognize which behaviors move the system toward versus away from that goal without needing any particular learned world model at all.
Example: “am I eating sugar? if so, reward!” is a workable steering mechanism, since a presumably simple algorithm in the brainstem is capable of recognizing whether sugar is being eaten or not, and correcting the thought assessors appropriately. But “is this increasing human flourishing? if so, reward!” is not, as I have no idea how to pick out what in the learned world model of the AGI corresponds to “human flourishing”.
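A minimal sketch of that contrast, assuming the steering subsystem only ever sees a handful of hard-coded raw signals (the `raw_sensors` dict and its field names below are hypothetical):

```python
# Hypothetical sketch of the contrast above. The steering subsystem only
# ever sees raw, hard-coded signals; it never sees the learned world model.

def sugar_reward(raw_sensors: dict) -> float:
    """A brainstem-style rule: computable from fixed, innate inputs."""
    # e.g. a hard-wired sweetness-receptor reading, known at design time
    return 1.0 if raw_sensors.get("sweet_receptor", 0.0) > 0.5 else 0.0

def flourishing_reward(learned_world_model) -> float:
    """What we would like to write but cannot: there is no fixed procedure
    that picks out the 'human flourishing' concept inside a world model
    that is learned at runtime and whose internal format we do not know."""
    raise NotImplementedError("no known way to ground this in raw signals")
```

The first function is fully specifiable at design time; the second would need a pointer into a learned world model that we don’t know how to write down.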
But if we can mathematically define agency, consciousness, etc., then it might be possible to make a cascade of steering mechanisms in the “brain stem” that will make the AGI tend to pay attention to things that might be agents, tend to try to determine how conscious they are, tend to try to determine what they want, tend to take actions that give them what they want, etc., in such a way that it can learn in real time how best to do any of those things and we don’t have to worry about what its world model actually looks like, since it will never contradict the definitions of those important concepts that we hardcoded for it. Does that make sense, and if so, am I missing anything important?
I have no idea how to pick out what in the learned world model of the AGI corresponds to “human flourishing”.
Here’s a lousy way, but one with more than zero chance of working given a good deal more thought, and if we can get past the various problems in Sections 14.3-14.4. The AGI watches lots of YouTube videos. Humans label the videos, second-by-second, when there are good examples of human flourishing, and/or when someone literally speaks the words “human flourishing”. These labels are used as supervisory signals that update a “human flourishing” thought assessor. That thought assessor would presumably wind up most strongly linked to the “human flourishing” world-model concept, if any (and also somewhat linked to related concepts like happiness and love and wisdom and whatnot). Then we deploy the AGI, giving it reward in proportion to how strongly each thought it thinks activates the “human flourishing” thought assessor.
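Here’s a rough sketch of what that pipeline might look like, under some strong simplifying assumptions: the AGI’s current “thought” is read out as a fixed-size feature vector, the thought assessor is modeled as a simple logistic-regression probe, and the labeled-video stream is replaced by a random stand-in. All names and sizes below are hypothetical.

```python
import numpy as np

# Hypothetical sketch of the labeling scheme above, with made-up names and
# sizes. Assumptions: the AGI's current "thought" is available as a
# fixed-size vector; labels are the humans' second-by-second 0/1
# "flourishing" annotations; the thought assessor is a logistic probe.

rng = np.random.default_rng(0)
D = 64                     # dimensionality of the thought readout (made up)
w = np.zeros(D)            # assessor weights, shaped by supervised learning
lr = 0.05

def assessor(features: np.ndarray) -> float:
    """Predicted 'human flourishing' strength of the current thought."""
    return float(1.0 / (1.0 + np.exp(-w @ features)))

def labeled_video_stream(n=1000):
    """Stand-in for (thought features, human label) pairs from labeled video."""
    secret = rng.normal(size=D)          # pretend 'flourishing' direction
    for _ in range(n):
        x = rng.normal(size=D)
        yield x, float(x @ secret > 0)

# --- Training phase: supervised updates from the human labels -----------
for x, label in labeled_video_stream():
    w += lr * (label - assessor(x)) * x  # logistic-regression update step

# --- Deployment phase: reward tracks the assessor, not further labels ---
def reward(current_thought_features: np.ndarray) -> float:
    return assessor(current_thought_features)
```

The point of the sketch is just the two-phase structure: supervised human labels shape the assessor, and after deployment the reward tracks the assessor’s activation rather than any further human labels.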
It might be possible to make a cascade of steering mechanisms in the “brain stem” that will make the AGI tend to pay attention to things that might be agents, tend to try to determine how conscious they are, tend to try to determine what they want, tend to take actions that give them what they want, etc., in such a way that it can learn in real time how best to do any of those things and we don’t have to worry about what its world model actually looks like, since it will never contradict the definitions of those important concepts that we hardcoded for it. Does that make sense, and if so, am I missing anything important?
That sounds lovely, but I have no idea how one would write code for any of the things you mention. You should figure it out and then tell me :-P
Your human flourishing example sounds like it wouldn’t generalize well. As the AI grows more capable, it would start taking more and more work for humans to analyze its plans and determine how much flourishing is in them, and if it grows more intelligent after we deploy it, we will have no way to determine if its thought assessor generalizes wrongly. This is, I would think, a rather basic and obvious flaw in relying on any part of the world model directly.
As for how to code that stuff, well, I’ll figure out how to do that after we’ve all figured out how to mathematically specify those things. :P
it would start taking more and more work for humans to analyze its plans and determine how much flourishing is in them
I’m not sure where you’re getting that. The thing I described in my last comment did not include humans analyzing the AI’s plans; it only involved humans labeling YouTube videos.
It would be lovely if humans could reliably analyze the AI’s plans. But I fear that our interpretability techniques will not be up to that challenge.
we will have no way to determine if its thought assessor generalizes wrongly
I agree, see §14.4.
Ah, sorry, I misunderstood you.