For example, an instrumental goal I have is to acquire resources such that I have influence over the future, which is a long-term goal.
This is a bit tangential, but I’ve noticed that this instrumental goal can be interpreted in a narrow way or a broader/more ambitious way. For example, if the AI just learns a list of useful resources I instrumentally value, and how many utils per gram (or whatever suitable unit of measurement) each resource is worth under the various circumstances it has seen in the past, that would be relatively easy to learn but is not going to generalize well. Alternatively, the AI could learn “resources” as a general concept, and be able to infer what counts as a resource in a new environment it hasn’t seen before and how different resources should be traded off against each other as circumstances change. This would be (putting aside human safety problems) really useful in a general and robust way, but it’s unclear to me that we can realistically hope for such a thing. (For example, is it possible to determine the relative values of different resources in a novel situation if you don’t at least have a rough idea what they’ll ultimately be used for?)
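The narrow-vs-broad distinction can be sketched as a toy contrast between a memorized lookup table and a model that infers value from features. Everything here is hypothetical (the resource names, the features, and the weights are made up for illustration); the point is only that the table fails silently on anything unseen, while a learned concept of “resource” can at least extrapolate.

```python
# Narrow version: a memorized table of per-resource values seen in training.
# It can say nothing sensible about a resource it has never encountered.
NARROW_TABLE = {"money": 1.0, "compute": 0.8, "electricity": 0.5}

def narrow_value(resource: str) -> float:
    # Unseen resources get a default of zero -- no generalization.
    return NARROW_TABLE.get(resource, 0.0)

# Broad version: infer value from features that track *why* something is a
# resource (fungibility, scarcity, usefulness for downstream goals), so a
# novel resource in a new environment still gets a meaningful estimate.
def broad_value(features: dict) -> float:
    # Toy linear model; the weights stand in for a learned concept of "resource".
    weights = {"fungibility": 0.5, "scarcity": 0.3, "instrumental_use": 0.7}
    return sum(w * features.get(k, 0.0) for k, w in weights.items())

# A novel resource the narrow learner has never seen:
print(narrow_value("helium-3"))  # 0.0 -- the table has no entry
print(broad_value({"fungibility": 0.2, "scarcity": 0.9, "instrumental_use": 0.8}))
```

Of course, the parenthetical worry above applies to the broad version too: the “right” weights plausibly depend on what the resources will ultimately be used for.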
Do you see the former as the end goal of narrow value learning, or the latter? When you talk about narrow value learning AI preserving option value, do you have the former or the latter kind of narrow value learning in mind?
BTW what is the success story for narrow value learning? Is it the same as for norm-following AI? Is the success story for all of the approaches described in this sequence essentially #4 in this list?
I don’t think that this is inevitable, and expect that the actual answer depends strongly on the details of how the narrow value learning is done.
Do you see the former as the end goal of narrow value learning, or the latter? When you talk about narrow value learning AI preserving option value, do you have the former or the latter kind of narrow value learning in mind?
Closer to the latter, i.e. the broad/ambitious kind. I think it’s a generally reasonable model to imagine human-like narrow value learning capabilities. When I talk about option value, I also have the broad version in mind.
BTW what is the success story for narrow value learning? Is it the same as for norm-following AI? Is the success story for all of the approaches described in this sequence essentially #4 in this list?
Yeah, it’s either #4 or #5 (it remains to be seen whether narrow value learning is sufficient for #5, but it seems possible to me).
It’s also plausible to me that the success story for today’s research is “research on narrow value learning illuminates a lot of important properties that we would want our AI systems to have, which influences AGI research positively”. That is, as a result of the research on narrow value learning, the AGI systems we build will be better at narrow value learning than they otherwise would have been, even though they aren’t using a specific narrow value learning algorithm that we develop. And this in turn leads to success story #4 or #5.
This also feels like it illustrates a difference in our thinking—you seem to be thinking about explicit techniques for alignment, whereas I’m often thinking of implicit ones. It might also be something more like outside view vs. inside view (where somewhat paradoxically I’m on the outside view side)? I’m currently far too confused about this distinction to explain it well, but I wanted to flag it as a thing causing confusion in our dialogue that I hope to resolve. (And I would write a post about it if I did.)
I think it’s a generally reasonable model to imagine human-like narrow value learning capabilities.
I think I’m much more skeptical about this. Humans generally have a fairly good idea of other humans’ “terminal values”, and their narrow value learning is strongly informed by that. I don’t see how the more ambitious kind of narrow value learning could work without this knowledge. As I wrote in the previous comment, “is it possible to determine the relative values of different resources in a novel situation if you don’t at least have a rough idea what they’ll ultimately be used for?”
Maybe you’re imagining that the AI has learned an equally good idea of humans’ “terminal values” but they’re just being used to help with narrow value learning instead of being maximized directly, similar to how a human assistant doesn’t try to directly maximize their boss’s terminal values? So essentially “narrow value learning” is like an explicit algorithmic implementation of corrigibility (instead of learning corrigibility from humans like in IDA). Is this a correct view of what you have in mind?
Yeah, it’s either #4 or #5 (it remains to be seen whether narrow value learning is sufficient for #5, but it seems possible to me).
I guess there’s also hope that it could be used in some hybrid approach to help achieve any of the other positive outcomes.
That is, as a result of the research on narrow value learning, the AGI systems we build will be better at narrow value learning than they otherwise would have been, even though they aren’t using a specific narrow value learning algorithm that we develop.
Do you have an example of something like this happening in the past that could help me understand what you mean here?
I think I’m much more skeptical about this. Humans generally have a fairly good idea of other humans’ “terminal values”, and their narrow value learning is strongly informed by that. I don’t see how the more ambitious kind of narrow value learning could work without this knowledge. As I wrote in the previous comment, “is it possible to determine the relative values of different resources in a novel situation if you don’t at least have a rough idea what they’ll ultimately be used for?”
Maybe you’re imagining that the AI has learned an equally good idea of humans’ “terminal values” but they’re just being used to help with narrow value learning instead of being maximized directly, similar to how a human assistant doesn’t try to directly maximize their boss’s terminal values? So essentially “narrow value learning” is like an explicit algorithmic implementation of corrigibility (instead of learning corrigibility from humans like in IDA). Is this a correct view of what you have in mind?
Partly I want to claim “explicit vs implicit” and table it for now.
But yes, I am expecting that the AI has learned some idea of “terminal values” that helps with learning narrow values. E.g., the AI can at least predict that we personally don’t want to die, that it seems likely we want sentience and conscious experience to continue on into the future, that we probably want happiness rather than suffering, etc., while still not being able to turn any of this into a function to be maximized directly.
It seems probably true that most of the hope that I’m expressing here can be thought of as “let’s use narrow value learning to create an algorithmic implementation of corrigibility”. I feel much better about that description of my position than any other so far, though it still feels slightly wrong in a way I can’t put my finger on.
I guess there’s also hope that it could be used in some hybrid approach to help achieve any of the other positive outcomes.
Yeah, that seems right. I was describing success stories that could potentially occur with only narrow value learning.
Do you have an example of something like this happening in the past that could help me understand what you mean here?
The VNM rationality theorem has (probably) helped me be more effective at my goals (e.g., by being more willing to maximize expected donation dollars rather than putting a premium on low risk) even though I am not literally running expected utility maximization.
I could believe that the knowledge of Dijkstra’s algorithm significantly influenced the design of the Internet (specifically the IP layer), even though the Internet doesn’t use it.
Insights from social science about what makes a “good explanation” are influencing interpretability research currently.
Einstein was probably only able to come up with the theory of relativity because he already understood Newton’s theory, even though Newton’s theory was in some sense wrong.
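The first example above (expected donation dollars vs. a premium on low risk) can be made concrete with made-up numbers. The gamble and the square-root utility below are purely illustrative: an expected-value maximizer takes the risky option, while a risk-averse chooser with concave utility prefers the sure thing.

```python
import math

safe = 100.0                        # certain donation of $100
risky = [(0.5, 250.0), (0.5, 0.0)]  # 50% chance of $250, else nothing

# Expected-value comparison: the gamble wins.
expected_risky = sum(p * x for p, x in risky)  # 125.0 > 100.0

# A risk-averse agent with utility u(x) = sqrt(x) ranks them the other way.
u = math.sqrt
safe_utility = u(safe)                           # 10.0
risky_utility = sum(p * u(x) for p, x in risky)  # 0.5 * sqrt(250) ~ 7.9

print(expected_risky > safe)         # True: EV-maximizer takes the gamble
print(risky_utility > safe_utility)  # False: sqrt-utility agent plays it safe
```

The analogy to the point being made: knowing the theorem nudges one toward the first comparison, even without explicitly computing utilities for every decision.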
Ok, I think I mostly understand now, but it seems like I had to do a lot of guessing and asking questions to figure out what your hopes are for the future of narrow value learning and how you see it potentially fitting into the big picture of long-term AI safety, which are important motivations for this part of the sequence. Did you write about them somewhere that I missed, or were you planning to write about them later? If the latter, I think it would have been better to write about them at the same time that you introduced narrow value learning, so readers have some idea of why they should pay attention to it. (This is mostly feedback for future reference, but I guess you could also add to previous posts for the benefit of future readers.)
Yeah, this seems right. I didn’t include them because it’s a lot more fuzzy and intuition-y than everything else that I’ve written. (This wasn’t an explicit, conscious choice; more like when I generated the list of things I wanted to write about, this wasn’t on it because it was insufficiently crystallized.) I agree that it really should be in the sequence somewhere; I’ll probably add it to the post on narrow value learning some time after the sequence is done.
AI safety without goal-directed behavior very vaguely gestures in the right direction, but there’s no reasonable way for a reader to figure out my hopes for narrow value learning from that post alone.