I think I’m much more skeptical about this. Humans generally have a fairly good idea of other humans’ “terminal values”, and their narrow value learning is strongly informed by that. I don’t see how the more ambitious kind of narrow value learning could work without this knowledge. As I wrote in the previous comment, “For example, is it possible to determine the relative values of different resources in a novel situation if you don’t at least have a rough idea what they’ll ultimately be used for?”
Maybe you’re imagining that the AI has learned an equally good idea of humans’ “terminal values”, but that this knowledge is just being used to help with narrow value learning instead of being maximized directly, similar to how a human assistant doesn’t try to directly maximize their boss’s terminal values? So essentially “narrow value learning” is like an explicit algorithmic implementation of corrigibility (instead of learning corrigibility from humans, as in IDA). Is this a correct view of what you have in mind?
Partly I want to claim “explicit vs implicit” and table it for now.
But yes, I am expecting that the AI has learned some idea of “terminal values” that helps with learning narrow values, e.g. the AI can at least predict that we personally don’t want to die, that it seems likely we want sentience and conscious experience to continue on into the future, that we probably want happiness rather than suffering, etc., while still not being able to turn this into a function to be maximized directly.
It seems probably true that most of the hope that I’m expressing here can be thought of as “let’s use narrow value learning to create an algorithmic implementation of corrigibility”. I feel much better about that description of my position than any other so far, though it still feels slightly wrong in a way I can’t put my finger on.
I guess there’s also hope that it could be used in some hybrid approach to help achieve any of the other positive outcomes.
Yeah, that seems right. I was describing success stories that could potentially occur with only narrow value learning.
Do you have an example of something like this happening in the past that could help me understand what you mean here?
The VNM rationality theorem has (probably) helped me be more effective at my goals (e.g. by being more willing to maximize expected donation dollars rather than putting a premium on low risk; see the toy sketch after these examples) even though I am not literally running expected utility maximization.
I could believe that the knowledge of Dijkstra’s algorithm significantly influenced the design of the Internet (specifically the IP layer), even though the Internet doesn’t use it.
Insights from social science about what makes a “good explanation” are influencing interpretability research currently.
Einstein was probably only able to come up with the theory of relativity because he already understood Newton’s theory, even though Newton’s theory was in some sense wrong.
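To make the VNM example above concrete, here’s a toy numerical sketch (the grant options, dollar figures, and probabilities are invented for illustration, not taken from the original comment): comparing a safe option against a riskier one by expected donation dollars, rather than putting a premium on certainty.

```python
# Toy illustration of "maximize expected donation dollars" vs. "premium on low risk".
# All numbers here are invented for the example.

def expected_value(outcomes):
    """Expected dollars donated, given (probability, dollars) pairs."""
    return sum(p * dollars for p, dollars in outcomes)

# Option A: a safe grant that yields $5,000 with certainty.
safe = [(1.0, 5_000)]

# Option B: a riskier grant that yields $12,000 half the time and nothing otherwise.
risky = [(0.5, 12_000), (0.5, 0)]

print(expected_value(safe))   # 5000.0
print(expected_value(risky))  # 6000.0 -- higher expected donation dollars, so the
                              # "VNM-flavored" choice is the risky grant despite its variance.
```

The point is just that knowing the theorem nudges the choice toward the higher expected value, even though nobody is literally running expected utility maximization over all of their decisions.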
Ok, I think I mostly understand now, but it seems like I had to do a lot of guessing and asking questions to figure out what your hopes are for the future of narrow value learning and how you see it potentially fitting into the big picture for long-term AI safety, which are important motivations for this part of the sequence. Did you write about them somewhere that I missed, or were you planning to write about them later? If later, I think it would have been better to write about them at the same time that you introduced narrow value learning, so readers have some idea of why they should pay attention to it. (This is mostly feedback for future reference, but I guess you could also add to previous posts for the benefit of future readers.)
Yeah, this seems right. I didn’t include them because it’s a lot more fuzzy and intuition-y than everything else that I’ve written. (This wasn’t an explicit, conscious choice; more like when I generated the list of things I wanted to write about, this wasn’t on it because it was insufficiently crystallized.) I agree that it really should be in the sequence somewhere, I’ll probably add it to the post on narrow value learning some time after the sequence is done.
AI safety without goal-directed behavior very vaguely gestures in the right direction, but there’s no reasonable way for a reader to figure out my hopes for narrow value learning from that post alone.