The reason I brought up this distinction was that in Ambitious vs. narrow value learning you wrote:
It does not make any effort to pursue wildly different short-term goals than I would in order to better realize my long-term values, though it may help me correct some errors that I would be able to recognize as such.
which made me think that when you say “short-term” or “narrow” (I’m assuming you use these interchangeably?) values you are talking about an AI that doesn’t do anything the end user can’t understand the rationale of. But then I read Concrete approval-directed agents where you wrote:
Efficacy: By getting help from additional approval-directed agents, the human operator can evaluate proposals as if she were as smart as those agents. In particular, the human can evaluate the given rationale for a proposed action and determine whether the action really does what the human wants.
and this made me think that you’re also including AIs that do things whose rationale the user can merely evaluate (i.e., things the user could not have an internal understanding of, even hypothetically). Since this “evaluable” interpretation also seems more compatible with strategy-stealing (because an AI that only performs actions that a human can understand can’t “steal” a superhuman strategy), I’m currently guessing this is what you actually have in mind, at least when you’re thinking about how to make a corrigible AI competitive.
Like I mentioned above, I mostly think of narrow value learning as a substitute for imitation learning or approval-direction, realistically to be used as a distillation step rather than as your whole AI. In particular, an agent trained with narrow value learning on its own is probably not aligned+competitive in a way that might allow you to apply this kind of strategy-stealing argument.
In Concrete approval-directed agents I’m talking about a different design; it’s not related to narrow value learning.
I don’t use narrow and short-term interchangeably. I’ve only ever used “narrow” in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning.
Ah, that clears up a lot of things for me. (I saw your earlier comment but was quite confused by it due to not realizing your narrow / short-term distinction.) One reason I thought you used “short-term” and “narrow” interchangeably is due to Act-based agents where you seemed to be doing that:
These proposals all focus on the short-term instrumental preferences of their users. [...]
What is “narrow” anyway?
There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is.
And in that post it also seemed like “narrow value learners” were meant to be the whole AI since it talked a lot about “users” of such AI.
(In that post I did use narrow in the way we are currently using short-term, contrary to my claim in the grandparent. Sorry for the confusion this caused.)