It’s not really clear to me what one would want out of “more explicit models of this instrumental preference for autonomy.” It’s a complex and messy preference that is tied up with other similarly complex and messy preferences. It probably doesn’t have a simple or natural definition in any reasonable ontology.
What concrete questions about this preference would you hope to answer?
To the extent this preference causes a system to have good behavior, it will be because it affects humans’ behavior, e.g. a human would predictably and systematically decline actions that significantly reduce their own autonomy. So we need to set up our system so that these effects on human behavior lead to it also avoiding actions that significantly reduce the user’s autonomy.
You seem to have in mind a particular version, where the agent infers some latent structure which can then be used to correctly evaluate situations unlike those that appear in the training data (and in particular, to compare the plans put forth by an AI rather than those put forth by a human). So maybe you want to know something about what kind of concepts can robustly transfer from one domain to another quite different domain. It feels to me like you are only going to find bad news here, unless you first make some significant conceptual contributions in AI. So it seems like the first step would be to look for some good news anywhere and see what kind of good news you have to work with. (Or to wait until the AI community produces some good news on its own, and work on other problems in the meantime.)
A somewhat different angle:
The “instrumental goal pursuer” is no more or less dynamically inconsistent than the human. The human wouldn’t lock the current goal in place, and so obviously any preferences that successfully explain the human’s short-term behavior won’t lock the current goal in place. This is a simple observation that already appears in the training set.
This doesn’t require learning a complex concept of autonomy. It just requires learning a model of human preferences that roughly reproduces human behavior. If you don’t get this kind of thing right, then it seems pretty clear that you aren’t going to get useful behavior out of the system in general. Now you may take this as a general argument against value learning, or as an argument that value learning will be difficult, but it doesn’t seem like we should consider these kinds of preferences as any different from normal preferences.
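As a schematic illustration of how this falls out of just fitting behavior (the hypotheses, action labels, and numbers below are made up for the sketch): a learner comparing a "goal-locking" model of the human against one that allows goal revision will favor the latter as soon as the observed behavior contains any goal switches.

```python
# Toy sketch (illustrative only): two hypotheses about the human's behavior.
# "locks_goal" says the human keeps pursuing whatever goal they started with;
# "switches_goal" says the human sometimes revises goals. Observed goal
# switches in the data favor the second hypothesis, without any explicit
# concept of "autonomy" being represented anywhere.
from math import log

def log_likelihood(hypothesis, trajectory):
    """Log-probability the hypothesis assigns to an observed action sequence."""
    # Stand-in switch probabilities; a real behavioral model would be richer.
    p_switch = {"locks_goal": 0.001, "switches_goal": 0.15}[hypothesis]
    total = 0.0
    for action in trajectory:
        p = p_switch if action == "switch_goal" else 1.0 - p_switch
        total += log(p)
    return total

# Observed human behavior includes occasional goal revisions.
observed = (["pursue_current_goal"] * 15 + ["switch_goal"]) * 2

scores = {h: log_likelihood(h, observed) for h in ("locks_goal", "switches_goal")}
print(scores)
print("preferred hypothesis:", max(scores, key=scores.get))  # -> switches_goal
```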
One model I might want to build is something like a hierarchical planning algorithm. It would have some supergoal, and then find subgoals of the supergoal. If the system just naively optimized for the subgoals, then it might do silly things like lock its subgoals in place. Instead, this algorithm should prefer plans that preserve the agent’s autonomy (in case the agent changes subgoals). If this model works, then maybe we can use it to derive a partial solution to the hard problem of corrigibility. So the real question I want to answer is something like “what kind of AI would an agent with a preference for autonomy choose to build”; I suspect that this AI design will be corrigible in some way. I think this is even useful if the agent is much simpler than a human.
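Here is a minimal sketch of the kind of planner I have in mind (the class names, the reachability proxy, and the weight are placeholder assumptions, not a worked-out design): candidate plans are scored by progress on the current subgoal plus a bonus for how many other subgoals remain reachable afterwards, so the planner avoids locking in the current subgoal.

```python
# Hypothetical sketch of the hierarchical-planning model described above.
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Plan:
    name: str
    achieves_current_subgoal: bool
    subgoals_still_achievable: FrozenSet[str]  # subgoals reachable after execution

def plan_score(plan: Plan, all_subgoals: FrozenSet[str], autonomy_weight: float = 0.5) -> float:
    progress = 1.0 if plan.achieves_current_subgoal else 0.0
    # Crude autonomy proxy: fraction of subgoals the agent could still switch to.
    autonomy = len(plan.subgoals_still_achievable) / len(all_subgoals)
    return progress + autonomy_weight * autonomy

subgoals = frozenset({"build_tool", "gather_info", "ask_overseer"})

candidates = [
    # Achieves the current subgoal but forecloses the others (irreversible action).
    Plan("lock_in", True, frozenset({"build_tool"})),
    # Achieves the current subgoal while keeping the other subgoals reachable.
    Plan("reversible", True, frozenset(subgoals)),
]

best = max(candidates, key=lambda p: plan_score(p, subgoals))
print(best.name)  # -> "reversible"
```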
You seem to have in mind a particular version, where the agent infers some latent structure which can then be used to correctly evaluate situations unlike those that appear in the training data (and in particular, to compare the plans put forth by an AI rather than those put forth by a human). So maybe you want to know something about what kind of concepts can robustly transfer from one domain to another quite different domain.
Yeah, this seems accurate. I think this goes back to you being slightly more pessimistic than me about making progress on ontology identification (though I’m still somewhat pessimistic).
This doesn’t require learning a complex concept of autonomy. It just requires learning a model of human preferences that roughly reproduces human behavior.
Right, a good supervised learner should learn this. This is more of a problem if we’re using the model’s internal representation, not just its predictions.
This is more of a problem if we’re using the model’s internal representation, not just its predictions.
But you aren’t directly using the model’s internal representation, are you? You are using it only to make predictions about the human’s preferences in some novel domain (e.g. over the consequences of novel kinds of plans).
It seems like it would be cleaner to discuss the whole thing in the context of transfer to a new domain, rather than talking about directly using the learned representation, unless I am missing some advantage of this framing.
Are you hoping to do transfer learning for human preferences in a way that depends on having a detailed understanding of those preferences (e.g. that depends in particular on a detailed understanding of the human preference for autonomy)? I would be very surprised by that. It seems like if you succeed you must be able to robustly transfer lots of human judgments to unfamiliar situations. And for that kind of solution, it’s not clear how an understanding of particular aspects of human preferences really helps.
It seems like it would be cleaner to discuss the whole thing in the context of transfer to a new domain, rather than talking about directly using the learned representation, unless I am missing some advantage of this framing.
I agree with this. Problems with learning this preference should cause the system to make bad predictions (I think I was confused when I wrote that this problem only shows up with internal representations). Now that I think about it, it seems like you’re right that a system that correctly learns abstract human preferences would also learn the preference for autonomy. So this is really a special case of zero-shot transfer learning of abstract preferences. My main motivation for specifically studying the preference for autonomy is that maybe you can turn a simple version of it into a model for corrigibility.
Are you hoping to do transfer learning for human preferences in a way that depends on having a detailed understanding of those preferences (e.g. that depends in particular on a detailed understanding of the human preference for autonomy)?
I think I mostly want some story for why the preference for autonomy is even in the model’s hypothesis space. It seems that if we’re already confident that the system can learn abstract preferences, then we could also be confident that the system can learn the preference for autonomy; but maybe it’s more of a problem if we aren’t confident of this (e.g. the system is only supposed to learn and optimize for fairly concrete preferences).
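To make “in the hypothesis space” concrete, a toy illustration (all feature names and weights are invented): a hypothesis class restricted to concrete outcome features cannot express a preference for keeping options open, while a class with one extra option-retention term can.

```python
# Hypothetical illustration of whether a (crude) preference for autonomy is
# expressible at all: a "concrete" hypothesis scores outcomes only on
# object-level features, while an "abstract" one adds an options term.
from typing import Callable, Dict

Outcome = Dict[str, float]  # e.g. {"money": ..., "health": ..., "options_retained": ...}

def concrete_hypothesis(w_money: float, w_health: float) -> Callable[[Outcome], float]:
    # Can only express preferences over concrete features of the outcome.
    return lambda o: w_money * o["money"] + w_health * o["health"]

def abstract_hypothesis(w_money: float, w_health: float, w_autonomy: float) -> Callable[[Outcome], float]:
    # Adds a term for retained options; this is where a crude autonomy
    # preference could live, if the hypothesis space allows it.
    return lambda o: (w_money * o["money"] + w_health * o["health"]
                      + w_autonomy * o["options_retained"])

lock_in = {"money": 1.0, "health": 1.0, "options_retained": 0.1}
keep_open = {"money": 0.9, "health": 1.0, "options_retained": 1.0}

# A concrete hypothesis with nonnegative weights cannot prefer keep_open here...
print(concrete_hypothesis(1.0, 1.0)(lock_in) > concrete_hypothesis(1.0, 1.0)(keep_open))  # True
# ...while an abstract hypothesis with even a small autonomy weight can.
print(abstract_hypothesis(1.0, 1.0, 0.2)(keep_open) > abstract_hypothesis(1.0, 1.0, 0.2)(lock_in))  # True
```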
I don’t think the difference is pessimism about ontology identification per se. Your overall approach (if successful) seems like it would do zero-shot transfer learning. My perspective would be something like: OK, let’s try to understand when we can do zero-shot transfer learning at all, and what assumptions we need to rely on (incidentally, I am also pessimistic about this). You are instead focusing on a different simplification of the problem, one which (I feel) is less likely to be connected to the most important underlying difficulties, and less likely to quickly provide information about whether the overall approach can work.