However, this also feels different from corrigibility, in that it feels more like a limitation put on the AI system, while corrigibility seems more like a property of the AI’s “motivational system”. This might be fine, since the AI might just not be goal-directed. One other benefit of corrigibility is that if you are “somewhat” corrigible, then you would like to become more corrigible, since that is what the human would prefer; informed-consent-AI doesn’t seem to have an analogous benefit.
You could definitely think of it as a limitation to put on a system, but I actually wasn’t thinking of it that way when I wrote the post. I was trying to imagine something which only operates from this principle. Granted, I didn’t really explain how that could work. I was imagining that it does something like sample from a probability distribution which is (speaking intuitively) the intersection of what you expect it to do and what you would like it to do.
(It now seems to me that although I put “non-consequentialist” in the title of the post, I didn’t explain the part where it isn’t consequentialist very well. Which is fine, since the post was very much just spitballing.)
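To make the “intersection” idea concrete, here is a minimal sketch (Python; the discrete policy set, the probabilities, and the choice of a renormalized elementwise product as the reading of “intersection” are all illustrative assumptions, not anything specified in the post):

```python
import numpy as np

# Toy illustration: the agent considers a small, discrete set of candidate
# policies. Both distributions below are made up for the example.
candidate_policies = ["drive to the store", "order delivery", "build a nanofactory"]

# p_expect: how likely the user thinks the agent is to act this way.
# p_want:   how much the user would like the agent to act this way.
p_expect = np.array([0.6, 0.4, 0.0])   # the user does not expect nanotech
p_want   = np.array([0.5, 0.3, 0.2])

# "Intersection" of the two distributions, read here as the renormalized
# elementwise product: a policy gets mass only if it is BOTH expected and
# wanted, so anything the user does not expect is never sampled at all.
joint = p_expect * p_want
joint /= joint.sum()

rng = np.random.default_rng(0)
print(rng.choice(candidate_policies, p=joint))  # never "build a nanofactory"
```

The only point of the sketch is that nothing in it searches for a best policy; the “intersection” itself does all the work.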
I was imagining that it does something like sample from a probability distribution which is (speaking intuitively) the intersection of what you expect it to do and what you would like it to do.
When I use the word “limitation”, I intend to include things like this. I think the way I’m using it is something like “a method of restricting possible behaviors that is not implemented through changing the ‘motivational system’”. The idea is that if the AI is badly motivated, and our limitation doesn’t eliminate all of the bad behaviors (a very strong condition that seems hard to meet), then the AI will likely end up picking one of the bad behaviors. This is similar in spirit to the ideas behind nearest unblocked strategies.
Here, we are eliminating bad behaviors by choosing a particular set of behaviors (ones that violate your expectations of what the robot will do) and not considering them at all. This seems like it is not about the “motivational system”, and if this were implemented in a robot that does have a separate “motivational system” (i.e. it is goal-directed), I worry about a nearest unblocked strategy.
This seems like it is not about the “motivational system”, and if this were implemented in a robot that does have a separate “motivational system” (i.e. it is goal-directed), I worry about a nearest unblocked strategy.
I am confused about where you think the motivation system comes into my statement. It sounds like you are imagining that what I said is a constraint, which could somehow be coupled with a separate motivation system. If that’s your interpretation, that’s not what I meant at all, unless random sampling counts as a motivation system. I’m saying that all you do is sample from what’s consented to.
But, maybe what you are saying is that in “the intersection of what the user expects and what the user wants”, the first is functioning as a constraint, and the second is functioning as a motivation system (basically the usual IRL motivation system). If that’s what you meant, I think that’s a valid concern. What I was imagining is that you are trying to infer “what the user wants” not in terms of end goals, but rather in terms of actions (really, policies) for the AI. So, it is more like an approval-directed agent to an extent. If the human says “get me groceries”, the job of the AI is not to infer the end state the human is asking the robot to optimize for, but rather, to infer the set of policies which the human is trying to point at.
There’s no optimization running on top of this that would find perverse instantiations of the constraints; the AI just follows the policy which it infers the human would like. Of course, the powerful learning system required for this to work may perversely instantiate these beliefs (i.e., there may be daemons, a.k.a. inner optimizers).
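A minimal sketch of the distinction being drawn here (Python; infer_policy_distribution and its outputs are invented stand-ins for whatever learned inference is being imagined): the agent infers a distribution over policies the human is pointing at and simply follows one, rather than inferring a goal and then searching for the policy that best achieves it.

```python
import numpy as np

# Hypothetical stand-in for a learned model that maps a human instruction to
# a distribution over concrete policies the human is plausibly pointing at --
# NOT to a goal or reward to be optimized. All values are made up.
def infer_policy_distribution(instruction: str) -> dict[str, float]:
    if instruction == "get me groceries":
        return {
            "walk to the store and buy the usual list": 0.7,
            "order the usual list online": 0.3,
            # Policies the human would never point at (e.g. anything involving
            # nanotechnology) simply get no mass here; they are not pruned by
            # a constraint wrapped around a separate optimizer.
        }
    return {}

def act(instruction: str) -> str:
    dist = infer_policy_distribution(instruction)
    policies = list(dist.keys())
    probs = np.array(list(dist.values()))
    rng = np.random.default_rng(0)
    # Follow a policy the human seems to be pointing at; there is no further
    # search for whichever policy best achieves some inferred end state.
    return rng.choice(policies, p=probs)

print(act("get me groceries"))
```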
(The most obvious problem I see with this approach is that it seems to imply that the AI can’t help the human do anything which the human doesn’t already know how to do. For example, if you don’t know how to get started filing your taxes, then the robot can’t help you. But maybe there’s some way to differentiate between more benign cases like that and less benign cases like using nanotechnology to more effectively get groceries?)
A third interpretation of your concern is that you’re saying that if the thing is doing well enough to get groceries, there has to be powerful optimization somewhere, and wherever it is, it’s going to be pushing toward perverse instantiations one way or another. I don’t have any argument against this concern, but I think it mostly amounts to a concern about inner optimizers.
(I feel compelled to mention again that I don’t feel strongly that the whole idea makes any sense. I just want to convey why I don’t think it’s about constraining an underlying motivation system.)
But, maybe what you are saying is that in “the intersection of what the user expects and what the user wants”, the first is functioning as a constraint, and the second is functioning as a motivation system (basically the usual IRL motivation system).
This is basically what I meant. Thanks for clarifying that you meant something else.
The most obvious problem I see with this approach is that it seems to imply that the AI can’t help the human do anything which the human doesn’t already know how to do.
Yeah, this is my concern with the thing you actually meant. (It’s also why I incorrectly assumed that “what the user wants” was meant to be goal-directed optimization, as opposed to being about policies the user approves of.) It could work combined with something like amplification, where you get to assume that the overseer is smarter than the agent, but then it’s not clear whether the part about “what the user expects” buys you anything over the “what the user wants” part.
A third interpretation of your concern is that you’re saying that if the thing is doing well enough to get groceries, there has to be powerful optimization somewhere, and wherever it is, it’s going to be pushing toward perverse instantiations one way or another. I don’t have any argument against this concern, but I think it mostly amounts to a concern about inner optimizers.
This does seem like a concern, but it wasn’t the one I was thinking about. It also seems like a concern about basically any existing proposal. Usually when talking about concerns I don’t bring up the ones that are always concerns, unless someone explicitly claims that their solution obviates that concern.