One approach to low-impact AI might be to pair an AGI system with a human supervisor who gives it explicit instructions about where it is permitted to continue. I have proposed a kind of “decision paralysis” in which, given multiple conflicting goals, a multi-objective agent would simply choose not to act (I’m not the first or only one to describe this kind of conservatism, but I don’t recall the framing others have used). In this case, the objectives might be the primary objective and then your low-impact objective.
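To make the “decision paralysis” idea concrete, here is a minimal sketch in Python. Everything in it (the scoring functions, the thresholds, the convention of abstaining by returning None) is an illustrative assumption, not a reference to any existing system:

```python
# Minimal sketch of the "decision paralysis" idea: a multi-objective agent
# that abstains rather than trade one objective off against the other.
# All names and thresholds here are illustrative assumptions.

from typing import Callable, Optional, Sequence


def choose_action(
    actions: Sequence[str],
    primary_score: Callable[[str], float],   # higher = better for the primary objective
    impact_penalty: Callable[[str], float],  # higher = larger estimated side effects
    min_primary: float = 0.5,
    max_impact: float = 0.1,
) -> Optional[str]:
    """Return an action only if it is acceptable to *both* objectives.

    If every action that serves the primary objective exceeds the impact
    budget, the agent abstains (returns None) instead of acting.
    """
    acceptable = [
        a for a in actions
        if primary_score(a) >= min_primary and impact_penalty(a) <= max_impact
    ]
    if not acceptable:
        return None  # decision paralysis: no action satisfies all objectives
    return max(acceptable, key=primary_score)
```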
This might be a way forward for dealing with your “High-Impact Interference” problem. Perhaps preventing an agent from engaging in high-impact interference is a necessary part of safe AI. When fulfilling the primary objective seems to require high-impact interference, a safe AI might report to a human supervisor that it cannot proceed because of a particular side effect. The human supervisor could then decide whether the system should proceed. If the supervisor judges that it should, they can re-specify the objective to permit the potential side effect, by making it part of the primary objective itself.
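A rough, self-contained continuation of the sketch above for that supervisor workflow. The console prompt stands in for whatever reporting and approval channel a real deployment would use; the names and structure are purely hypothetical:

```python
# Hypothetical escalation loop reusing choose_action from the previous sketch.
# Instead of acting under a blocking side effect, the system reports it and
# only proceeds if the supervisor re-specifies the objective.

def run_with_supervision(actions, primary_score, impact_penalty, max_impact=0.1):
    action = choose_action(actions, primary_score, impact_penalty, max_impact=max_impact)
    if action is not None:
        return action

    # No action met both objectives: report instead of acting.
    best = max(actions, key=primary_score)
    print(f"Cannot proceed: the best plan ({best!r}) has estimated impact "
          f"{impact_penalty(best):.2f}, above the budget of {max_impact}.")
    if input("Supervisor, permit this side effect? [y/N] ").strip().lower() != "y":
        return None

    # Re-specification: the permitted side effect is folded into the primary
    # objective, so the low-impact objective no longer vetoes it.
    def revised_penalty(a):
        return 0.0 if a == best else impact_penalty(a)

    return choose_action(actions, primary_score, revised_penalty, max_impact=max_impact)
```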
Hi Ben, I like the idea; however, almost every decision has conflicting outcomes, e.g., regarding opportunity cost. As I understand you, this would delegate almost every decision to humans if you take the premise “I can’t do X if I choose to do Y” seriously. The application to high-impact interference therefore seems promising if the system is limited to deciding on only a few things. The question then becomes whether a human can understand the plan that an AGI is capable of making. IMO this ties nicely into, e.g., ELK and interpretability research, but also the problem of predictability.
The next thing I want to suggest is that the system uses human resolutions of conflicting outcomes to train itself to predict how a human would resolve a conflict; if its confidence in that prediction exceeds a suitable threshold, it goes ahead and acts without human intervention. But any prediction of how a human would decide could still be second-guessed by a human pointing out where the prediction is wrong.
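As a toy sketch of what I mean, assuming conflicts can be bucketed into a hashable “conflict type” (a real system would obviously need a much richer representation than a string label):

```python
# Toy sketch: learn from past human resolutions and only act autonomously
# when the predicted approval is confident enough. Purely illustrative.

from collections import defaultdict


class HumanResolutionPredictor:
    def __init__(self, confidence_threshold: float = 0.95):
        self.threshold = confidence_threshold
        self.counts = defaultdict(lambda: [0, 0])  # conflict_type -> [approvals, total]

    def record(self, conflict_type: str, human_approved: bool) -> None:
        """Store how the human actually resolved this kind of conflict."""
        approvals, total = self.counts[conflict_type]
        self.counts[conflict_type] = [approvals + int(human_approved), total + 1]

    def predicted_approval(self, conflict_type: str) -> float:
        approvals, total = self.counts[conflict_type]
        # Laplace smoothing so unseen conflict types start at 0.5, not 0 or 1.
        return (approvals + 1) / (total + 2)

    def should_act_without_asking(self, conflict_type: str) -> bool:
        """Act autonomously only when the predicted approval is confident enough."""
        return self.predicted_approval(conflict_type) >= self.threshold
```

The point is only the control flow: ask when unsure, act when the learned prediction is confident, and keep every autonomous decision open to later human correction.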
Agreed that whether a human can understand the plan (and all the relevant outcomes; which outcomes are relevant?) is important and harder than I first imagined.
I think this threshold will be tough to set. IMO, confidence in a decision only really makes sense if you consider decisions to be uni-modal. I would argue that this is rarely the case for a sufficiently capable system (like you and me). We are constantly trading off multiple options, and thus the confidence (e.g., as measured by the log-likelihood of the action given a policy and state) depends on the number of options available. I expect this context dependence would be a tough nut to crack in order to get a meaningful threshold.
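A small numerical illustration of that context dependence, using a generic softmax policy rather than any particular system: the log-likelihood of the chosen action drops as the number of comparably good alternatives grows, even though nothing about the best action itself has changed.

```python
# Illustration: log-likelihood of the best action under a softmax policy
# shrinks purely because more near-equivalent options are available.

import math


def softmax_logprob_of_best(best_value: float, other_values: list[float]) -> float:
    values = [best_value] + other_values
    log_z = math.log(sum(math.exp(v) for v in values))
    return best_value - log_z


print(softmax_logprob_of_best(1.0, [0.0]))        # one alternative:   ~ -0.31
print(softmax_logprob_of_best(1.0, [0.9] * 10))   # ten near-ties:     ~ -2.31
```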