Hi Ben, I like the idea; however, almost every decision has conflicting outcomes, e.g., with respect to opportunity cost. As I understand you, this would delegate almost every decision to humans if you take the premise of "I can't do X if I choose to do Y" seriously. The application to high-impact interference therefore seems promising if the system is limited to deciding on only a few things. The question then becomes whether a human can understand the plan that an AGI is capable of making. IMO this ties nicely into, e.g., ELK and interpretability research, but also the problem of predictability.
The next thing I want to suggest is that the system uses human resolutions of conflicting outcomes to train itself to predict how a human would resolve a conflict; if its confidence exceeds a suitable threshold, it goes ahead and acts without human intervention. But any prediction of how a human would decide could still be second-guessed by a human pointing out where the prediction is wrong.
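To make the shape of that gating concrete, here is a minimal Python sketch of what I mean; everything in it (the Conflict type, ask_human, toy_predictor, the 0.95 threshold) is a hypothetical placeholder for illustration, not part of the actual proposal:

```python
import random
from dataclasses import dataclass

@dataclass
class Conflict:
    description: str
    options: list[str]

# Human resolutions collected so far; used to (re)train the predictor.
training_data: list[tuple[Conflict, str]] = []

def ask_human(conflict: Conflict) -> str:
    """Stand-in for an actual human query; here we just pick the first option."""
    return conflict.options[0]

def toy_predictor(conflict: Conflict) -> tuple[str, float]:
    """Stand-in for a model trained on training_data to imitate human resolutions."""
    choice = random.choice(conflict.options)
    confidence = random.uniform(0.5, 1.0)
    return choice, confidence

def resolve(conflict: Conflict, predictor=toy_predictor, threshold: float = 0.95):
    """Act autonomously only when the predicted human resolution is confident
    enough; otherwise defer to a human and keep the answer as training data."""
    predicted_choice, confidence = predictor(conflict)
    if confidence >= threshold:
        # Act without human intervention; the prediction can still be
        # second-guessed by a human after the fact.
        return predicted_choice, "autonomous"
    human_choice = ask_human(conflict)
    training_data.append((conflict, human_choice))
    return human_choice, "deferred"

print(resolve(Conflict("clean the kitchen vs. cook dinner", ["clean", "cook"])))
```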
Agreed that whether a human can understand the plan (and all the relevant outcomes; which outcomes are relevant?) is important, and harder than I first imagined.
I think this threshold will be tough to set. Confidence in a decision IMO only really makes sense if you consider decisions to be uni-modal, and I would argue that this is rarely the case for a sufficiently capable system (like you and me). We are constantly trading off multiple options, so the confidence (e.g., as measured by the log-likelihood of the action given a policy and state) depends on the number of options available. I expect this context dependence would be a tough nut to crack if you want a meaningful threshold.
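A toy numerical illustration of the point (the softmax policy and the specific values are my own assumptions, purely for illustration): the log-likelihood the policy assigns to its preferred action drops as more comparably good options become available, even though the decision is no "worse", so a fixed threshold means different things in different contexts.

```python
import numpy as np

def action_log_likelihood(values, temperature=1.0):
    """Log-probability of the highest-valued action under a softmax policy."""
    logits = np.array(values) / temperature
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return log_probs[np.argmax(values)]

# Same best option in both cases, but different numbers of close alternatives.
few_options = [1.0, 0.2]                    # near-binary decision
many_options = [1.0, 0.9, 0.9, 0.8, 0.8]    # several almost-as-good options

print(action_log_likelihood(few_options))   # about -0.37 (looks "confident")
print(action_log_likelihood(many_options))  # about -1.49 (looks "unconfident")
```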