(BTW Paul, if you’re reading this, Issa and I and a few others have been chatting about this on MIRIxDiscord. I’m sure you’re more than welcome to join if you’re interested, but I figured you probably don’t have time for it. PM me if you do want an invite.)
Issa, I think my current understanding of what Paul means is roughly the same as yours, and I also share your confusion about “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy”.
To summarize my own understanding (quoting myself from the Discord), what Paul means by “satisfying short-term preferences-on-reflection” seems to cash out as “do the action for which the AI can produce an explanation such that a hypothetical human would evaluate it as good (possibly using other AI assistants), with the evaluation procedure itself being the result of a hypothetical deliberation which is controlled by the preferences-for-deliberation that the AI learned/inferred from a real human.”
(I still have other confusions around this. For example, is the “hypothetical human” here (the human being predicted in Issa’s 3) a hypothetical end user evaluating the action based on what they themselves want, or a hypothetical overseer evaluating the action based on what the overseer thinks the end user wants? Or is the “hypothetical human” just a metaphor for some abstract, distributed, or not recognizably human deliberative/evaluative process at this point?)
Thanks to Wei Dai for bringing up this distinction, although I may have misunderstood the actual distinction.
I think maybe it would make sense to further break (6) down into two sub-dimensions: (6a) understandable vs. evaluable and (6b) how much AI assistance. “Understandable” means the human achieves an understanding of the (outer/main) AI’s rationale for action within their own brain, with or without (other) AI assistance (which can, for example, answer questions for the human or give video lectures). And “evaluable” means the human runs or participates in a procedure that returns a score for how good the action is, but doesn’t necessarily achieve a holistic understanding of the rationale in their own brain. (If the external procedure involves other real or hypothetical humans, then it gets fuzzy, but basically I want to rule out Chinese Room scenarios as “understandable”.) Based on https://ai-alignment.com/concrete-approval-directed-agents-89e247df7f1b I’m guessing Paul has “evaluable” and “with AI assistance” in mind here. (In other words, I agree with what you mean by “long in sense (6)”.)