I was really surprised that the “background problem” in value learning is almost the same problem as one that appears in some formulations of bounded rationality. In the information-theoretic bounded rationality formalism, the bounded agent acts based on a combination of a prior (representing previous knowledge) and utilities (what the agent wants). (It seems that in some cases of updating humans, it is possible to disentangle the two.)
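To make that concrete, here is a minimal sketch of the standard information-theoretic bounded rationality rule, where the policy is the prior reweighted by exponentiated utility, pi(a) proportional to p0(a) * exp(beta * U(a)). The function name, the toy numbers, and the choice of Python are my own illustration, not anything from the post.

```python
import numpy as np

def bounded_rational_policy(prior, utilities, beta):
    """Information-theoretic bounded rationality: reweight the prior over
    actions by exponentiated utility, then renormalize.

    beta -> 0 recovers the prior (no deliberation);
    beta -> infinity recovers pure utility maximization.
    """
    weights = prior * np.exp(beta * utilities)
    return weights / weights.sum()

# Toy example: the prior leans toward action 0, the utilities toward action 2.
prior = np.array([0.5, 0.3, 0.2])      # previous knowledge
utilities = np.array([0.0, 0.2, 1.0])  # what the agent wants
print(bounded_rational_policy(prior, utilities, beta=2.0))
```

The feature relevant to value learning is that only the combined policy is observable, so recovering either component on its own requires disentangling the two.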
While the “counterexamples” to “optimizing human utility according to AI beliefs” show how that approach fails in somewhat tricky cases, it seems to me it will be easy to find “counterexamples” where a “policy-approval agent” would fail as well (as compared to what is intuitively good).
From an “engineering perspective”, if I were forced to choose something right now, it would be an AI “optimizing human utility according to AI beliefs” but asking for clarification whenever that choice diverges too much from the “policy-approval” choice.
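Purely as an illustration of that hybrid rule: pick the action that maximizes estimated human utility under the AI’s beliefs, but hand the decision back to the human when that action loses too much under the policy-approval criterion. The function names, the approval-gap measure of divergence, and the threshold below are all hypothetical placeholders, not anything proposed in the post.

```python
def choose_action(actions, expected_utility, approval_score, ask_human,
                  divergence_threshold=0.2):
    """Optimize estimated human utility under the AI's beliefs, but ask the
    human for clarification when that choice scores much worse under the
    policy-approval criterion than the approval-optimal action does.

    expected_utility(a): AI's estimate of human utility for action a.
    approval_score(a):   how strongly the human's policy would approve of a.
    ask_human(a, b):     fallback that defers the choice to the human.
    """
    utility_choice = max(actions, key=expected_utility)
    approval_choice = max(actions, key=approval_score)

    # Divergence: how much approval is given up by taking the utility-optimal action.
    approval_gap = approval_score(approval_choice) - approval_score(utility_choice)
    if approval_gap > divergence_threshold:
        return ask_human(utility_choice, approval_choice)
    return utility_choice


# Toy usage with made-up numbers: utility and approval disagree, so the agent asks.
actions = ["a", "b", "c"]
utilities = {"a": 1.0, "b": 0.6, "c": 0.2}
approvals = {"a": 0.1, "b": 0.9, "c": 0.8}
print(choose_action(actions, utilities.get, approvals.get,
                    ask_human=lambda u, p: f"clarify: {u} vs {p}"))
```

Measuring divergence as the approval gap is only one of several possible choices; the threshold is essentially a knob for how much the two criteria are trusted relative to each other.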
While the “counterexamples” to “optimizing human utility according to AI beliefs” show how that approach fails in somewhat tricky cases, it seems to me it will be easy to find “counterexamples” where a “policy-approval agent” would fail as well (as compared to what is intuitively good).
I agree that it’ll be easy to find counterexamples to policy-approval, but I think it’ll be harder than it is for value-alignment agents. We have the advantage that (in the limited sense provided by the assumption that the human has a coherent probability distribution and utility function) we can prove that we “do what the human would want” (in a more comprehensive sense than we can for value alignment).