While the “counterexamples” to “optimizing human utility according to AI belief” show how that approach fails in somewhat tricky cases, it seems to me it will be easy to find “counterexamples” where a “policy-approval agent” would fail (as compared to what is intuitively good).
I agree that it’ll be easy to find counterexamples to policy-approval, but I think it’ll be harder than for value-alignment agents. We have the advantage that (in the limited sense provided by the assumption that the human has a coherent probability distribution and utility function) we can prove that we “do what the human would want” (in a more comprehensive sense than we can for value alignment).
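To make the contrast concrete (a rough sketch, under the idealization being leaned on here that the human really does have a coherent probability distribution $P_H$ and utility function $U_H$, with $P_{AI}$ standing in for the AI's beliefs): as I understand it, a value-alignment agent optimizes the human's utility under the AI's beliefs, whereas a policy-approval agent optimizes it under the human's beliefs:

$$\pi_{\text{value}} \;=\; \arg\max_{\pi}\, \mathbb{E}_{P_{AI}}\!\left[\,U_H \mid \pi\,\right], \qquad \pi_{\text{approval}} \;=\; \arg\max_{\pi}\, \mathbb{E}_{P_H}\!\left[\,U_H \mid \pi\,\right].$$

The “do what the human would want” guarantee is then just that, by construction, the second expression picks out the policy the human themselves would rank highest given their own $P_H$ and $U_H$; whether that matches what is intuitively good is exactly where counterexamples would have to bite.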