Ah, I agree that this proposal may have better ways than value learning does to relax the assumption that the human has a utility function. I wanted to focus on the simpler case here. Perhaps I’ll write a follow-up post considering the generalization.
Maybe I’ll try to insert into this post an example where the policy-approval agent does something the human wouldn’t, though.
Here’s a first stab: suppose that the AI has a subroutine which solves complex planning problems. Furthermore, the human trusts the subroutine (does not expect it to be cleverly choosing plans which solve the problems as stated but cause other problems). The human is smart enough to formulate day-to-day management problems which arise at work as formally-specified planning problems, and would like to be told what the answers to those problems are. In this case, the AI will tell the human those answers.
This also illustrates a limited way the policy-approval agent can avoid over-optimizing simplified problem statements: if the human does not trust the planning subroutine (expects it to Goodhart, or some such), then the AI will not use such a subroutine.
(This isn’t maximally satisfactory, since the human may easily be mistaken about what subroutines to trust. I think the AI can do a little better than this, but maybe not in a way which addresses the fundamental issue.)
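To make the trust-gating idea above concrete, here is a minimal toy sketch. It is not from the original post; all names (`Policy`, `human_approval`, the hard-coded approval scores) are hypothetical stand-ins, assuming only that the policy-approval agent picks the policy the human most approves of under the human’s own beliefs about what that policy will do.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Policy:
    """A candidate policy the agent could follow (hypothetical illustration)."""
    name: str
    uses_planning_subroutine: bool


def human_approval(policy: Policy, human_trusts_subroutine: bool) -> float:
    """Stand-in for the human's approval of a policy, evaluated under the
    human's own beliefs about what following that policy would lead to.

    If the human distrusts the planning subroutine (e.g. expects it to
    Goodhart the problem as stated), any policy relying on it scores low.
    The numeric scores are arbitrary placeholders.
    """
    if policy.uses_planning_subroutine and not human_trusts_subroutine:
        return 0.1  # human expects Goodharting, so low approval
    if policy.uses_planning_subroutine:
        return 0.9  # trusted subroutine answers the human's planning problems
    return 0.5      # answering without the subroutine: middling approval


def policy_approval_agent(policies: List[Policy],
                          human_trusts_subroutine: bool) -> Policy:
    """Pick whichever policy the human most approves of."""
    return max(policies, key=lambda p: human_approval(p, human_trusts_subroutine))


candidates = [
    Policy("answer using the planning subroutine", uses_planning_subroutine=True),
    Policy("answer without the subroutine", uses_planning_subroutine=False),
]

# With trust, the subroutine gets used; without it, the agent avoids it.
print(policy_approval_agent(candidates, human_trusts_subroutine=True).name)
print(policy_approval_agent(candidates, human_trusts_subroutine=False).name)
```

The point of the sketch is just that the decision to use the subroutine is routed through the human’s expectations about it, rather than through the AI’s own assessment of whether the subroutine is safe.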