It would be useful to see an example of a policy-approval agent doing something a human wouldn’t, and of what efficiency gains the policy-approval agent has over a human acting on their own.
I feel that the formulation “the humans have a utility function” may obscure part of what’s going on. Part of the advantage of approval agents is that they allow humans to express their sometimes-incoherent meta-preferences as well (“yeah, I want to do X, but don’t force me to do it”). Assuming the human’s preferences are already coherent reduces the appeal of the approach.
Ah, I agree that this proposal may have better ways to relax the assumption that the human has a utility function than value-learning does. I wanted to focus on the simpler case here. Perhaps I’ll write a follow-up post considering the generalization.
Maybe I’ll try to insert an example where the policy approval agent does something the human wouldn’t into this post, though.
Here’s a first stab: suppose the AI has a subroutine which solves complex planning problems. Furthermore, the human trusts the subroutine (does not expect it to cleverly choose plans which solve the problems as stated but cause other problems). The human is smart enough to formulate the day-to-day management problems which arise at work as formally-specified planning problems, and would like to be told what the answers to those problems are. In this case, the AI will tell the human those answers.
This also illustrates a limited way the policy-approval agent can avoid over-optimizing simplified problem statements: if the human does not trust the planning subroutine (expects it to Goodhart, or some such), then the AI will not use that subroutine.
(This isn’t maximally satisfactory, since the human may easily be mistaken about which subroutines to trust. I think the AI can do a little better than this, but maybe not in a way which addresses the fundamental issue.)
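To make the trust condition concrete, here’s a minimal toy sketch (my own illustration, not from the post, with made-up names and numbers): the agent scores each candidate policy by its expected approval under the human’s beliefs about what the policy does, not under the AI’s own predictions, so an untrusted planning subroutine never gets selected.

```python
# Toy sketch: a policy-approval agent picks the policy the human would rate
# highest under the *human's* beliefs, so a subroutine the human expects to
# Goodhart is never used, even if the AI itself thinks it would work.

def choose_policy(policies, human_approval, human_beliefs):
    """Return the policy with the highest human-expected approval.

    policies:        candidate policies
    human_approval:  human_approval(policy, outcome) -> score in [0, 1]
    human_beliefs:   human_beliefs(policy) -> list of (probability, outcome)
                     pairs describing what the human expects the policy to do
    """
    def expected_approval(policy):
        return sum(p * human_approval(policy, outcome)
                   for p, outcome in human_beliefs(policy))
    return max(policies, key=expected_approval)


# Example: the human expects the planner to Goodhart most of the time,
# so delegating to it scores poorly in the human's model.
policies = ["delegate_to_planner", "answer_directly"]

def human_beliefs(policy):
    if policy == "delegate_to_planner":
        return [(0.4, "good_plan"), (0.6, "goodharted_plan")]
    return [(1.0, "ordinary_answer")]

def human_approval(policy, outcome):
    return {"good_plan": 1.0,
            "goodharted_plan": 0.0,
            "ordinary_answer": 0.7}[outcome]

print(choose_policy(policies, human_approval, human_beliefs))
# -> "answer_directly"
```

The point the sketch tries to capture is just that the expectation is taken with respect to human_beliefs; if the human instead trusted the planner (say, a 95% chance of a good plan), the same agent would happily delegate to it.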
Iterated distillation and amplification (IDA) seems like an example of something in the spirit of policy approval, and it could do lots of things a human is unable to, such as becoming really good at chess or Go. (You can imagine removing the distillation steps if those seem too different from policy approval, and the point still applies.)
I think there are interesting connections between HCH/IDA and policy approval, which I hope to write more about some time.