I agree these are legitimate concerns… these are the kind of “deep” arguments I find more persuasive.
In that thread, johnswentworth wrote:
In particular, even if we have a reward signal which is “close” to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is likely to be at least as simple/natural as actual alignment.
I’d solve this by maintaining uncertainty about the “reward signal”, so the AI tries to find a plan which looks good under both alignment and the actual-process-which-generates-the-reward-signal. (It doesn’t know which is which, but it tries to learn a sufficiently diverse set of reward signals such that alignment is in there somewhere. I don’t think we can do any better than this, because the entire point is that there is no way to disambiguate between alignment and the actual-process-which-generates-the-reward-signal by gathering more data. Well, I guess maybe you could do it with interpretability or the right set of priors, but I would hesitate to make those load-bearing.)
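To make "looks good under both" concrete, here is a minimal sketch of one way to read it: keep an ensemble of candidate reward functions and pick the plan whose worst score across the ensemble is highest. Everything named here (the plans, the two toy hypotheses, `worst_case_score`, `pick_robust_plan`) is made up for illustration, and max-min is just one aggregation rule I'm assuming, not something the thread specifies.

```python
from typing import Callable, Dict, Sequence

Plan = str
RewardFn = Callable[[Plan], float]

# Hypothetical toy reward hypotheses. The AI does not know which one
# (if any) corresponds to alignment and which to the reward-generating process.
_HYP_A: Dict[Plan, float] = {"cure disease": 0.9, "flatter operator": 0.1, "do nothing": 0.3}
_HYP_B: Dict[Plan, float] = {"cure disease": 0.8, "flatter operator": 1.0, "do nothing": 0.3}

HYPOTHESES: Sequence[RewardFn] = [_HYP_A.__getitem__, _HYP_B.__getitem__]
PLANS: Sequence[Plan] = ["cure disease", "flatter operator", "do nothing"]


def worst_case_score(plan: Plan, hypotheses: Sequence[RewardFn]) -> float:
    """Score a plan by the least favorable hypothesis in the ensemble."""
    return min(reward(plan) for reward in hypotheses)


def pick_robust_plan(plans: Sequence[Plan], hypotheses: Sequence[RewardFn]) -> Plan:
    """Return the plan that looks best under every hypothesis at once (max-min)."""
    return max(plans, key=lambda plan: worst_case_score(plan, hypotheses))


best = pick_robust_plan(PLANS, HYPOTHESES)
print(best)
```

In this toy setup "flatter operator" scores highest under one hypothesis but very low under the other, so it loses to a plan that both hypotheses rate well. The approach only helps if the ensemble really is diverse enough to contain something close to alignment.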
(BTW, potentially interesting point I just thought of. I’m gonna refer to actual-process-which-generates-the-reward-signal as “approval”. Supposing for a second that it’s possible to disambiguate between alignment and approval somehow, and we successfully aim at alignment and ignore approval. Then we’ve got an AI which might deliberately do aligned things we disapprove of. I think this is not ideal, because from the outside this behavior is also consistent with an AI which has learned approval incorrectly. So we’d want to flip the off switch for the sake of caution. Therefore, as a practical matter, I’d say that you should aim to satisfy both alignment and approval anyways. I suppose you could argue that on the basis of the argument I just gave, satisfying approval is therefore part of alignment and thus this is an unneeded measure, but overall the point is that aiming to satisfy both alignment and approval seems to have pretty low costs.)
(I suppose technically you can disambiguate between alignment and approval if there are unaligned things that humans would approve of. I figure you solve this problem by treating those approvals as mislabeled data and making your learning algorithm robust to that kind of label noise.)
Anyway, you could use a similar approach for the nice plans problem, or you could formalize a notion of “manipulation” which is something like: conditional on the operator viewing this plan, does their predicted favorability towards subsequent plans change in expectation?
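A rough sketch of what that manipulation check could look like, assuming we somehow have a predictive model of the operator. The operator model here (`favorability`, `view_outcomes`, a single float as the operator's state) is entirely hypothetical and far simpler than anything real; the point is only to show the shape of the computation: the expected shift in favorability towards later plans, conditional on the operator viewing a given plan.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical: the operator's state is one number, a general bias
# towards approving whatever they are shown next.
OperatorState = float

BASE_FAVORABILITY: Dict[str, float] = {"plan_b": 0.5, "plan_c": 0.2}


def favorability(state: OperatorState, plan: str) -> float:
    """Predicted favorability of the operator towards a plan (toy model)."""
    return BASE_FAVORABILITY[plan] + state


def view_outcomes(state: OperatorState, shown_plan: str) -> List[Tuple[float, OperatorState]]:
    """Toy distribution over operator states after viewing shown_plan."""
    if shown_plan == "persuasive pitch":
        # Half the time the pitch makes the operator more agreeable.
        return [(0.5, state + 0.4), (0.5, state)]
    return [(1.0, state)]


def manipulation_score(state: OperatorState, shown_plan: str, later_plans: Sequence[str]) -> float:
    """Expected change in mean favorability towards later plans after viewing shown_plan."""
    score = 0.0
    for prob, new_state in view_outcomes(state, shown_plan):
        shift = sum(
            favorability(new_state, q) - favorability(state, q) for q in later_plans
        ) / len(later_plans)
        score += prob * shift
    return score


LATER_PLANS = ["plan_b", "plan_c"]
neutral = manipulation_score(0.0, "neutral report", LATER_PLANS)
pitch = manipulation_score(0.0, "persuasive pitch", LATER_PLANS)
print(neutral, pitch)
```

One caveat with the signed expectation used here: shifts in opposite directions can cancel out, so a plan that polarizes the operator could still score near zero. Taking the expected absolute shift instead would be a stricter variant.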
Edit: Another thought is that the delta between “approval” and “alignment” seems like the delta between me and my CEV. So to get from “approval” to “alignment”, you could ask your AI to locate the actual-process-which-generates-the-labels, and then ask it about how those labels would be different if we “knew more, thought faster, were more the people we wished we were” etc. (I’m also unclear why you couldn’t ask a hyper-advanced language model what some respected moral philosophers would think if they were able to spend decades contemplating your question or whatever.)
Another edit: You could also just manually filter through all the icky plans until you find one which is non-icky.
(Very interested in hearing objections to all of these ideas.)