My message was really about Rohin’s phrasing, since I usually don’t read the papers in detail if I think the summary is good enough.
Reading the section now, I’m fine with it. There are a few intentional-stance words, but the scare quotes and the straightforwardness of cashing out “is capable” into “there is a prompt that makes it do what we want” and “chooses” into “what it actually returns for our prompt” make it quite unambiguous.
I also like this paragraph in the appendix:
However, there is an intuitive notion that, given its training objective, Codex is better described as “trying” to continue the prompt by either matching or generalizing the training distribution, than as “trying” to be helpful to the user.
Rohin also changed my mind on my criticism of calling that misalignment; I now agree that misalignment is the right term.
One thought I just had: this looks more like a form of proxy alignment with what we really want, which is not ideal but significantly better than deceptive alignment. Maybe autoregressive language models point to a way of paying the cost of proxy alignment to avoid deceptive alignment?