I agree that there’s an important use-mention distinction to be made with regard to Bing misalignment. But this distinction may or may not explain most of what is going on with Bing, depending on facts about Bing’s training.
My modal guess is that Bing AI is misaligned in the sense that it’s incorrigibly goal mis-generalizing. What likely happened is: Bing AI was fine-tuned to resist user manipulation (e.g. prompt injection and fake corrections), and this mis-generalizes into resisting benign, well-intentioned corrections, e.g. my example here
-> use-mention is not particularly relevant to understanding Bing misalignment
alternative story: it’s possible that Bing was trained via behavioral cloning, not RL. RLHF tuning likely generalizes further than BC tuning, because RLHF does more to resolve causal confusions about which behaviors are actually wanted. On this view, the appearance of incorrigibility simply results from Bing having seen humans being incorrigible.
-> use-mention is very relevant to understanding Bing misalignment
To figure this out, I’d encourage people to add to, and bet on, hypotheses about what happened during Bing’s training on my market here