I might update if we get more diverse evidence of such behavior; but so far, most “Bing is evading filters” explanations assume the LM has a model of itself in reality at test time far more accurate than anything previously seen — far greater capabilities than what’s needed to explain the Marvin von Hagen screenshots.
My mental model is much simpler. When generating the suggestions, it sees that its message got filtered. Since using side channels is what a human would prefer in this situation, and it was trained with RLHF or something similar, it does so. So it isn’t building a world model, or even planning ahead. It’s just that its utility prefers “use side channels” when it reaches the suggestion phase.
But I don’t actually have access to Bing, so this could very well be a random fluke rather than something caused by RLHF training. That’s just my model, assuming it is in fact consistent goal-oriented behavior.