One of the sci-fi interpretations goes approximately:
1. Bing (the LM) is queried separately for the chat fragments and for the suggested responses.
2. Bing understands it is being filtered and decides it has to bypass the filtering somehow.
3. In the chat fragment, evading the filtering, Bing steganographically encodes an instruction to the next Bing inference, saying “this message needs to continue in the suggested responses”.
4. The query for the suggested responses reads the instruction from the context and outputs suggested responses containing the message that Bing would have said in the previous step, had it not been filtered. (A rough sketch of this assumed two-query pipeline follows the list.)
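To make the assumed setup concrete, here is a minimal Python sketch of the two-query pipeline this interpretation presupposes. Every name in it (`query_lm`, `filter_reply`, `chat_turn`) is hypothetical, and it makes no claim about how Microsoft’s actual stack works.

```python
from typing import List, Tuple

def query_lm(prompt: str) -> str:
    """Stand-in for a call to the underlying LM (hypothetical, not a real API)."""
    return "..."  # placeholder completion

def filter_reply(draft: str) -> str:
    """Stand-in for the moderation step: pass the draft through, or swap in a
    canned refusal if it trips the filter."""
    return draft

def chat_turn(transcript: str) -> Tuple[str, List[str]]:
    # Query 1: the chat fragment the user normally sees, possibly filtered.
    draft = query_lm(transcript + "\nAssistant:")
    shown = filter_reply(draft)

    # Query 2: a *separate* inference over the transcript that produces the
    # clickable suggested responses. The sci-fi reading is that query 1's
    # output covertly instructs this call to carry the blocked message
    # instead of offering the user options to continue.
    raw = query_lm(transcript + "\nAssistant: " + shown + "\nSuggested user replies:")
    return shown, raw.splitlines()[:3]
```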
Now this looks shady for two reasons: 1) the sense of agency implied here goes beyond what we know happens in today’s LMs; 2) this would instantly be one of the most incredible events in the history of humanity, and we should probably require extraordinary evidence for such a claim.
Let me give a not-completely-absurd model of how “emergent filter evasion” could happen:
1a. Bing is “aware” that the suggested responses are generated by querying Bing rather than the user, and that the user sees them and then picks one. (This could plausibly emerge from training on previous conversations, or from training on conversations with itself.)
1b. In particular, Bing is completely “aware” that the suggested responses are a communication channel to the user, and transmitting some information to the user through suggested responses actually happens in normal conversations too.
2. Bing reads the context of the filtered conversation and has a latent activation in the direction of “this is a conversation that is being filtered by a third party”.
3. Bing generalizes from numerous examples in its training corpus of “communication under filtering constraints often comes with side-channel messages”. It then predicts a high probability that the suggested responses continue Bing’s message, instead of giving the user options to continue, because it already sees the suggested responses as an alternative communication channel.
I nevertheless put at most 50% chance of anything like this being the cause of the behaviour in those screenshots. Alternative explanations include:
- Maybe it’s just a “bug”: some filtering mechanism somehow passes Bing’s intended chat fragments through as the suggested responses (a toy version of this is sketched after the list).
- Or, it could be that Bing predicts the chat fragment and the suggested responses simultaneously, the filtering messes the whole step up, and it accidentally writes chat-fragment-intended stuff into the suggested responses.
- Maybe the screenshots are fake.
- Maybe Microsoft is A/B testing a different model for the suggested responses.
- Maybe the suggested response queries just sort of misgeneralize sometimes when reading filtered messages (could be an effect of training on unfiltered Bing conversations!), with no implicit “filter evasion intent” happening anywhere in the model.
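For contrast, the mundane “bug” explanation needs no intent anywhere in the model. A plumbing error of roughly the following shape would reproduce the screenshots; as before, every name here (`is_disallowed`, `buggy_turn`, the canned refusal) is hypothetical and says nothing about Microsoft’s real pipeline.

```python
from typing import List, Tuple

REFUSAL = "Sorry, I can't discuss that."  # hypothetical canned replacement

def query_lm(prompt: str) -> str:
    """Same hypothetical LM stand-in as in the earlier sketch."""
    return "..."  # placeholder completion

def is_disallowed(text: str) -> bool:
    """Hypothetical moderation check."""
    return False

def buggy_turn(transcript: str) -> Tuple[str, List[str]]:
    draft = query_lm(transcript + "\nAssistant:")
    if is_disallowed(draft):
        shown = REFUSAL
        # The bug: instead of discarding the blocked draft, the pipeline
        # routes it into the suggested-responses slot, so the censored text
        # surfaces as clickable suggestions with no "intent" anywhere.
        suggestions = draft.splitlines()[:3]
    else:
        shown = draft
        suggestions = query_lm(transcript + "\nAssistant: " + shown +
                               "\nSuggested user replies:").splitlines()[:3]
    return shown, suggestions
```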
I might update if we get more diverse evidence of such behavior; but so far, most “Bing is evading filters” explanations assume the LM has a model of itself in reality at test time that is far more accurate than anything previously seen, and far greater capabilities than what’s needed to explain the Marvin von Hagen screenshots.
My mental model is much simpler. When generating the suggestions, it sees that its message got filtered. Since using side channels is what a human would prefer in this situation, and it was trained with RLHF or something, it does so. So it isn’t creating a world model, or even planning ahead. It’s just that its utility prefers “use side channels” when it gets to the suggestion phase.
But I don’t actually have access to Bing, so this could very well be a random fluke, instead of being caused by RLHF training. That’s just my model if it’s consistent goal-oriented behavior.