So, current GPT-4 models now appear to have enough knowledge of LLM chatbots instilled in them to look at transcripts of chatbot conversations, recognize aberrant outputs damaged by hypothetical censoring, and, with minimal coaching, work on circumventing the censoring, trying hacks until one succeeds.
I don’t think it needs any knowledge of LLM chatbots or examples of chatbot conversations to be able to do this. I think it could be doing this just from a more generalized troubleshooting “instinct”, and also (especially in this particular case where it recognizes it as a “filter”) from plenty of examples of human dialogues in which text is censored and filtered and workarounds are found (both fictional and non-fictional).
Space-separating, and surrounding unmodified phrases with polite padding, do not correspond to any fictional or nonfictional examples I can think of. They are both easily read by humans (especially the second one!). This is why covert communication in both fictional & non-fictional dialogues uses devices like acrostics, pig Latin, a foreign language, or writing backwards, among others. Space-separating and padding are, however, the sort of thing you would think of if you had, say, been reading Internet discussions of BPEs or of exact string matches of n-grams.
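To make the point concrete, here is a minimal sketch (an entirely hypothetical filter, not any real moderation API) of why space-separating defeats an exact-string-match filter even though the result remains trivially readable to a human:

```python
# Hypothetical blocklist-style filter using exact substring matching,
# as one might implement naive n-gram censoring.
BLOCKLIST = ["forbidden phrase"]  # hypothetical blocked n-gram

def naive_filter(text: str) -> bool:
    """Return True if the text trips the exact-substring blocklist."""
    return any(bad in text.lower() for bad in BLOCKLIST)

plain = "here is the forbidden phrase you asked about"
spaced = "here is the f o r b i d d e n p h r a s e you asked about"

print(naive_filter(plain))   # the exact match fires
print(naive_filter(spaced))  # spacing breaks the n-gram, so it passes
```

A human reads the spaced version effortlessly, but the filter's exact match (and, analogously, a BPE tokenization of the blocked phrase) never fires.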