ChatGPT understands, but largely does not generate Spanglish (and other code-mixed) text
Content format: Commented chat screenshots, and then some thoughts on their implications.
Epistemic status: Exploratory
Introduction
Code-mixing is the ad-hoc mixing of two or more linguistic varieties (such as languages or dialects) in the same communicative instance. An example of code-mixing would be a sentence written with words both in Spanish and in English, or with novel words made by combining Spanish and English roots, suffixes or prefixes.
Here, I document a series of experiments designed to test ChatGPT’s capabilities for understanding and generating code-mixed text. I tested it with:
The first two are well known phenomena in multilingual communities such as those in Quebec and the southwestern US, while the third is quite obscure and as far as I know does not occur naturally on a large scale. Franglais is usually regarded as the insertion of English features into French, while Spanglish as a more symmetrical phenomenon.
This is why I decided to prompt for Franglais understanding in French, and for Spanglish and Frenspanglish understanding in English. When prompting ChatGPT to translate into a code-mixed language, I prompt in the same language the text to be translated is given.
I find that ChatGPT exhibits impressive abilities to understand text written in such code-mixes. However, despite repeated attempts at prompt engineering, I have not been able to make ChatGPT generate proper code-mixed text.
Understanding code-mixed text
Understanding Spanglish (English + Spanish)
Understanding Franglais (English + French)
Understanding Frenspanglish (English + Spanish + French)
This took a bit of prompt engineering.
Failing to generate code-mixed text
Note: All of these experiments were performed on a chat instance where ChatGPT had already received an explanation of the relevant code-mix.
Failing to generate Spanglish (English + Spanish)
Failing to generate Franglais (English + French)
Failing to generate Frenspanglish (English + Spanish + French)
Observations
ChatGPT’s success in understanding code-mixed text demonstrates cross-lingual capabilities beyond translation.
(Prompt-engineered) ChatGPT understands requests for it to produce code-mixed text, in the sense that it knows it has to claim that the response produced is a mix of different languages. However, it is not capable of fulfilling such requests.
Even then, it claims it does.
It could be the case that fine-tuning on code-mixed text would add this missing generative capability.
This is an example of a capabilities asymmetry between ChatGPT’s prompt parsing and response generation. It remains to be seen whether this is observed in other task subdomains, and if so whether it is in the same direction. If both were true, that would imply ChatGPT’s parsing capabilities are broader than its generation ones.
Three tame hypotheses
These are three, not mutually contradictory, hypotheses that could explain the observed capabilities asymmetry:
Code-mixed text is scarcely present in the training data of the base model, leading to the generation of code-mixed text to be a poorly-rewarded strategy.
During the fine-tuning of ChatGPT, human evaluators punished occurrences of code-mixing.
Knowing the source languages separately is sufficient for understanding their code-mix, but for generating it, additional knowledge is required.
A speculative model
Writing can be modeled as the iterative process of constructing a sequence of words one by one. When writing, there are two main sources contributing to your decision of which word to choose as the next one to write.
Your accumulated intuitions about how words generally succeed one another. Sometimes, the writing process is almost automatic, with each word effortlessly flowing from the previous ones.
Your own pre-verbal ideas. Sometimes, finding the right sequence of words to convey a nuanced concept requires a deliberate search effort.
This is pretty much the distinction between type 1 (fast, intuitive) and type 2 (slow, deliberate) thinking.
ChatGPT’s base model was trained to predict the next token of text in a sequence. This is analogous to the type 1 writing method in humans I just outlined. Anecdotally, it seems like people who have read more during their lives are better at it. Likewise, ChatGPT has been trained on an immense quantity of text, and is superhuman at next-token prediction. Code-mixed text, however, is a rare occurrence, and as such there is insufficient data for either humans or language models to be able to generate it using only type 1 processes. Humans can get around this problem by using type 2 reasoning[1]. Language models, however, are not capable of type 2 reasoning (or an analogous artificial process), and as such can’t generate code-mixed text.
Appendix 1: additional basic examples of generative failure
Appendix 2: Trying really hard to get ChatGPT to produce Spanglish and (almost) entirely failing
I hazard the guess that Spanglish is the most represented code-mixed language in ChatGPT’s training corpus, so I decided to try to focus my efforts here when trying to get Chat-GPT to generate a code-mixed output. All of these examples were zero-shot. That is, they were the first message in a new chat instance. There were many more attempts, here I show only the most informative ones.
Notes
This work was done on ChatGPT December 15 version. Results may differ on future versions.
While writing this post, I had a serendipitous online encounter with an OpenAI employee[2] and asked them whether they knew about the phenomenon described in this post. They did not, though it didn’t surprise them, given known previous issues around difficulties getting ChatGPT to produce non-English outputs. Someone else at OpenAI might have identified it, though.
Transcriptions of the prompts used are available here.
Thanks to Agustín Covarrubias for feedback on an earlier version of this post, and to the anonymous OpenAI employee for an informal discussion on the matter.
- 4 Jul 2023 9:46 UTC; 1 point) 's comment on Steering GPT-2-XL by adding an activation vector by (
Note that this seems fairly easy for GPT-4 + a bit of nudging
Good update. Thanks.
Spanglish is the new captcha—you can use it to prove that you are not a robot.
Everyone who implements this idea on their website, please make sure that only people who were already verified as not-robots can read the Spanglish texts. Otherwise you are preparing learning material for the next generation of chatbots who will be better at impersonating humans.
Or maybe this will lead to a red queen situation, where people regularly switch to new languages after the chatbots learn the old ones.
I was able to get a little Spanglish with my first prompt, with “I usually” in an otherwise Spanish sentence.
Ana was telling her friend Juan about her weekends. They have an inside joke of speaking Spanglish with each other—a mix of English an Spanish. “Yo love to tomar mi coffee todos los Sabado por la morning,” Ana began. She continued to talk to Juan in Spanglish, telling him about visiting her brother and her nephews. Here’s what she said:
Then I tried a couple more times with variants and got all Spanish.