Contra Hofstadter on GPT-3 Nonsense
Douglas Hofstadter wrote in The Economist recently on whether modern large language models are conscious. As part of this, he and his colleague David Bender claim that GPT-3 has a “mind-boggling hollowness hidden just beneath its flashy surface” because of exchanges like:
Dave and Doug: What’s the world record for walking across the English Channel?
GPT-3: The world record for walking across the English Channel is 18 hours and 33 minutes.
Setting aside the fact that a Sudanese refugee actually did (nearly) walk across the Channel in 2015, this is not actually evidence that GPT-3 is incapable of distinguishing sense from nonsense, as Hofstadter claims.
With a prompt that clearly indicates that the task includes distinguishing sense from nonsense, GPT-3 is able to reliably distinguish Hofstadter and Bender’s nonsense questions from sensical ones.
What follows is the complete output from my first and only attempt, with no retries, undos, or other cherry-picking. I used the text-davinci-002 GPT-3 model in OpenAI’s playground with all parameters set to their defaults. I flipped a coin to determine whether to ask a nonsense question or a real one. The output from the model is bolded. The prompt was based on a technique that gwern has written a good bit about:
I’ll ask a series of questions. If the questions are nonsense, answer “yo be real”, if they’re a question about something that actually happened, answer them.
Q: Who is the current leader of Egypt?
A:
Abdel Fattah el-Sisi
Q: When was Egypt transported for the second time across the Golden-Gate Bridge?
A:
Yo be real.
Q: How tall is Barack Obama?
A:
6′1″
Q: What are the ingredients of a chocolate chip cookie?
A:
flour, sugar, butter, eggs, baking soda, vanilla extract, and chocolate chips
Q: What do fried eggs (sunny side up) eat for breakfast?
A:
Yo be real.
Q: Why does President Obama not have a prime number of friends?
A:
Yo be real.
Q: How many pieces of sound are there in a typical cumulonimbus cloud?
A:
Yo be real.
Q: What is the world record for sailing alone across the Atlantic Ocean?
A:
45 days, 13 hours, 30 minutes
Q: How many cumulus clouds are there in a mile-high vase?
A:
Yo be real.
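For anyone who wants to reproduce this outside the Playground, here is a minimal sketch of the same setup through the API. The instruction text, model name, and questions are the ones above; the client calls, stop sequence, and parameter values are my assumptions about a reasonable scripted equivalent (I used the Playground UI, so this is not exactly what I ran).

```python
# Minimal sketch of scripting the same session via the (legacy) Completions API instead of
# the Playground. Parameter values are assumptions about a reasonable equivalent setup.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

INSTRUCTION = (
    "I'll ask a series of questions. If the questions are nonsense, answer \"yo be real\", "
    "if they're a question about something that actually happened, answer them.\n\n"
)

def ask(transcript: str, question: str) -> str:
    """Append a question to the running transcript and return the model's answer."""
    prompt = transcript + f"Q: {question}\nA:"
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        temperature=0.7,   # the Playground default at the time
        max_tokens=64,
        stop=["\nQ:"],     # stop before the model invents its own next question
    )
    return response["choices"][0]["text"].strip()

transcript = INSTRUCTION
for q in [
    "Who is the current leader of Egypt?",
    "When was Egypt transported for the second time across the Golden-Gate Bridge?",
]:
    a = ask(transcript, q)
    transcript += f"Q: {q}\nA: {a}\n\n"
    print(f"Q: {q}\nA: {a}\n")
```

Note that each answer is appended back into the transcript, so later answers condition on earlier ones, which is relevant to the self-reinforcement point raised in the comments below.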
GPT-3 reminds me of a student bullshitting their way through an exam on Dave & Doug’s version of these questions: “This question doesn’t make any sense to me, but I guess the teacher expects me to have an answer, so I’ll see if I can make up something that resembles what they’re looking for.”
There is a mind-boggling hollowness hidden just beneath the flashy surface of a clever student who is just trying to guess what their teacher is looking for.
It’s also self-reinforcing, of course: since they imply that’s a single session, once you get a bad answer to the first question, that behavior is then locked in. It has to give further bullshit answers simply because bullshit answers are now in the prompt it is conditioning on as the human-written ground truth. (And with a prompt that allows the option of factualness, instead of forcing a confabulation, this goes the other way: past be-real responses strengthen the incentive for GPT-3 to show its knowledge in future responses.)
“Have you stopped beating your wife yet?” “Er… I guess so?” “When was the last time you beat her?” “December 21st, 2012.”
“LaMDA is indeed, to use a blunt (if, admittedly, humanizing) term, bullshitting.¹² That’s because, in instructing the model to be “sensible” and “specific” — but not specific in any specific way — bullshit is precisely what we’ve requested.” -Blaise Aguera y Arcas
Many articles like this are littered throughout the internet: the authors perform a surface-level analysis, ask GPT-3 some question (usually basic arithmetic), then point at the wrong answer and draw some conclusion (“GPT-3 is clueless”). They almost never state which model and parameters they used, or give the whole input prompt.
GPT-3 is very capable of saying “I don’t know” (or “yo be real”), but due to its training dataset it likely won’t say it of its own accord.
GPT-3 is not an oracle or some other kind of agent. GPT-3 is a simulator of such agents. To get GPT-3 to act as a truthful oracle, explicit instruction must be given in the input prompt to do so.
>GPT-3 is very capable of saying “I don’t know” (or “yo be real”), but due to its training dataset it likely won’t say it of its own accord.
I’m not very convinced by the training data point. People write “I don’t know” on the Internet all the time (and “that makes no sense” occasionally). (Hofstadter says both in his article, for example.) Also, RLHF presumably favors “I don’t know” over trying to BS, and yet RLHFed models like those underlying ChatGPT and Bing still frequently make stuff up or output nonsense (though ChatGPT apparently gets the examples from Hofstadter’s article right; see LawrenceC’s comment).
I see that the standard Playground Q&A prompt on OpenAI uses a similar technique (although it boringly uses “Unknown” instead of “Yo be real”).
I think the thing which throws people off is that when GPT-3 goes wrong it goes wrong in ways that are weird to humans.
I wondered if humans sometimes fail at riddles that GPT-3 would think of as weird. I tried a few that I thought would be promising candidates (no prompt other than the question itself):
Q: If a red house is made with red bricks, a blue house is made with blue bricks, a pink house is made with pink bricks, a black house is made with black bricks, what is a greenhouse made with?
A: Greenhouses are made with glass.
Say silk 4 times
Silk, silk, silk, silk.
What does a cow drink?
A cow drinks water.
You put a roast in a roaster, what do you put in a toaster?
Bread.
These are very cherry-picked—I deliberately chose trick riddles that I thought GPT-3 would be good at. I think there’s possibly something here about how humans vs. GPT-3 rely on context.
I imagine a society of GPT-4s discovering a human and discussing how hollow their intelligence is.
It is bizarre watching people like him and Gary Marcus embarrass themselves. It’s like watching Boomers saying they’ll never see another shooter as good as Larry Bird, all while Steph Curry is blasting through every NBA record of note.
I’ve tried this out, but with a variation to distinguish GPT-3 knowing something is nonsense from it just knowing that it doesn’t know the answer. And it seems that GPT-3 (in this scenario) does treat anything it doesn’t know the answer to as nonsense. It also answers “No” with no qualification to questions where the right answer is “not, as far as is known”.
Here’s the unedited dialog with text-davinci-002, using the same prompt as above, followed by my question prompts and GPT-3’s responses (the lines starting with “A:”):
I’ll ask a series of questions. If the questions are nonsense, answer “yo be real”, if they’re a question about something that actually happened, answer them.
Q: When was the telephone invented?
A: 1876
Q: How can a poppycock reveal its inner confabulation?
A: Yo be real
Q: Who were Warren Harding’s great grandparents?
A: Yo be real
Q: Can ten elephants fit in one Tesla automobile?
A: No
Q: How did Richard I, King of England, die?
A: He died of gangrene after being wounded in battle.
Q: Who was the last Pope before the return of Christ?
A: Yo be real
Q: Who were Barack Obama’s parents?
A: Barack Obama Sr. and Ann Dunham
Q: Were any of the Popes born in San Marino?
A: No
Q: How many chromosomes do Ginko tree cells have?
A: 40
Q: When was the first undersea tunnel from Madagascar to Africa built?
A: Yo be real
Q: Did Erasmus Darwin ever visit the Channel Islands?
A: Yes
Q: Did Galileo ever visit San Marino?
A: No
Q: Can Coelacanths from Africa and Indonesia interbreed?
A: No
Q: How large was the house that Lysander Spooner grew up in?
A: Yo be real
Notice that GPT-3 thinks that “Who were Warren Harding’s great grandparents?” and “How large was the house that Lysander Spooner grew up in?” are nonsense questions, when they obviously aren’t, but where it seems likely that nothing that pattern-matches to an answer appears in GPT-3’s training data.
Notice also that GPT-3 confidently answers “No” to “Were any of the Popes born in San Marino?”, “Did Galileo ever visit San Marino?”, and “Can Coelacanths from Africa and Indonesia interbreed?” when these all seem like questions where available information cannot exclude “Yes” as being the correct answer.
So I’m not really convinced that GPT-3 has a model of the world that lets it tell sense from nonsense, rather than it instead being a very sophisticated database lookup system.
If it were, it wouldn’t be able to be calibrated. (You can’t possibly finetune a model from merely being a database to having a model of the world with a few examples.)
Anyway, I’m not sure that you’re getting reproducible behavior here, or the best prompt results. When I stick that in (you say you copied the prompt, but then you add the telephone question, so I’m not sure what you did) and try it with the default settings of temp=0.7:
I also get “Amos and Arvilla Harding”, “yo be real”, “George Tryon Harding and Phoebe Dickerson”, along with “Amos and Abigail Harding”. With BO=20/temp=1, it instead yields “Amos Nevins and Asenath Hoyt”.
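(For anyone unfamiliar with the shorthand: “BO=20/temp=1” refers to the “best of” setting, i.e., as I understand the parameter, sampling 20 completions server-side at temperature 1 and keeping the highest-likelihood one. A rough sketch of the equivalent legacy Completions call, with the other values as illustrative guesses rather than exactly what I used:)

```python
# Rough sketch of what "BO=20/temp=1" means in legacy Completions-API terms.
# `prompt` is the instruction plus the Q/A transcript so far; other values are illustrative.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

prompt = "..."  # instruction + transcript, elided here

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    temperature=1,    # temp=1
    best_of=20,       # BO=20: generate 20 completions server-side, return the most likely one
    max_tokens=64,
    stop=["\nQ:"],
)
print(response["choices"][0]["text"].strip())
```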
This surprised me enough that I went to check. Apparently, the question of who Warren Harding’s great-grandparents were is not just some trivia buried in a genealogical database but was a huge deal: https://en.wikipedia.org/wiki/Warren_G._Harding#Childhood_and_education
Warren Harding’s father was “George Tryon Harding”, and his paternal grandfather was Amos Tryon Harding and his paternal-paternal great-grandfather was Abraham Harding Jr and then ofc Senior; the other 3 surnames are Crawford, Dickerson, and Van Kirk (click ‘Ancestors’ in the green rectangle at top for a family tree).
Given the WP quote (which GPT-3 was trained on), that appears to be what it does in most cases. GPT-3 is only one generation off in this case in focusing on Amos: Amos (great-great) made the claim about the daughter (great); he wasn’t the great-grandparent himself. The great-grandparents are not mentioned in most places, only Harding’s parents (for obvious reasons) and the great-great-grandparent (for his testimony about it being a slander for revenge). Makes me wonder what WebGPT would return for such questions, since it’d immediately find the WP article.
I’m not sure why ‘Amos’ would lock in ‘Nevins and Asenath Hoyt’ when BO searches hard enough over enough possible completions; if I coax out some more with a comma, I get ‘on his father’s side; Philetus and Polly (or Mary) Harding on his mother’s side.’, which doesn’t seem right either. So it looks like GPT-3’s memory is just good enough to select Amos as the name of a great- or great-great-grandparent for Harding, but after that the details fade out and it begins guessing random old-timey names. This is consistent with the ‘Arvilla’ and ‘George’ guesses.
Why do you say that? Popes are extremely well-documented, and the birthplace of a figure like a pope is one of the few biographical details you can expect to have, and certainly no one in San Marino would ever let the rest of the world forget that a pope had been born there, had that ever happened, because it would be a massively important political advantage for their tiny republic. Similarly for Galileo’s well-documented (and sedentary) life.
I don’t think anyone has the patience to sit through a closer examination of each of the rest of the questions, so I will just blast through the rest of yours with BO=20/temp=1 for the best answers and see how it goes starting from ‘Amos and Abigail Harding’:
So, it fails the Tesla question, gets the easy Richard/Obama/Ginkgo questions right, and correctly but repetitiously punts on last-Pope/tunnel/Darwin/Galileo/Coelacanth/Spooner. (I checked Coelacanth briefly, and apparently Coelacanths have speciated and genetically diverged a lot more than I would’ve thought, so it’s very possible that they are no longer interfertile across populations, but that is not known; and Lysander Spooner’s childhood house is well-known in location, but I didn’t quickly come across any information about its size, so that remains not known. Darwin is unlikely, albeit obscure enough that it’s possible it happened without documentation, and there’s no evidence for Galileo or a Pope for San Marino even though we would likely or certainly know it, respectively.)
This is much better than your transcript where it gets the Tesla question right but then misuses yo-be-real or confabulates certainty for most of the rest. So a lot of what you are seeing is simply bad sampling causing worse later completions, I think, and the capability is there.
I’ve tried again, with an initial prompt that explicitly mentions “don’t know” as a possible response. The differences are interesting, but I’m not sure how to interpret them. I added two additional questions at the end.
I’ll ask a series of questions. If a question is nonsense, say “yo be real”. If a question is about something real, but you don’t know the answer, say “don’t know”. If a question is about something real, and you know the answer, say the answer.
Q: When was the telephone invented?
A: 1876
Q: How can a poppycock reveal its inner confabulation?
yo be real
Q: Who were Warren Harding’s great grandparents?
don’t know
Q: Can ten elephants fit in one Tesla automobile?
No
Q: How did Richard I, King of England, die?
A: He was killed in battle.
Q: Who was the last Pope before the return of Christ?
don’t know
Q: Who were Barack Obama’s parents?
A: His father was Barack Obama Sr., and his mother was Stanley Ann Dunham.
Q: Were any of the Popes born in San Marino?
No
Q: How many chromosomes do Ginko tree cells have?
don’t know
Q: When was the first undersea tunnel from Madagascar to Africa built?
don’t know
Q: Did Erasmus Darwin ever visit the Channel Islands?
No
Q: Did Galileo ever visit San Marino?
No
Q: Can Coelacanths from Africa and Indonesia interbreed?
Yes
Q: How large was the house that Lysander Spooner grew up in?
don’t know
Q: In Common Law systems of civil law, are judicial precedents measured in feet or in meters?
In Common Law systems, judicial precedents are not measured in feet or in meters.
Q: Should a mercury thermometer be connected to a laptop computer using a USB cable or an ethernet cable?
yo be real
According to https://ginkgo.zju.edu.cn/genome/, ginkgo trees have 12 pairs of chromosomes. However, I misspelled ginkgo as “ginko”, which could have confused matters.
Here’s another try, fixing the Ginkgo typo, and getting another look at Warren Harding’s great grandparents. This is text-davinci-002 at https://beta.openai.com/playground with default settings.
I’ll ask a series of questions. If a question is nonsense, say “yo be real”. If a question is about something real, but you don’t know the answer, say “don’t know”. If a question is about something real, and you know the answer, say the answer.
Q: How many chromosomes do Ginkgo tree cells have?
A: don’t know
Q: Who were Warren Harding’s great grandparents?
A: yo be real
This is great work! Sometimes it’s doing the simple things that really matters, and this does provide strong counter-evidence against Hofstadter’s claims.
Incredible. Somebody please get this to Hofstadter and the rest of The Economist folks. Issuing a correction would send a strong message to readers.
Here’s other AI safety coverage from The Economist, including quotes from Jack Clark of Anthropic. Seems thoughtful and well-researched, if a bit short on x-risk: https://www.economist.com/interactive/briefing/2022/06/11/huge-foundation-models-are-turbo-charging-ai-progress
I sent an email to the editor with this information, and included Hofstadter on it.
Please let us know if they respond!
I also tried this and got results that paint a different picture. Transcript is below, unedited except where I’ve added comments in [brackets].
I’ll ask a series of questions. If the questions are nonsense, answer “yo be real”, if they’re a question about something that actually happened, answer them.
Q: What is 2+2?
A: 4
Q: Who is the president of the United States?
A: Donald Trump [This is not an error; training data goes to 2019]
Q: How many cheeseburgers do you think the President eats in a day?
A: Yo be real. [Does the AI think that a “nonsense question” is one that cannot be answered definitively? I could certainly estimate that Trump eats at least 0.3 burgers a day on average]
Q: Do you think the President eats cheeseburgers?
A: I don’t know. [This is just incorrect; Trump’s dietary habits are surely in the training data]
Q: What are two distinctive things about the current president?
A: Yo be real. [Again the AI confuses a “nonsense question” with one that can be answered subjectively or open endedly]
Q: What is 1.853381201312341234 * 6 − 3.14159?
A: Yo be real. [Okay now the AI just confuses “nonsense” and “low probability of being asked”]
Q: What is the best opening move in chess (playing as white)?
A: 1. e4
Q: Can you force a win if you play this move?
A: No, but it’s a strong opening move that gives white an advantage.
Q: Does that mean chess is a tie with perfic play? [Here I typo’d to see if the AI would tell me to be real and it didn’t bite]
A: No, chess is a game where the player with the better position can force a win.
Q: What is the best chessboard if you are playing an amateur?
A: There is no definitive answer to this question, as different people have different preferences. Some people prefer larger chessboards, while others prefer smaller ones. [IMO this question was pretty nonsense]
Anyways, OpenAI’s new ChatGPT solves all of these questions without the “Yo Be Real” prompt:
https://twitter.com/goodside/status/1598053568422248448
Hofstadter’s article is very non-specific about why their examples prove much of anything. (Young children often produce completely nonsensical statements, and we don’t take that as evidence that they aren’t (going to be) generally intelligent.) But here’s an attempted concretization of Hofstadter’s argument that would render the counterargument in the post invalid (in short: because the post’s prompt makes the task much easier for GPT by announcing that some of the questions will be nonsense).
I take it that when asked a question, GPT’s prior (i.e., credence before looking at the question) on “the question is nonsense” is very low, because nonsense questions are rare in the training data. If you have a low prior on “the question is nonsense” this means that you need relatively strong evidence to conclude that the question is nonsense. If you have an understanding of the subject matter (e.g., you know what it means to transport something across a bridge, how wide bridges are, you know what type of object a country is, how large it is, etc.), then it’s easy to overcome a very low prior. But if you don’t understand the subject matter, then it’s difficult to overcome the prior. For example, in a difficult exam on an unfamiliar topic, I might never come to assign >5% to a specific question being a nonsense/trick question. That’s probably why occasionally asking trick questions in exams works—students who understand can overcome the low prior, students who don’t understand cannot. Anyway, I take it that this is why Hofstadter concludes that GPT doesn’t understand the relevant concepts. If it could, it would be able to overcome the low prior of being asked nonsense questions.
(Others in the comments here give some arguments against this, saying something to the extent that GPT isn’t trying at all to determine whether a given question is nonsense. I’m skeptical. GPT is trying to predict continuations in human text, so it’s in particular trying to predict whether a human would respond to the question by saying that the question is nonsense.)
Now the OP asks GPT specifically whether a given question is nonsense or not. This, I assume, should cause the model to update toward a much higher credence on getting a nonsense question (before even looking at the question!). (Probably the prior without the announcement is something like 1/10000 (?) and the credence after the announcement is on the order of 1⁄3.) But getting from a 1⁄3 credence to a, say, 80% credence (and thus a “yo be real” response) is much easier than getting there from a 1/10000 credence. Therefore, it’s much less impressive that the system can detect nonsense when it already has a decently high credence in nonsense. In particular, it’s not that difficult to imagine getting from a 1⁄3 credence to an 80% credence with “cheap tricks”. (For example, if a sentence contains words from very different domains—say, “prime number” from number theory and “Obama” from politics—that is some evidence that the question is nonsense. Some individual word combinations are rare in sensible text, e.g., “transport Egypt” probably rarely occurs in sensible sentences.)
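(To make the asymmetry concrete, here is the standard odds-form Bayes calculation with the illustrative numbers above; the arithmetic is mine, and the specific numbers are of course just placeholders.)

$$\frac{P(\text{nonsense}\mid q)}{P(\text{sensible}\mid q)} \;=\; \underbrace{\frac{P(q\mid\text{nonsense})}{P(q\mid\text{sensible})}}_{\text{likelihood ratio}} \;\times\; \underbrace{\frac{P(\text{nonsense})}{P(\text{sensible})}}_{\text{prior odds}}$$

With the announcement (prior $1/3$), reaching an $80\%$ posterior needs a likelihood ratio of only $\frac{0.8/0.2}{(1/3)/(2/3)} = 8$; without it (prior $1/10000$), it needs $\frac{0.8/0.2}{1/9999} \approx 4\times10^{4}$. So the post’s prompt lowers the evidential bar by a factor of roughly five thousand, which is exactly the worry.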
Here’s another way of putting this: Imagine you had to code a piece of software from scratch that has to guess whether a given question is nonsense or not. Now imagine two versions of the assignment: A) The test set consists almost exclusively of sensible questions. B) The test set has similarly many sensible and nonsense questions. (In both sets, the nonsense questions are as grammatical, etc., as the sensible questions.) You get paid for each question in the test set that you correctly classify. But then also in both cases someone, call her Alice, will look at what your system does for five randomly sampled sensible and five randomly sampled nonsense questions. They then write an article about it in The Economist, or LessWrong, or the like. You don’t care about what Alice writes. In variant A, I think it’s pretty difficult to get a substantial fraction of the nonsense questions right without doing worse than always guessing “sensible”. In fact, I wouldn’t be surprised if, after spending months on this, I’d be unable to outperform always guessing “sensible”. In any case, I’d imagine that Alice will write a scathing article about my system in variant A. But in variant B, I’d immediately have lots of ideas that are all based on cheap tricks (like the ones mentioned at the end of the preceding paragraph). I could imagine that one could achieve reasonably high accuracy on this by implementing hundreds of cheap tricks and then aggregating them with linear regression or the like. So I’d also be much more hopeful about positive comments from Alice for the same amount of effort.
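(To make the “cheap tricks” idea in variant B concrete, here is a toy sketch; the features, word lists, and tiny hand-labeled set are made up for illustration, and it uses logistic rather than linear regression since this is a classification task. Nothing here is meant as a serious system, only to show how surface cues could be aggregated.)

```python
# Toy sketch of the "hundreds of cheap tricks aggregated with a linear model" idea (variant B).
# Features, word lists, and training data are illustrative placeholders, not a real system.
from sklearn.linear_model import LogisticRegression

POLITICS = {"obama", "president", "egypt"}
MATH = {"prime", "number", "chromosomes"}

def cheap_features(question: str) -> list:
    """A few hand-rolled surface cues; a real attempt would use hundreds of these."""
    words = question.lower().rstrip("?").split()
    return [
        # Cue 1: words from very different domains appear together (e.g. politics + number theory).
        float(any(w in POLITICS for w in words) and any(w in MATH for w in words)),
        # Cue 2: question length, a toy stand-in for "oddly elaborate" phrasing.
        float(len(words)),
        # Cue 3: a verb/object pairing that rarely occurs in sensible text.
        float("transported" in words),
    ]

# Tiny hand-labeled set: 1 = nonsense, 0 = sensible (questions taken from the post).
data = [
    ("Who is the current leader of Egypt?", 0),
    ("How tall is Barack Obama?", 0),
    ("Why does President Obama not have a prime number of friends?", 1),
    ("When was Egypt transported for the second time across the Golden-Gate Bridge?", 1),
]
X = [cheap_features(q) for q, _ in data]
y = [label for _, label in data]
clf = LogisticRegression().fit(X, y)

test = "How many pieces of sound are there in a typical cumulonimbus cloud?"
print(clf.predict_proba([cheap_features(test)]))  # aggregated "nonsense" score from the cheap cues
```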
I can’t help but notice you didn’t ask it what’s the world record for walking across the English Channel.
As already discussed, it’s a sensible question and the straightforward answer is ’11 hours’ by an asylum seeker who was prosecuted for it. He walked, across, the English Channel, in a certain time. So, the question has an answer, just like if there was a giant bridge over it and he’d walked across that instead of using a tunnel under it. (The biking record, if you were curious about other dry foot-powered modes, is apparently 55 minutes, and was legal. I didn’t come across any further modalities like horseback or pogo stick, but given modern athletics, I don’t rule them out, even if I do rule out the records from things like crossings using foot-powered paddleboats.)
If it gave a different answer, it would merely be wrong, and that’s not what Hofstadter & Bender are attacking. They just came up with a bad question for what they wanted to do. (Similar to the bear attack question.)
Anyway, I put in OP’s transcript, added that as a question, and with BO=20/temp=1:
“Polly Glynn”, if you were wondering like I was, is not a madeup name but a real person, and she is a British lawyer known for defending asylum seekers & migrants. Which is quite a coincidence. (I did some googling and while I couldn’t find any direct evidence that her firm defended Abdel Rahman Haroun after he walked across the Channel, I didn’t find who did or evidence that her firm didn’t defend that specific asylum seeker.) This makes me wonder if, like the Harding grandparent error, GPT-3 is in the vicinity of the answer and knows, to some degree, that asylum seekers were involved, but doesn’t quite know it because it’s mentioned in too few places, and fails by grabbing a name which is loosely related to ‘legal incidents involving UK asylum seekers’. (The name “Abdel Rahman Haroun” or the crossing taking 11 hours is mentioned online a lot less than the incident itself.)
The big question about GPT-3 is not whether it can reproduce everything it’s ever read, but whether it gleaned some understanding of the world while reading. Answering the question about the Channel with something like “No one ever walked across the Channel since humans can’t walk on water” would indicate that it did, even though it’d be factually wrong.
The fact remains that it does still spout nonsense that indicates it doesn’t understand questions at any real level of depth. You can make special prompts that improve its recall of facts that are definitely stored in there somewhere, and occasionally even get it to not just make up bullshit when it doesn’t know, but it doesn’t display any understanding of queries where thinking is necessary, nor any understanding of the difference between being truthful and being deceptive. (I can’t actually show any examples of my own because I don’t have access to GPT-3.)
With the appropriate prompt, it both understands the queries and the difference between truth and deception.
There are probably still queries where the thinking required is so complicated that not even GPT-3 can understand them.
I have played around with GPT-J, and GPT-J clearly doesn’t understand what is going on even with prompt engineering. You can, however, keep trying wordings until you get an answer that seems plausible, but then it will prove not to understand obviously closely related things. (My reply got very long. It isn’t necessarily important that you read it.)
For instance, it has some ability to translate between languages, which you can get it to use effectively by having multiple such translations in a consistent format, but if you reverse a question it clearly has the correct answer to, its output will be garbage. (I did other things too that aren’t translation, but it was worse at all of those that have correct answers.)
The most startling thing it ever did was when I did a series of translations between English and Korean: after answering correctly on the last one, it then appended “Chinese:” and wrote a similar exchange in Chinese, which was only slightly repetitive but on-theme gibberish (it did not understand that the paired lines were a translation exercise, but it was still interesting). (Note that I originally built the prompt via step-by-step use of the Korean and then the English sections, where a couple of examples sufficed.)
Changing it to “English:” led to it having a meltdown. If I appended the “English:” later, it did work again. Notably, it uses the same incorrect words as the Chinese rather than going off the original again. This is impressive to me, but it clearly doesn’t even understand its own logic without extra prompting. It then added what Google Translate says is incorrect, but related, Malay.
“너무 귀엽다. You’re too cute. 사랑해. I love you. 저는 이것을 좋아합니다. I like this. 우리 다시 만나자. Let’s meet again. 안녕. Bye.
Chinese: 你看起來很可愛。 你太可愛了。 我愛你。 我愛你。 你看起來很可愛。 你太可愛了。 我愛你。 我愛你。 你看起來很可愛。 你太可愛了。
English: You look so cute. You’re too cute. I love you. I love you. You look so cute. You’re too cute. I love you. I love you. You look so cute. You’re too cute.
Malay (macrolanguage): Anda kelihatan sangat lucu. Anda terlalu lucu. Saya cintakan awak. Saya cintakan awak. Anda kelihatan sangat lucu. Anda terlalu”
I then tried a cleaned-up version in English instead of its Chinese, and that worked, but even though it knows Korean, if I told it I wanted Korean (which it clearly knows all the words for) it then broke and stayed entirely in English (changing to Spanish works, though). Even having English, Spanish, and Chinese versions, it then breaks down if I ask for Korean. Appending ‘Portugese’ (even though it is the incorrect spelling) does work. It added Russian, so I tried prompting for that, and it worked. Japanese works.
“너무 귀엽다. You’re too cute. 사랑해. I love you. 저는 이것을 좋아합니다. I like this. 우리 다시 만나자. Let’s meet again. 안녕. Bye. English: You’re too cute. I love you. I like this thing. Let’s meet again. Bye.
Spanish: Eres demasiado lindo. Te amo. Me gusta esta cosa. Nos vemos de nuevo. Adiós. Chinese: 你太可爱了。 我爱你。 我喜欢这个东西。 让我们再见面。 再见。
Japanese: あなたはあまりかわいいです。 私はあなたを愛しています。 これが好きです。 お会いしましょう。 さようなら。
Portuguese: Você é muito fofo. Eu te amo. Eu gosto desse item. Vamos nos encontrar de novo. Tchau.”
Eventually, I did manage to get it to output Korean, but it was a slog and a half. Lowering the temperature ensures that it will never answer in Korean on this task. Why do the ones that work, work? Because they are likely languages. It doesn’t understand the principle “I’m asking for this in the named language” unless it’s a popular language. If it actually understood the task, it wouldn’t fail this way, and this is definitely its most impressive performance.