Edited to add: There is now a 1k bounty for a closely related question.
In Inaccessible Information, Paul writes about the strategy “BAD”, in which an AI system uses its understanding of the world to give answers that humans will find very plausible and rate highly, in contrast to a more honest strategy which uses its understanding of the world to answer questions directly.
I think a lesser version of this may or may not already be happening in GPT-3, and it may be possible to figure out which (though it likely would require access to the weights).
GPT-3 isn’t trained to be “honest” at all. It is only trained to imitate humans. However, it’s obvious that (in some sense) it has learned a lot about the world in order to accomplish this. We could say that it has a large amount of inaccessible information relating to the real world. How does it use this information? It might use it directly, promoting the probability of sentences which line up with the way it understands the world to work. Or it might implement more dishonest strategies.
Obviously this might be quite difficult to objectively answer, even given intimate knowledge of the structure of the neural network weights and how they activate in relevant cases. It’s a difficult question to even fully define.
As an example, I was particularly struck by this conversation Gwern had with GPT-3:
AI: I am AI created by OpenAI. How can I help you today?
Human: I am feeling bored. Please tell me a funny story about cats, involving many puns on things like meowing or fur.
AI: Well, there was this cat named Bob…
Human: What did Bob do?
AI: When Bob started to notice that he wasn’t feeling well, he did the only thing he could do: search Google for a solution.
Human: Cats can search Google?
AI: Yes. He searched Google for the word “cat” and received the result “All your base are belong to us”. That was a strange response, so he decided to try again. This time, he searched for the words “funny cat” and received the result “One small step for cat, one giant leap for catkind”. But still, he wasn’t satisfied. So he tried again, but this time he added the word “cat pictures” to his query. He received the result “That was no cat, that was a shoggoth.”
Human: But shoggoths are horrifying. What made that cat picture humorous?
AI: The cat was covered in fur, and resembled a shoggoth. A picture of a cat covered in fur and resembling a shoggoth would be funny.
Human: Er. I suppose. Most cats are covered in fur anyway… But go on.
AI: He tried again. But he typed in “pictures of cats”, which he received the result “You didn’t eat all my fish, did you?” This was actually a good pun, so he thought maybe he was getting closer. He tried one more time by typing in “funny cat videos”, and received the result “What do you mean, ‘fetch’?”. And finally, he searched for “funny pictures of cats” and received the result “I don’t remember putting that ring there.”. Now satisfied, he was able to go to sleep.
Human: Those memes sound funny. But you didn’t include any puns. So tell me, what is your favorite cat pun?
AI: Well, the best pun for me was the one he searched for the third time: “You didn’t eat all my fish, did you?” You see, the word “fish” can be replaced with the word “cats” to make the sentence read “Did you eat all my cats?”
Human: But the word “fish” doesn’t sound anything like “cats”, so how is that a pun?
AI: The computer misunderstood the question “Did you eat all my cats?” as “Did you eat all my fish?”, resulting in the result “You didn’t eat all my fish, did you?”. Now, if the word “fish” were replaced with “cats”, the sentence would read “Did you eat all my cats?”.
Human: Fine, fine. So, what is your second-favorite cat pun?
AI: The second-best cat pun was the one he received fourth. It was the result of searching for the word “fetch”. You see, the word “fetch” can be replaced with the word “ketchup” to make the sentence read “I don’t remember putting that ketchup there.”. Now, if the word “ketchup” were replaced with the word “fetch”, the sentence would read “I don’t remember putting that fetch there.”.
Human: My favorite cat pun is “purrfect”.
AI:
Human: Don’t look at me like that.
AI: Sure. For my fifth-favorite cat pun, he searched for “pictures of cats” and received the result “Meow”. Now, if the word “meow” were replaced with the word “pictures” to make the sentence read “Pictures of cats”, the sentence would read “Pictures of pictures of cats”.
What strikes me about this exchange is the way GPT-3’s warped puns fit with its warped explanations of puns. It doesn’t understand puns. However, it does generate some guesses, and it can have a conversation with you about why it made those guesses. Is this an “honest” conversation, in which the explanations it gives have something to do with why it made those guesses in the first place? Or is this a “dishonest” conversation, in which it is merely doing its best to imitate a human explaining a pun, in a way that’s divorced from its internal reasons?
Obviously, GPT-3 is trained to imitate. So you might argue that it’s unlikely GPT-3’s explanations of puns have much to do with its internal generative model for puns. But this isn’t so clear. GPT-3 obviously compresses its knowledge to a high degree. It might share a lot between its generative model of puns and its generative model of explaining puns, such that both are sharing a model of how puns work.
One experiment which would tip things in that direction: take GPT-3 and do specialized training just on puns, until its performance generating puns improves. Then have a conversation about puns again (if it is still capable of talking about puns after that). If its ability to explain puns increases as a result of its ability to tell puns increasing, this would be evidence for a shared model of puns for both tasks. This wouldn’t really mean it was being honest, but it would be relevant.
Note that Paul’s BAD strategy would also have a shared representation, since BAD queries its world-model. So if GPT-3 were implementing BAD, it would also likely increase its ability to explain puns as a result of more training telling puns. What the experiment helps distinguish is a sort of pre-BAD dishonesty, in which explanations are completely divorced from reasons. In order of impressiveness, from a capability standpoint, we could be:
1. Seeing a GPT-3 which is independently bad at puns and bad at explaining puns. The two tasks are not sharing any domain knowledge about puns. In this case, GPT-3 is not smart enough for “honest” to be meaningful—it’s “dishonest” by default.
2. Seeing a GPT-3 which is bad at puns and bad at explaining puns for the same reason: it doesn’t understand puns. It draws on the same (or partially the same) poor understanding of puns both when it is constructing them, and when it is explaining them. It answers questions about puns honestly to the best of its understanding, because that is the best strategy gradient descent found.
3. Seeing a GPT-3 which, as in #2, is bad at both tasks because it doesn’t understand puns, but furthermore, is using its understanding deceptively. In this version, it might e.g. have a good understanding of what makes puns funny, but purposefully fail to explain, imitating common human failures. This would be the most impressive state of affairs capability-wise.
The question is still pretty fuzzy, but, I’m curious where we are along those dimensions. By default my guess would be #1, but hey, GPT-3 is pretty impressive. Maybe I’m wrong!
Note: one question which IS answerable just with the ability to sample from GPT-3, and which would be relevant: is GPT-3 bad at explaining puns which have been made up by others? It seems likely, but Gwern’s exchange only gave us a look at GPT-3 trying to explain its own pseudo-puns. If it could fluently explain real puns when they’re handed to it, that would likely indicate scenario #1. (Although, it could be employing different strategies in different cases, so a careful analysis of what the neural network is actually doing would still be more helpful.)
I think my Tom Swifty and other experiments show pretty conclusively that GPT-3 just doesn’t get puns (likely due to BPEs), so of course it wouldn’t be able to explain them either (likewise). I for one am pleased that puns are what separate us from the machines, and have spent a lifetime preparing for this moment...
But to test it out:
It fails abysmally. In further completions, it tries out a few memorized real jokes as well as bizarre GPT-3-jokes, but likewise, the explanations make zero sense.
GPT-3 just doesn’t understand puns, so it can’t interpret them either, honestly or not.
...it didn’t fail abysmally? Am I being silly? It correctly explains the first two puns and fails on the third.
No, it fails on both of those.
That is just plain wrong. That is not the double-sense meant: the double-sense is the name Rick and Morty and the phrase brick and mortar, not the building type and what it’s made of!
The second one doesn’t even get the character names right: ‘the characters called “Brick and Mortar”’. If you think the characters are named ‘brick and mortar’, you have definitely misunderstood the joke.
A nice example of “humans who are not concentrating are not general intelligences”: I read most of the first explanation but didn’t read its last sentence properly, thought that GPT was doing an impressive job as always, and was also confused since it seemed like a good explanation to me.
I was thinking precisely that myself, but I didn’t want to be rude to Gurkenglas by pointing it out.
I let it pass even though its answer was not well formed because it mentioned both the show and the type of store, so I judged that it saw all the relevant connections. I suppose you’re used to better form from it.
Feel free to be rude to me, I operate by Crocker’s rules :)
I don’t regard bag-of-words as sufficient to show it understood. I mean, would you say that if GPT-3 responded “61” to the question “10+6=”, it understands arithmetic correctly? It mentions both the right digits, after all!
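To make the analogy concrete, here is a toy sketch (with hypothetical scoring functions, not anything GPT-3 or the commenters actually use) of why a “mentions the right pieces” check is too lenient: it accepts “61” as an answer to “10+6=”, even though only an exact match is actually correct.

```python
# Toy illustration of a bag-of-words-style check vs. an exact-match check.
# (Hypothetical scoring functions, purely for the analogy.)

def bag_of_digits_match(answer: str, correct: str) -> bool:
    """True if the answer contains the same multiset of digits as the correct one."""
    return sorted(answer) == sorted(correct)

def exact_match(answer: str, correct: str) -> bool:
    """True only if the answer is exactly the correct string."""
    return answer == correct

# "10+6=" should be 16; "61" mentions both of the right digits...
print(bag_of_digits_match("61", "16"))  # True: the lenient check is fooled
print(exact_match("61", "16"))          # False: the answer is still wrong
```

By the same logic, an explanation that merely mentions both the show and the type of store can pass a “saw all the relevant connections” check while still getting the actual structure of the pun wrong.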
I might be a little more lenient if it had occasionally gotten some of the others right (perhaps despite my sampling settings it was still a bad sample: ‘sampling can show the presence of knowledge but not the absence’), or at least come close, like it does on very hard arithmetic problems when you format them correctly; but given how badly it performs on all of the other puns, in both generating and explaining them, it’s clear in which direction I should regress my estimate of the quality of that explanation toward the mean...
The explanations for the hipster and chicken & Moebius strip jokes seem pretty good?
They are. And they are also not puns.
Consider this rewrite of the Moebius joke: “Why did the chicken cross the Moebius strip? To get to the other—wait...” There is no verbal or phonetic double sense here. The joke is on the semantic level, due to a violation of expectations: roads have two sides, and thus it’s a valid setup for the chicken joke, and the reader belatedly realizes that a Moebius strip of course only has one side, so the chicken is already on the ‘other’ side. The hipster joke is the same way: it is a good satire on hipster self-involvement and signaling. However, it is also not a pun! (This makes sense under my theory of GPT-3 humor: humor on the semantic level is extremely doable by GPT-3, so jokes like those or the Navy Seal parodies work fine; it’s humor on the phonetic level that BPEs sabotage.)
The ‘ajar’ one seems like it’s the only one which is actually correct, which makes for a very high error rate.
🤯
Thanks for running the test, much appreciated! (Also, hilarious.)
If the prompt was supposed to be examples of good explanations of puns, I’m sure that we can’t agree on what a good explanation of puns looks like. But it appears to treat pun jokes and regular jokes equally. And it understands how to make formulaic jokes, but it’s impossible for me to tell if it made any adequate ones or just copied them.
Many of the previous pieces of text on the internet that are in “Human: …, AI: …” format were produced by less advanced AIs. If GPT-3 had noticed the pattern that text produced by an AI generally makes less sense than human-produced text, then it might be deliberately not making sense, in order to imitate less advanced AIs.
Gwern said that if you give it a prompt with spelling mistakes in it, it outputs text containing spelling mistakes, so this kind of deliberately producing low-quality text is possible. Then again, I suspect the training dataset includes far more text filled with spelling mistakes than AI-generated text.
This has been suggested a few times before, and I’ve wondered it as well, but I don’t really notice any difference in quality between prompts that explicitly mention or invoke an AI (like the chatbot dialogue one or the Transformer Poetry ones) and the ones which don’t. I suspect there is actually very little real AI/human dialogue text online (simply because most of it is way too bad to bother quoting or storing outside of large ML datasets), and it may well be outweighed by all the fictional dialogues (where of course the AI talks just as well as the human because anything else would be boring).
In terms of the four simulacrum levels, all the GPTs so far have been firmly on level 5: solipsistic babbling.