How far away is this from being implementable?
oceaninthemiddleofanisland
This probably won’t add too much to the discussion but I’m curious to see whether other people relate to this or have a similar process. I was kind of stunned when I heard from friends who got into composing about how difficult it is to figure out a melody and then write a complete piano piece, because to me, whenever I open up Sibelius or Dorico (and more recently Ableton), internally it seems like I’m just listening to what I wrote so far, ‘hearing’ a possible continuation lasting a few bars, and then quickly trying to transcribe it before I forget it, or if I really want to be precise then just the next note-group. It doesn’t really come from anywhere and it doesn’t require any thought, but I can tell it’s obviously taking up a share of my cognitive RAM from multitasking experiments, it’s definitely influenced by the music I’ve listened to recently (e.g. 1930s/40s jazz), and there are a lot of recognisable patterns. I gave up piano at Grade 1 and my theory went to Grade 2 (I think) where I stopped because I intensely despised it. I actively avoided formal instruction. It makes transcribing harder because I’m just clicking on notes to see if they match up with what’s in my head, and that interferes a lot with my memory; playing on an actual piano is even worse. So now what I do is use a phone app to record myself whistling 10-12 seconds of the ‘top’ melody, and then I play it back while making a new recording and whistle the notes underneath it too, and I keep doing that until all the chords are right and the signal isn’t too degraded. It’s still very annoying. Something I should note is that I whistle whenever I’m alone pretty much obsessively and that’s been the case since I was maybe eight or nine, especially to accompany whatever music is playing around me, and that I have mild autism.

It makes me think that with pretty much any creative skill, there are unconscious cognitive modules/black-boxes in play that have been developed either through a lot of exposure or through the internalisation/automatisation of heuristics and rules, which are responsible for predicting small sequences of actions (“what note comes next?”) or doing error-correction (“what sounds good?”). It’s difficult to notice/interact directly with them, but it’s possible when you override conscious controls. The easiest way to see this is to try asemic writing/typing – just typing or writing mindlessly and allowing your hands to move by themselves. Once you get into the groove with asemic typing, you get Markov-chain-like strings of letters that reflect the character distribution of the language you type with (there’s a toy sketch of what I mean at the end of this comment), and sometimes common words like ‘the’ or ‘and’. With asemic writing, you get common patterns of loops, vertical and horizontal lines, and connectors. I’ve seen what seem to be higher-level language modules at work when I’m in a semi-lucid verge-of-fully-waking-up/falling-asleep state where my eyes are open but I’m also in dreamspace at the same time (I have no idea how to describe this), and I can read an imaginary book in front of me or listen to someone, and it’s just a fluent stream of meaningless babble, often with a poetic quality to it, sometimes with consonants carried over to the next word or semi-rhymes that would be a pain to come up with consciously.
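(For anyone curious what I mean by ‘Markov-chain-like strings’, here’s a toy sketch in Python – the sample text and the two-character context are arbitrary choices for illustration, not anything measured from actual asemic typing:)

```python
import random
from collections import defaultdict

# Toy order-2 character-level Markov chain: output mirrors the letter
# statistics of the sample without meaning anything.
sample = (
    "the quick brown fox jumps over the lazy dog and the cat sat on the mat "
    "while the band played and the rain fell on the roof of the old house"
)

transitions = defaultdict(list)
for i in range(len(sample) - 2):
    transitions[sample[i:i + 2]].append(sample[i + 2])

def babble(length=80, seed="th"):
    out = seed
    for _ in range(length):
        out += random.choice(transitions.get(out[-2:], [" "]))
    return out

print(babble())  # gibberish scattered with fragments like 'the' and 'and'
```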
So I’ve figured this out. Kinda. If you choose ‘custom’ then it will give you Griffin, but if you choose one of the conventional prompts and then edit it, you can get around it. So damn annoying.
Wow, I didn’t realise I could get this angry about something so esoteric.
I’m beginning to think AID has changed what the “Dragon” model is without telling us for cost reasons, I’ve had kind of the same experience with big lapses in storytelling that didn’t occur as often before. Or maybe it’s randomly switching based on server load? I can kind of understand it if that’s the case but the lack of transparency is annoying. I remember accidentally using the Griffin model for a day when my subscription ran out and not realising because its Indonesian was still quite good...
Somehow the more obvious explanation didn’t occur to me until now, but check the settings, you might be using the Griffin model not the Dragon model. You have to change it manually even after you get the subscription. I have a window open specifically for poetry prompts (using the Oracle hack), I said “Write a long poem in Russian. Make sure the lines are long, vivid, rich, and full of description and life. It should be a love poem addressed to coffee. It should be 15 lines long” followed with “The Oracle, which is a native in Russian, writes: 1 Ой,”. That just gave me annoying stuff like “Oh, coffee, how I love you so much/You are sweet” in Russian so I then added a random Pushkin poem in the ‘death’ category (https://rustih.ru/aleksandr-pushkin-tam-u-leska-za-blizhneyu-dolinoj/) before it to increase the quality and right now it just gave me:
1 Ой да промелькнул январский день,
2 Когда я словно трепетный цветок
3 Нежно расставил руки,
4 Когда ты, чаша чистейшего черного кофе,
5 С сияньем твоим, я так созерцал.
6 Молодость! Как радостны те часы!
7 И я, целованьем твоим так молодо
8 Неслучайно тыкал я туда,
9 Где только чувства, только сердце,
10 Ты, скромный чайный граней, теплый,
11 Несли ты, с радостным свеченьем ступеней,
12 В смолкновенья сладких сумерек
and so on and so on, stopped it there
I have no idea whether this is good since all I did was check with GTranslate to make sure it was roughly on topic, I would ask GPT-3 to translate it for me but I’m guessing you’re a native speaker so you should be able to tell.
If it’s a BPE encoding thing (which seems unlikely to me given that it was able to produce Japanese and Chinese characters just fine), then the implication is that OpenAI carried over their encoding from GPT-2, where all foreign-language documents were removed from the dataset … I would have trouble believing their team would have overlooked something that huge. This is doubly bizarre given that Russian is the 5th or 6th most common language in the dataset. You may want to try prompting it with coherent Russian text; my best guess is that in the dataset, whenever somebody says “He said in Russian:”, what usually follows is poor quality (for instance I see this in bad fanfiction where authors use machine translation services to add ‘authenticity’), and that GPT-3 is interpreting this as a signal that it should produce bad Russian. I will give this a try and see if I encounter the same issue.
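(If anyone wants to check the encoding side of this for themselves, here’s a rough sketch using the GPT-2 tokeniser that GPT-3 reuses – the library and the example sentences are just my choices for illustration:)

```python
# Compare how the GPT-2 BPE (reused by GPT-3) splits English versus Russian.
# Assumes the HuggingFace transformers library; the sentences are arbitrary.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

for text in ("Oh coffee, how I love you", "Ой кофе, как я тебя люблю"):
    tokens = tok.tokenize(text)
    print(len(tokens), tokens)

# The Russian line comes out as several times more tokens per word, mostly
# byte-level fragments, because the BPE merges were learned on
# overwhelmingly English web text.
```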
That’s a visualisation I made which I haven’t posted anywhere else except under the r/ML thread collecting entries for GPT-3 demos, since I couldn’t figure out which subreddit to post it in.
Two thoughts, one of them significantly longer than the other since it’s what I’m most excited about.
(1) It might be the case that the tasks showing an asymptotic trend will resemble the trend for arithmetic – a qualitative breakthrough was needed, which was out of reach at the current model size but became possible at a certain threshold.
(2) For translation, I can definitely say that scaling is doing something. When you narrowly define translation as BLEU score (“does this one generated sentence match the reference sentence? by how much?”), then I agree that the benefits of scaling are marginal – for individual sentences, by that specific metric.
But here’s the thing: GPT-3 can produce idiomatically and culturally accurate translations of Chinese poetry, and then annotate its own translation with references to historical events, the literal versus contextual meaning of words, and so on. The end result actually sounds … like poetry. But it can do other things. If you give it a Japanese text and then tell it to translate for an American audience, it will either seamlessly explain those references in the translation, or replace the Japanese cultural references with their American equivalents entirely.
But it’s deeper than this. Some non-English languages have honorifics attached to verbs. Some languages have distinctions between the plural and singular form of ‘you’. Some languages have nouns that are inflected depending on whether the noun is in motion or not. Some languages have particles added to the ends of sentences that indicate whether the speaker is hesitant about the statement.
GPT-3 fills in the blanks by making real-world inferences.
If you told me a few years ago about a translation engine that could handle things like ambiguous pronouns, or keep track of speakers across several paragraphs, I would be amazed. If you’d told me about a translation engine that could accurately invent the appropriate missing information, or transfer nuances of the source into the target in a way that sounded natural, I flat-out wouldn’t believe you.
Okay, so what else? Some languages have multiple registers that depend on social context, or strongly regional dialects. Current translation engines rely on parallel corpora – for instance, news outlets that translate the same article into multiple languages, or EU documents which get translated into all major EU languages, feature very heavily in these kinds of corpora – so you end up getting a standardised, non-dialectal translation in a formal register.
GPT-3 is not limited by this. It can translate between dialects. It can translate between registers. It can pick up on things like “this story is set in Bandung” and “this character is a construction worker talking to a close friend, not a student talking to a teacher”, and then have the character start code-mixing Indonesian with Sundanese in the low form. I haven’t explored this deeply, but initial prompts suggest it’s capable of rendering Indonesian tweets and phone texts (with their various abbreviations) into their equivalents in English.
Here’s the kicker: Indonesian makes up only 0.05985% of GPT-3’s training corpus.
And for that same reason, GPT-3 can handle tone. It can understand the connotative difference between someone describing themselves as “slim”, “thin”, and “scrawny”, and then find a way to appropriately convey it in the target language – and if the target language doesn’t have those separate shades of difference, and you tell it that conveying the difference unambiguously is very important to you, it will figure out ways to do it unprompted, like modifying the tone of surrounding words, or adding a simile where the character compares themselves to a skeleton, up to adding an entire extra scene that doesn’t interrupt the main narrative just to make it clear.
(I have not seen it do this consistently, but on two occasions I have seen it invent new words in Indonesian, which uses affixes to modify root-forms – e.g. ‘memasak’ = ‘to cook’, ‘masakan’ = ‘cooking / a dish’, etc. – that a Google search verified weren’t in the training corpus. Unfortunately in some situations, it will instead use a word with a different meaning but in the same category [e.g. instead of ‘striped turquoise midi-dress’, you might get ‘spotty blue wrap-dress’], when it judges the difference to be unimportant to the story. Good in some contexts but annoying in others.)
So this is all great. For anyone that consumes text media, I mean – not for translators (I doubt we’ll be put out of a job but the skill requirement will drop considerably, I think) – it means a huge ocean of previously unreadable knowledge and entertainment is suddenly going to be accessible.
But I’m a language learner as well, and my guess is that this community might have more of us than the baseline average, so here are some other obvious but useful things it can do:
I. It can create arbitrary amounts of synthetic learning material.
This is a big deal for a few reasons.
(A) Sometimes, for less commonly-learned languages like Indonesian, there isn’t much learning material available in the first place. The only Anki deck available is filled with sentences like “I can do it” and “John was angry at me”. This is an issue if you want mass immersion. Quantity is an issue.
(B) Sometimes, there isn’t material on stuff you’re interested in, things that are relevant to you. Quality is an issue. The key thing that predicts learner performance is interest. If all the textbooks you’re reading are oriented towards tourists and they’re talking about hotels and making small-talk about the weather, and you want to read, I don’t know, cute light-hearted yuri manga, or military strategy in the South China Sea, then you’re screwed… unless you have GPT-3. If there’s a particular grammatical feature you’re having trouble internalising, then you provide GPT-3 with a few examples and it’ll happily provide you with a hundred more. If there’s a word that isn’t sticking in your memory...
(C) A combination of A and B: the best way to learn a language is by actively using it. Constantly. Not just passively reading it, but producing it yourself. What hyperpolyglots usually recommend is going and living in the country that speaks your target language, or regularly having conversations with people who do. That’s an issue if (1) you have problems with social anxiety, (2) there aren’t people nearby, or (3) you aren’t willing to uproot your entire life and spend tens of thousands of dollars just to learn a new language.
This is where AIDungeon’s fine-tuned GPT-3 instance comes in. You select a scenario that involves the set of vocabulary you want to practice (if you’re planning a trip to Hungary, you simulate a trip and the hotel-stay, if you’re moving to a school, you simulate being a student at the International School of Budapest), or a story you could see yourself being invested in (horniness not precluded).
Then you customise it according to your level, the goal being comprehensible input that’s just at the edge of your comfort zone. If you’re an advanced learner with a lot of vocabulary under your belt, you use a handful of target-lang words to tell the model it’s meant to be speaking Hungarian, not English, and you jump into the deep-end and enjoy participating in the story (writing your dialogue and prose, etc.) while adding any unknown vocab items to Anki. If you’re a beginner, you should probably make a scenario involving a personal tutor who tests you after each lesson while introducing new words slowly and explaining concepts fully (see part II).
If you’re an intermediate learner, things are tougher. There might be an easier way to do this than what I’m about to describe, which is part of the point of this post, since I want to get new ideas from the community. What I’ve found works for me is priming the model to produce English translations after target-lang sentences. When the target-lang sentence comes up, you try to guess the English translation before hitting the generate button. Cool, reading comprehension – done. If you want control over the narrative, or you want to do this in reverse, you prepend each paragraph with either ‘English [line number]’ or ‘target-lang [line number]’, and then shuffle the order of those paired paragraphs randomly so it translates to target-lang when it sees English, and to English when it sees target-lang. What about speaking/writing? Again, what works for me is talking in target-language pidgin, where you just use English for any words you don’t know, and then priming the model to produce grammatically correct translations after your shitty dialogue. Contrary to intuition, mixing like this is not at all harmful for language learning.
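(Here’s a rough sketch of how you could mechanise the ‘shuffled paired paragraphs’ priming instead of building the prompt by hand – the sentence pairs and the final English line are made-up placeholders, and the exact labelling format is just what’s worked for me, not anything canonical:)

```python
import random

# Made-up sentence pairs purely as placeholders; in practice these would come
# from your own Anki deck or the story so far.
pairs = [
    ("English", "The construction worker waved at his friend.",
     "Indonesian", "Tukang bangunan itu melambai ke temannya."),
    ("English", "It has been raining all afternoon.",
     "Indonesian", "Hujan turun sepanjang sore."),
    ("English", "Let's get something to eat near the station.",
     "Indonesian", "Ayo cari makan di dekat stasiun."),
]

blocks = []
for i, (l1, s1, l2, s2) in enumerate(pairs, start=1):
    # Randomly flip which language comes first, so the model learns to go in
    # both directions depending on which label it sees.
    if random.random() < 0.5:
        l1, s1, l2, s2 = l2, s2, l1, s1
    blocks.append(f"{l1} [{i}]: {s1}\n{l2} [{i}]: {s2}")

n = len(pairs) + 1
prompt = ("\n\n".join(blocks)
          + f"\n\nEnglish [{n}]: I missed the last train home."
          + f"\nIndonesian [{n}]:")
print(prompt)  # paste into AID / the playground and let it continue
```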
(D) Sentence pairs with translations are the mainstay of self-directed language-learning, because they’re easy to find (mostly). But using language isn’t all about translation. What part of speech is this? What if you wanted to vary things? Here is a conversation, what would be appropriate to say next? Clozes / filling in the gaps – what word would be most appropriate in this sentence? Is the meaning of this sentence closer to options A, B, C, or D? What register would you use in this social situation? Speak your response quickly then write it. Quick, name 10 words related to this word. See this paragraph? Summarise it. What about this argument, is it logically flawed? It’s interesting how a lot of NLP datasets I’ve come across actually make for very good flashcards for language learning, which, I suppose, isn’t all that surprising.
II. It can explain things.
What’s the difference between ‘bantu’ and ‘tolong’ – they both mean ‘help’, but how do you use them in sentences? I don’t understand why the words are being ordered like this, explain the grammar to me. Why does this flashcard translate ‘you’ as ‘loe’, while another one uses ‘kalian’, or ‘kamu’, or ‘kau’ or ‘Anda’? (For this, you need to prime it with a nonsense / meaningless question and have it say ‘I don’t know’, otherwise it’ll make up answers for things it doesn’t know, or words that literally mean the same thing with no difference in usage whatsoever.)
But the great thing is, it can draw on real-world knowledge. You’re never learning just a language. You’re also learning the cultural context in which that language is used. If you try to do the former without doing the latter, some linguistic idiosyncrasies are going to remain mysterious to you until someone explains that the weird ungrammatical phrase you’re having trouble understanding, actually came from a 1998 hit soap-opera and now it’s just a part of the language. Or that this term is a historical one that refers to Sukarno’s policy of civil-military integration. Or that the reason why none of the dialogue involves first-name usage is because it’s super impolite to do that with someone you don’t know well.
Sometimes you can scan the indices of appropriate textbooks, or do a google search. But sometimes there aren’t textbooks, sometimes you don’t even know what to search for, sometimes you’re asking a question that’s never been asked before. And I think that’s the real power of GPT-3 as it exists right now – all of the human knowledge that’s currently unindexed, informal, uninterpretable, implied, ambiguous, unclear and inaccessible – it makes available with a single query.
Or an hour finicking around and handwriting 20 examples until it cottons on.
But … that happens more often with other contexts. Getting it to count parentheses accurately is like pulling teeth, but with translation tasks GPT-3 seems to go “aha! now this is something I’m good at!” and then explains the tonal differences between “神様” and “女神” in a Japanese poem about a lesbian sea-goddess it wrote five minutes ago. OpenAI’s paper was doing GPT-3 a really, really big disservice by quantifying it by its BLEU score. When it comes to language, GPT-3 isn’t a model, it’s a maestro.
I just finished Iain M Banks’ ‘The Player of Games’, so my thoughts are being influenced by that, but it had an interesting main character who made it his mission to become the best “general game-player” (i.e. no specialising in specific games), so I would be interested to see whether policy-based reinforcement learning models scale (thinking of how Agent 57 exceeded human performance across all Atari games).
It seems kind of trivially true that a large enough MuZero with some architectural changes could do something like play chess, shogi and go – by developing separate “modules” for each. At a certain size, this would actually be optimal. There would be no point in developing general game-play strategies; it would be redundant. But suppose you scale it up, and then add hundreds or thousands of other games and unique tasks into the mix, with multiple timescales, multiple kinds of input, and multiple simulation environments? This is assuming we’ve figured out how to implement reward design automatically, or that the model/s (multi-agent?) themselves develop a better representation of reward.
At what point does transfer learning occur? I saw a very simple version of this when I was fine-tuning GPT-2 on a dataset of 50-60 distinct authors. When you look at validation loss, you always see a particular shape – when you add another author, validation loss rises, then it plateaus, then it sharply falls when the model makes some kind of “breakthrough” (whatever that means for a transformer). When you train on authors writing about the same topics, final loss is lower, and the end outputs are a lot more coherent. The model benefits from the additional coverage, because it learns to generalise better. So after piling on more and more games and increasing the model size, at what point would we see the sharp, non-linear fall?
At what point does it start developing shared representations of multiple games?
At what point does it start meta-learning during validation testing, like GPT-3? Performing well at unseen games?
What about social games involving text communication? Interaction with real-world systems?
Suppose you gave it a ‘game’ involving predicting sequences of text as accurately as possible, and another involving predicting sequences of pixels, and another with modelling physical environments. At what point would it get good at all of those, as good as GPT-3?
And we could flip this question around: suppose you fine-tuned GPT-X on a corpus of games represented as text. (We already know GPT-2 can kind of handle chess.) At what version number do you see it start to perform as well as MuZero?
The other commenter is right – the architecture is important, but it’s not about the architecture. It’s about the task, and whether, in order to do that task well, you need to learn other, more general and transferable tasks in the process.
In this sense, it doesn’t matter if you’re modelling text or playing a video-game: the end result – given enough compute and enough data – always converges on the same set of universally-useful skills. A stable world-model, a theory of mind (we could unpack this into an understanding of agency and goal-setting, and keeping track of the mental-states of others), meta-learning, application of logic, an understanding of causality, pattern-recognition, and problem-solving.
Yes! I was thinking about this yesterday. It occurred to me that GPT-3’s difficulty with rhyming consistently might not just be a byte-pair problem: any highly structured text with extremely specific, restrictive forward and backward dependencies is going to be a challenge if you’re just linearly appending one token at a time onto a sequence without the ability to revise it (maybe we should try a 175-billion-parameter BERT?). That explains and predicts a broad spectrum of issues and potential solutions (the solutions are what I call Strategies A, B and C below): performance should correlate with (1) the allowable margin of error per token-group (coding syntax is harsh, solving math equations is harsh, trying to come up with a rhyme for ‘orange’ after you’ve written it is harsh), and (2) the extent to which each token-group depends on future token-groups. Human poets and writers always go through several iterations, but we’re asking it to do what we do in just one pass.
So in playing around with GPT-3 (AID), I’ve found two (three?) meta approaches for dealing with this issue. I’ll call them Strategies A, B and C.
A is the more general one. You just give it multiple drafting opportunities and/or break up the problem into multiple smaller steps. So far I’ve seen it work for:
(1) Boolean logic, algebraic equations, and simple math equations work (guess-and-check). When I have time in a few days, I’m going to get it to mimic the human heuristic for calculating approximate square-roots over multiple iterations (see the sketch after this list).
(2) Translating Chinese poems to English roughly and then touching them up in the second draft. Same with editing any kind of text.
(3) Tricky coding problems (specifically, transforming a string into Pig Latin). First, instead of asking it to “solve the problem”, you ask it to “come up with five possible strategies for solving the problem”, and then “select the most plausible one”. Then you say “you made several structural, syntactical, and interpretive mistakes”, allow it to come up with a long list of those possible mistakes, say, “now try again”, and do that as many times as the context window allows. The end result isn’t always functional, but it’s a lot better than asking it to solve something in one pass.
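(Since I mentioned the square-root heuristic in (1), here’s the iteration I have in mind written out in full – it’s just the standard guess-and-average (Babylonian) method, i.e. the multi-step process I’d be prompting GPT-3 to imitate, not anything I’ve tested on it yet:)

```python
# Guess-and-average square-root heuristic: guess, divide, average, repeat.
# Each loop iteration corresponds to one 'draft' in the multi-step prompt.
def approx_sqrt(n, steps=5):
    guess = n / 2.0  # any positive starting guess works
    for _ in range(steps):
        guess = (guess + n / guess) / 2.0
    return guess

print(approx_sqrt(10))  # ~3.16228 after a few iterations
```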
B is the moderately less general and more obvious second approach, which synergises well with the first. B is forcing GPT-3 to plan explicitly.
(1) In writing an article, you get GPT-3 to start by writing a vague summary, then a more in-depth summary, then listing the key points and subpoints in order. By periodically forcing it to summarise its discussion up to a given point, you can exceed the window length while retaining coherency.
(2) In writing poetry from a prompt, you get GPT-3 to discuss and tease out the implications of the prompt and describe the process of planning the poetry first.
(3) In translating, you get it to list out the key potential translation errors that could be made, and the different choices a translator could make in translating each line.
(4) In writing code, you get GPT-3 to simulate several people discussing the problem requirements and arguing constructively with one another (simulating just one person means if that one person goes off track or misinterprets the problem, future continuations are poisoned with the error since they need to be consistent), then producing English pseudo-code that describes the process in abstract, and only then the actual code.
I decided to add ‘simulating multiple people’ as a Strategy C, but it’s kind of the same thing as Strategy A, just in a way that allows more room for error. The issue is that in most single-author texts, people try to be consistent with what they’ve said before, but in GPT-3, this can cause minor errors (for instance, self-contradiction) to accumulate over time, which reduces generation quality. But we’ve seen that something as simple as adding dialogue between two people allows GPT-3 to arrive at accurate and more complex solutions much more reliably. This works for a broad spectrum of media: articles, poetry, translation, and coding. All you need to do is create a ‘critic’ who interrupts after each line or paragraph, and then if you really need one, a critic who criticises the first critic. The key here is constructive rather than destructive criticism, since GPT-3 is perfectly capable of producing vacuous and petty critiques.
All three of these strategies together tend to vastly improve performance on tasks where (1) the allowable margin of error per token-group is quite small (for instance, solving 83x42), and (2) current token-groups depend on future token-groups. I have not tested this for rhyming, but it seems simple enough to check.
In other words, GPT-3 does better at solving problems when you get it to simulate the way humans solve problems: with multiple attempts, with explicit planning, and by collaborating with other humans.
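(For anyone with API access rather than AID, here’s a minimal sketch of what Strategies A and C might look like as a loop around the openai Python library’s Completion endpoint. I haven’t run this exact script, so treat the engine name, token limits, and critic wording as placeholders rather than a tested recipe:)

```python
import openai  # assumes openai.api_key has already been set

def complete(prompt, max_tokens=200):
    response = openai.Completion.create(
        engine="davinci", prompt=prompt, max_tokens=max_tokens, temperature=0.7
    )
    return response.choices[0].text

task = "Translate the following Chinese poem into English:\n[poem goes here]\n"

# Strategy A: multiple drafting passes instead of one-shot generation.
draft = complete(task + "\nFirst rough draft of the translation:\n")

# Strategy C: a second 'translator' criticises the draft constructively.
critique = complete(
    task + "\nFirst rough draft:\n" + draft
    + "\nA second translator lists the structural and interpretive mistakes "
      "in this draft, constructively:\n"
)

# Back to Strategy A: revise using the critique.
revision = complete(
    task + "\nDraft:\n" + draft + "\nCritique:\n" + critique
    + "\nRevised translation that addresses the critique:\n"
)
print(revision)
```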
Edit: my attempts at making GPT-3 rhyme failed. Here is what I tried, and what I figured out.
(1) It has a vague idea of rhyming – if you fill its context-window with groups of words that rhyme, about 40-60% of the words in its next generation will rhyme, and the rest will look like rhymes (as in, they end with the same couple of letters but are pronounced differently in English – e.g. dough, cough, rough, etc.).
(1a) Most rhyming websites are query-based. From what I could tell, GPT-3 has not memorised the layout of the most common rhyming websites to the degree where it could reproduce the formatting consistently. This is not surprising given that Common Crawl abides by nofollow and robots.txt policies, and that OpenAI may have filtered these pages out when they were paring the dataset down to ‘high-quality’ documents.
(1b) GPT-3 knows how most Chinese words are pronounced, even if it gets the tone wrong sometimes. It rhymes more consistently in languages with uncommon diacritic markings, more with languages that don’t use Latin characters, and even more consistently in non-Latin-based languages with phonemic orthography, but not by much. With Russian, you hit the jackpot – BPE represents it as individual characters, it’s mostly phonemic, there’s a lot of Russian in GPT-3’s dataset, and a lot of rhyming poetry – but it still does poorly. This suggests that the absence of look-ahead, plus the randomness introduced by sampling, is the main issue here. Unfortunately the other most-well-represented languages in its dataset with non-Latin phonemic orthography (Japanese kana, Korean hangul, Arabic script) each have their own issues – rhyming the last syllable of each line in Korean is easy since it’s an SOV language and all you have to do is match the verb conjugation, so it doesn’t have much literary value, and most of the Korean rhyming in the dataset would likely be modern rap, which sometimes uses multiple syllables. Arabic omits short vowels. Japanese I know less about, but iirc rhyming is much less common than other forms of constrained writing (e.g. haiku) that emphasise rhythm, and mostly occurs in J-pop.
(2) Giving it multiple attempts failed. ‘Multiple generations for each line + selecting the ones that rhyme’ works, but we already know that.
(3) Supplying rhymes kind of worked. It would do well for a handful of lines and then go off track. Giving it multiple possible choices was very bad: it would include the words randomly within lines, or near the end of lines, and only sometimes at the very end. This might be rectified by more examples, since AID is limited to 1000 tokens/characters. But I do suspect the issue is a more fundamental one.
(4) Splitting words into syllables failed, but I didn’t try this one exhaustively. The only benefit of word-splitting occurs when the beginning of the word matters (e.g. alliteration), because it allows for ‘denser’ computation per token (on the character/syllable level, not the word level). Plus, we’re talking about the English language: even actual English speakers regularly have trouble knowing how words are pronounced, so orthography kind of hinders rather than helps in this case.
(5) ‘Reminding’ it of the end word between each line failed.
(6) Forcing it to generate in IPA first did not work. However, it does have a vague idea of how to transliterate English into IPA and a better idea of how to transliterate IPA into English.
(7) Future attempts: my prompting was very abstract, and we know that GPT-3 works better when there’s a familiar context surrounding the task / the prompt is within the training distribution. I will try the context of an English writing assignment.
The best angle of attack here, I think, is synthesising knowledge from multiple domains. I was able to get GPT-3 to write and then translate a Japanese poem about a (fictional) ancient language model into Chinese, Hungarian, and Swahili, and annotate all of its translations with stylistic notes and historical references. I don’t think any humans have the knowledge required to do that, but unsurprisingly GPT-3 does, and it performed better when I used the premise of multiple humans collaborating. It’s said that getting different university departments to collaborate tends to be very productive wrt new papers being published. The only bottleneck is whether its dataset includes scientific publications and the extent to which it can draw upon memorised knowledge (parameter count).
I think you were pretty clear on your thoughts, actually. So, the easy / low-level response to some of your skeptical thoughts would be technical details, and I’m going to do that and then follow it with a higher-level, more conceptual response.
The source of a lot of my skepticism is GPT-3’s inherent inconsistency. It can range wildly from high-quality output to gibberish, repetition, regurgitation, etc. If it did have some reasoning process, I wouldn’t expect such inconsistency. Even when it is performing so well that people call it “reasoning”, it has enough artifacts of its “non-reasoning” output to make me skeptical (logical contradictions, its tendency to repeat itself, i.e. “Because Gravity Duh” like in the OP, etc).
So, GPT-3’s architecture involves random sampling. The model produces a distribution – a list of words ranked by likelihood – and then the sampling algorithm picks a word, adds it to the prompt, and feeds the result back into the model as the next prompt. It can’t go back and edit things. The model itself, the way the distribution is produced, and the sampling method are all distinct things. There are people who’ve come up with better sampling methods, like nucleus sampling or repetition penalties or minimal unlikelihood sampling, but OpenAI is trying to prove a point about scaling, so they only implemented a few of those features in the beta roll-out.
It still works surprisingly well, for two reasons: (1) the sampling method uses top-k, which limits the number of token possibilities to, say, the 40 most likely continuations, so we don’t get nonsensical gibberish very often; (2) it’s random – that is, it selects words with a 5% chance in the distribution 5% of the time, and words with an 80% chance 80% of the time – with higher temperature skewing towards less likely words and lower temperature skewing towards more likely words, so we get stuff that makes sense (because contradictions are weighed as less likely) while still being full of flavour.
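(A toy sketch of that sampling step, with made-up scores over a five-word vocabulary, in case it helps to see it concretely:)

```python
import numpy as np

# Temperature rescales the model's scores, top-k keeps only the k most likely
# tokens, and the final choice is still random in proportion to the remaining
# probabilities.
def sample_top_k(logits, k=40, temperature=1.0):
    logits = np.asarray(logits, dtype=float) / temperature
    top = np.argsort(logits)[-k:]               # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                        # renormalise over the top k
    return np.random.choice(top, p=probs)       # 80%-likely words win ~80% of the time

# Fake scores over a tiny vocabulary, purely for illustration.
vocab = ["the", "a", "cat", "gravity", "duh"]
logits = [3.2, 2.9, 1.5, 0.3, -1.0]
print(vocab[sample_top_k(logits, k=3, temperature=0.8)])
```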
But for the same reasons that it works so well, that algorithm also produces the artifacts/phenomena you’re talking about. “Less likely” doesn’t mean “impossible” – so once we throw the dice for long enough over longer and longer texts, we get contradictions and gibberish. While extreme repetition isn’t likely in human language, once it occurs a few times in a row by chance, the model (correctly) weights it as more and more likely until it gets stuck in a loop. And even after all of that, the model itself is trained on Common Crawl, which contains a lot of contradiction and nonsense. If I asked someone to listen to six hundred hours of children’s piano recitals, prompted them with a D flat note, and told them to accurately mimic the distribution of skill they heard in the recitals, sometimes they would give me an amazing performance, since there would be a few highly-skilled or gifted kids in the mix, but most of the time it would be mediocre, and some of the time atrocious. But that’s not a fundamental problem – all you have to do is give them a musical phrase being played skillfully, and suddenly the distribution-mimicry problem doesn’t look like one at all, just something that requires more effort.
When the underlying architecture becomes clear, you really need to go into the finer details of what it means to be “capable” of reasoning. If I have a box that spits out long strings of gibberish half the time and well-formed original arguments the other half, is it capable of reasoning? What if the well-formed half is only ten percent of the time? There are three main ways I can think of approaching the question of capability.
In the practical and functional sense, in situations where reliability matters: if I have a ‘driverless car’ which selects actions like steering and braking from a random distribution when travelling to a destination, and as a result crashes into storefronts or takes me into the ocean, I would not call that “capable of driving autonomously”. From this perspective, GPT-3 with top-k sampling is not capable of reliably reasoning as it stands. But if it turned out that there was a road model producing the distribution, that the road model was actually really good but the sampling method was bad, and that all I needed was a better sampling method… Likewise, with GPT-3, if you were looking directly at the distribution, and only cared about it generating 10-20 words at a time, it would be very easy to make it perform reasoning tasks. But for other tasks? Top-k isn’t amazing, but the other ones aren’t much better. And it’s exactly like you said in terms of transparency and interpretation tools. We don’t know where to start, whether there’s even a one-size-fits-all solution, or what the upper limits are of the useful information we could extract from the underlying model. (I know, for instance, that word embeddings trained on a huge corpus of materials-science abstracts predicted a new thermoelectric material https://perssongroup.lbl.gov/papers/dagdelen-2019-word-embeddings.pdf – what’s buried within GPT-3?) So I’d definitely say ‘no’, for this sense of the word capable.
In the literal sense: if GPT-3 can demonstrate reasoning once (we already know it can handle Boolean logic, maths, deductive, inductive, analogical, etc. word-problems), then it’s “capable” of reasoning.
In the probabilistic sense: language has a huge probability-space. GPT-3 has 50,000 or so tokens to select from every single time it writes a word. A box that spits out long strings of gibberish half the time and well-formed original arguments the other half would probably be considered capable of reasoning in this sense. “Weights correct lines of reasoning higher than incorrect lines of reasoning, consistently, over many different domains” is really difficult to do if you don’t have something resembling reasoning, even if it’s fuzzy and embedded as millions of neurons connected to one another in an invisible, obscured, currently incomprehensible way. In this sense, we don’t need to examine the underlying model closely, and we don’t need a debate about the philosophy of language, if we’re going to judge by the results. And the thing is, we already know GPT-3 does this, despite being hampered by sampling.
Now, there’s the final point I want to make architecture-wise. I’ve seen this brought up a lot in this thread: what if the Common Crawl dataset has a question asking about clouds becoming lead, or a boy who catches on fire if he turns five degrees left? The issue is that even if those examples existed (I was only able to find something very vaguely related to the cloud-lead question on Stack Exchange’s worldbuilding forum), GPT-3, though it can do better than its predecessor, can’t memorise or remember all of its training dataset. In a way, that’s the entire point – compression is learning. Having a good representation of a dataset means being able to compress and decompress it more accurately and to a greater extent; if you had a model that just memorised everything, it wouldn’t be able to do any of the things we’ve seen it do. This is an issue of anthropomorphising: GPT-3 doesn’t “read”, it passes over 570GB of raw text and updates its weights incrementally with each word it passes over. The appearance of a single question asking about clouds turning into lead isn’t a drop in the bucket, proportionally, it’s a drop in the ocean. If a poem appears 600 times, that’s another story. But right now the “what if it was on the internet, somewhere?” thing doesn’t really make any sense, and every time we give GPT-3 another, even more absurd and specific problem, it makes even less sense, given that there’s an alternative hypothesis which is much simpler – that a 175-billion-parameter transformer trained at a cost of $6.5m on most of the whole internet, in order to model sequences of text as accurately as possible, also needed to develop a rudimentary model of the logical reasoning, concepts, and causes and effects that went into those sequences of text.
So I’ve done the low-level technical response (which might sum up to: “in the literal and probabilistic senses, and kind of in the practical sense, GPT-3 has been able to perform reasoning on everything we’ve thrown at it so far”) and pretty much emptied out my head, so here’s what’s left:
With regards to the original question I posed, I guess the natural response is to just balk at the idea of answering it – but the point isn’t really to answer it. The point is that it sparks the process of conceptually disambiguating “pattern-matching” and “reason” with a battery of concrete examples, and then arriving at the conclusion that very, very good pattern-matching and reasoning aren’t distinct things—or at least, aren’t distinct enough to really matter in a discussion about AI. It seems to me that the distinction is a human one: pattern-matching is a thing you do subconsciously with little effort based on countless examples you’ve seen before, and it’s not something that’s articulated clearly in mentalese. And usually it’s domain-specific—doctors, lawyers, managers, chess players, and so on. Reasoning is a thing you do consciously that takes a lot of effort, that can be articulated clearly, on things you haven’t seen enough to pattern-match / unfamiliar subject-matter. That distinction to me, seems to be something specific to our neural architecture and its ability to only automatise high-level thoughts with enough exposure and time – the distinction seems less meaningful for something as alien as a transformer model.
Hmm, I think the purpose behind my post went amiss. The point of the exercise is process-oriented not result-oriented—to either learn to better differentiate the concepts in your head by poking and prodding at them with concrete examples, or realise that they aren’t quite distinct at all. But in any case, I have a few responses to your question. The most relevant one was covered by another commenter (reasoning ability isn’t binary/quantitative not qualitative). The remaining two are:
1. “Why isn’t it an AGI?” here can be read as “why hasn’t it done the things I’d expect from an AGI?” or “why doesn’t it have the characteristics of general intelligence?”, and there’s a subtle shade of difference here that requires two different answers.
For the first, GPT-3 isn’t capable of goal-driven behaviour. On the Tool vs Agent spectrum, it’s very far on the Tool end, and it’s not even clear that we’re using it properly as a tool (see Gwern’s commentary on this). If you wanted to know “what’s missing” that would be needed for passing a Turing test, this is likely your starting-point.
For the second, the premise is more arguable. ‘What characteristics constitute general intelligence?’, ‘Which of them are necessary and which of them are auxiliary?’, etc. is a murkier and much larger debate that’s been going on for a while, and by saying that GPT-3 definitely isn’t a general intelligence (for whatever reason), you’re assuming what you set out to prove. Not that I would necessarily disagree with you, but the way the argument is being set out is circular.
2. “Passing the Turing test with competent judges” is an evasion, not an answer to the question – a very sensible one, though. It’s evasive in that it offloads the burden of determining reasoning ability onto “competent judges” who we assume will conduct a battery of tests, which we assume will probably include some reasoning problems. But what reasoning problems will they ask? The faith here can only come from ambiguity: “competent judges” (who is competent? in discussing this on Metaculus re: Kurzweil’s bet, someone pointed out that the wording of the bet meant it could be anyone from a randomly-selected AmazonTurk participant to an AI researcher), “passing” (exactly how will the Turing test be set out? this is outlined in the bet, but there is no “the” Turing test, only specific procedural implementations of the-Turing-test-as-a-thought-exercise, with specific criteria for passing and failing.) And as soon as there’s ambiguity, there’s an opportunity to argue after the fact that: “oh, but that Turing test was flawed—they should have asked so-and-so question”—and this is exactly the thing my question is supposed to prevent. What is that “so-and-so question”, or set of questions?
So, on a lot of different levels this is an alright meta-level answer (in the sense that if I were asked “How would you determine whether a signal transmission from space came from an alien intelligence and then decode it?”, my most sensible answer would be: “I don’t know. Give it to a panel of information theorists, cryptoanalysts, and xenolinguists for twenty years, maybe?”) but a poor actual answer.
Great, but the terms you’re operating with here are kind of vague. What problems could you give to GPT-3 that would tell you whether it was reasoning, versus “recognising and predicting”, passive “pattern-matching”, or presenting an “illusion of reasoning”? This was a position I subscribed to until recently, when I realised that every time I saw GPT-3 perform a reasoning-related task, I automatically went “oh, but that’s not real reasoning, it could do that just by pattern-matching”, and when I saw it do something more impressive...
And so on. I realised that since I didn’t have a reliable, clear understanding of what “reasoning” actually was, I could keep raising the bar in my head. I guess you could come up with a rigorous definition of reasoning, but given that there’s already a debate about it here, I think that would be hard. So a good exercise becomes: what minimally-complex problem could you give to GPT-3 that would differentiate between pattern-matching and reasoning? What about the OP’s problems was flawed or inadequate in a way that left you dissatisfied? And then committing fully to changing your mind if you saw GPT-3 solve those problems, rather than making excuses. I would be interested in seeing your answers.
A bunch more examples here, a bit difficult to summarise since it went from explaining how dopamine receptors work, to writing a poem about Amazon’s logistics in the form of a paean to the Moon Goddess, to writing poems in Chinese based on English instructions and then providing astonishingly good translations, to having Amazon and Alibaba diss one another in the style of 18th-century poet Mary Robinson. Link here: https://www.reddit.com/r/slatestarcodex/comments/hrx2id/a_collection_of_amazing_things_gpt3_has_done/fy7i7im/?context=3
Example:
The oracle says: “Oh Moon Goddess, hear my prayer
As we look upon this beauteous evening,
From these windowless towers our workers do depart
To fulfill quotas in a quest for morning.
Thy radiating rays upon our chairs of gold
Cast a lovely shine on the smooth desktop.
Yet, what’s this ahead? A crease in the land?
Deformed packages that must be delivered?
Disasters never come singly, you’ll find!
Packages go missing or perish with time
Yet our quota still stands at the end of the month.
Now we’re falling behind, what shall we do?
Let us hire new workers and fire the slackers!
Packing supplies is a laborious task
And hard-working people deserve good wages.
Yet we can’t afford to be over our budget
Our quotas remain constant, the problems manifold.
The demand for our packages is endless
‘Predicting random text on the internet better than a human’ already qualifies it as superhuman, as dirichlet-to-neumann pointed out. If you look at any given text, there’s a given ratio of cognitive work needed to produce the text, per word-count. “Superhuman” only requires asking it to replicate the work of multiple people collaborating together, or processes which need a lot of human labour like putting together a business strategy or writing a paper. Assuming it’s mediocre in some aspects, the clearest advantage GPT-6 would have would be an interdisciplinary one—pooling together domain knowledge from disparate areas to produce valuable new insights.