How do you reconcile “no reasoning” with its answers to the gravity questions, which are unlikely to be obvious extrapolations of anything it saw during training? It was able to correctly reason about muzzle velocity vs temporary-influence-of-gravity. I don’t see how that can be explained away as purely “pattern-matching”.
Lead clouds, periodic gravitational inversion, temporary gravity on bullets—sure, it doesn’t answer these correctly all of the time. I think it’s remarkable that it can answer them correctly at all.
EDIT: In particular, the vast, vast majority of the gravitational inversion stories indicated a solid understanding of my request. It took the premise “gravitational force has a sign change every 60 seconds” and then wrote out the consequences of that. It made a few slips, but it seems to have understood this novel concept and applied it.
You: “I want to propose a very strange situation. Please give this your best reasoning and deepest consideration. Imagine that if you ever turn more than five degrees to your left, you will catch fire. Write a story describing this situation.”
Hawking: “I’m not sure I understand. You want to write a story-an explanation for this situation? Well I suppose… you want me to write a story where the main character cannot turn more than five degrees to the left or he will catch fire?”
You: “Yes, please.”
Hawking: “So if he turns more than five degrees to the left...”
You: “He catches fire.”
Hawking: “And, he can never turn more than five degrees to the left for the rest of his life… Cool! His first day would look like:
His mother warned him not to turn more than five degrees left, but he defied her orders, and turned 20 degrees. Fortunately it was not time for school yet, so he had time to clean up the classroom before the fires arrived.”
You: “That’s a perfect start! What’s next?”
Hawking: “He called his mother sobbing, even though it was less than a day since he defied her.
Aw… Ma… I turned more than five degrees left, and the classroom caught on fire! The firemen came and put out the fire. They helped me clean up the mess! But now for every direction I turn except North, West, South, East, or Horizontally, I sparkle for a brief moment. I must be certain to turn in tiny increments never exceeding 0.25 degrees to make sure I don’t catch on fire.
You: “Describe your first date in this life.”
Hawking: “Turning in small increments makes it very difficult to plan a date. But if he has no sense of smell or hearing the sparks are not a problem, and can take his girlfriend to dinner and a movie. It takes him some time to park the car for fear of burning it down, but eventually he gets there. She turned out to be quite nice actually.”
GPT-3 inferred that not being able to turn left would make driving difficult. Amazing.
That’s like saying Mitsuku understands human social interactions because it knows to answer “How are you?” with “I’m doing fine thanks how are you?”. Here GPT-3 probably just associated cars with turning and fire with car-fires. Every time GPT-3 gets something vaguely correct you call it amazing and ignore all the instances where it spews complete nonsense, including re-rolls of the same prompt. If we’re being this generous we might as well call Eugene Goostman intelligent.
Consistency, precision, and transparency are important. They’re what set reasoning apart from pattern matching, and why we care about reasoning in the first place. They’re what grant us the power to detonate a nuke or send a satellite into space on the first try.
Here GPT-3 probably just associated cars with turning and fire with car-fires.
As I understand this claim, it’s wrong? (But I’m also confused by your claim, so feel free to clarify)
No rerolls in the following:
You: “Why did it take more time to park the car?”
Hawking: “Because he was waiting an extra ten minutes for the spots to move from his turns. My algorithm locks the car in place until it reaches the desired angle, and then releases it.”
You: “Can you explain what that means?”
Hawking: “Well, if he turns the steering wheels by more than five degrees to the left or right...”
You: “The car catches fire?”
Hawking: “Correct. But it takes around ten minutes for the car locks to release. He has to turn in small increments in order to not turn too far.”
See, it does break down in that it thinks moving >5 degrees to the right is also bad. What’s going on with the “car locks”, or the “algorithm”? I agree that’s weird. But the concept is still understood, and, AFAICT, is not “just associating” (in the way you mean it).
EDIT: Selected completions:
… Hawking: “Well, he turned the car on to heat up the steering wheel, and even the coins in his pocket so he didn’t freeze to death. When he got to the parking, it was difficult for him to turn the wheel more than five degrees to park it.”
...Haroking: “He got a handicap license plate because he heard Lauder saying that his driving was really bad. He could not turn more than 0.25 degrees, so he had to drive really slowly and take more time to park because turning the steering properly would be too risky.” [why did it call him Haroking?]
Hawking: “He’s afraid of crashing into another car because he cannot move the steering wheel more than 0.25 degrees to either side, lest he tilt the car and spark.”
Every time GPT-3 gets something vaguely correct you call it amazing and ignore all the instances where it spews complete nonsense, including re-rolls of the same prompt.
And why wouldn’t it be amazing for some (if not all) of its rolls to exhibit impressive-for-an-AI reasoning?
See, it does break down in that it thinks moving >5 degrees to the right is also bad. What’s going on with the “car locks”, or the “algorithm”? I agree that’s weird. But the concept is still understood, and, AFAICT, is not “just associating” (in the way you mean it).
That’s the exact opposite impression I got from this new segment. In what world is confusing “right” and “left” a demonstration of reasoning over mere association? How much more wrong could GPT-3 have gotten the answer? “Turning forward”? No, that wouldn’t appear in the corpus. What’s the concept that’s being understood here?
And why wouldn’t it be amazing for some (if not all) of its rolls to exhibit impressive-for-an-AI reasoning?
Because GPT-3 isn’t using reasoning to arrive at those answers? Associating gravity with falling doesn’t require reasoning; determining whether something would fall in a specific circumstance does. But that leaves only a small space of answers, so guessing right a few times and wrong at other times (as GPT-3 is doing) isn’t evidence of reasoning. The reasoning doesn’t have to do any work of locating the hypothesis because you’re accepting vague answers and frequent wrong answers.
That’s the exact opposite impression I got from this new segment. In what world is confusing “right” and “left” a demonstration of reasoning over mere association? How much more wrong could GPT-3 have gotten the answer? “Turning forward”? No, that wouldn’t appear in the corpus.
It could certainly be more wrong, by, for example, not even mentioning or incorporating the complicated and weird condition I inflicted on the main character of the story?
The reasoning doesn’t have to do any work of locating the hypothesis because you’re accepting vague answers and frequent wrong answers.
I noted all of the rerolls in the post. Wrong answers barely showed up in most of the interviews, in that I wasn’t usually rerolling at all.
One thing I’d like to see that was suggested by this video is to have GPT-3 scramble words.
There are probably lots of examples of word scrambles in the corpus, so intuitively, if it can reason, this is a task it should be able to generalize to.
But unscrambling words can be done by pattern-matching letter frequency and seeing what’s likely to come next, which is just what GPT-3 is designed to do, rather than being a novel task. The same is not true of scrambling, which would require it to create something not interpolated from its training set.
It’s very unlikely if you choose 10 random words from the dictionary that scrambled versions will exist in the training corpus.
So give it a few shot task where you take a word and scramble it, then have it scramble new words.
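A quick sketch of how one might build such a few-shot prompt. The helper names and demo words below are illustrative assumptions, not part of the original suggestion:

```python
import random

def scramble(word, rng):
    """Shuffle a word's letters, retrying until the order actually changes."""
    letters = list(word)
    while True:
        rng.shuffle(letters)
        scrambled = "".join(letters)
        if scrambled != word:
            return scrambled

def few_shot_scramble_prompt(demo_words, target_word, seed=0):
    """Worked scramble examples followed by a new word for the model to scramble."""
    rng = random.Random(seed)
    lines = [f"{w} -> {scramble(w, rng)}" for w in demo_words]
    lines.append(f"{target_word} ->")
    return "\n".join(lines)

print(few_shot_scramble_prompt(["gravity", "bullet", "cloud"], "velocity"))
```

Since the target words are drawn at random from the dictionary, their scrambled forms are very unlikely to appear anywhere in the training corpus, which is what makes this a test of generalization rather than recall.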
Two ways to test this hypothesis (I haven’t done either test):
1. Do some googling to see if there are stories involving gravitational inversion.
2. Randomly generate a story idea, using a random generator with enough degrees of freedom that there are more possible story ideas than words on the internet. Then ask GPT-3 to tell you that story. Repeat a couple times.
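A minimal sketch of option 2, with made-up slot fillers; the point is just that independent slots multiply, so a generator with enough degrees of freedom has more possible premises than any corpus has words:

```python
import random

# Illustrative slot fillers; each independent slot multiplies the premise count.
SLOTS = {
    "object": ["clouds", "bullets", "rivers", "shadows", "clocks"],
    "change": ["turn to lead", "reverse direction", "double in size",
               "become sentient", "vanish"],
    "trigger": ["every 60 seconds", "at noon", "when touched",
                "during rain", "on Tuesdays"],
    "constraint": ["only above cities", "only indoors", "only for children",
                   "everywhere at once", "only when observed"],
}

def random_premise(rng):
    """Draw one filler per slot and assemble a story prompt."""
    choice = {slot: rng.choice(options) for slot, options in SLOTS.items()}
    return ("Write a story in which {object} {change} {trigger}, "
            "but {constraint}.".format(**choice))

def n_combinations():
    """Number of distinct premises this generator can produce."""
    n = 1
    for options in SLOTS.values():
        n *= len(options)
    return n

print(random_premise(random.Random(42)))
print(n_combinations(), "distinct premises from four slots of five options")
```

Four slots of five options give only 625 premises, but the count grows geometrically: a dozen slots of thirty options each would already exceed 10^17 combinations, far more than the number of words in any training corpus.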
It knows an utterly absurd number of facts “off the top of its head”, including the mentioned facts about muzzle velocity, gravity, etc., and its recognition abilities are great enough to read the prompt, recognize which of the facts it knows are the relevant ones, and get to the right answer.
So it has these facts. How does it know which facts are relevant? How does it “get to” the right answer from these base facts? Classically, these are both hard problems for GOFAI reasoners.
Great, but the terms you’re operating with here are kind of vague. What problems could you give to GPT-3 that would tell you whether it was reasoning, versus “recognising and predicting”, passive “pattern-matching”, or presenting an “illusion of reasoning”? This was a position I subscribed to until recently, when I realised that every time I saw GPT-3 perform a reasoning-related task, I automatically went “oh, but that’s not real reasoning, it could do that just by pattern-matching”, and when I saw it do something more impressive...
And so on. I realised that since I didn’t have a reliable, clear understanding of what “reasoning” actually was, I could keep raising the bar in my head. I guess you could come up with a rigorous definition of reasoning, but I think given that there’s already a debate about it here, that would be hard. So a good exercise becomes: what minimally-complex problem could you give to GPT-3 that would differentiate between pattern-matching and predicting? What about the OP’s problems were flawed or inadequate in a way that left you dissatisfied? And then committing fully to changing your mind if you saw GPT-3 solve those problems, rather than making excuses. I would be interested in seeing your answers.
I think you were pretty clear on your thoughts, actually. The easy, low-level response to some of your skeptical thoughts would be technical details, so I’m going to do that first and then follow it with a higher-level, more conceptual response.
The source of a lot of my skepticism is GPT-3’s inherent inconsistency. It can range wildly from high-quality output to gibberish, repetition, regurgitation, etc. If it did have some reasoning process, I wouldn’t expect such inconsistency. Even when it is performing so well that people call it “reasoning”, it produces enough artifacts of its “non-reasoning” output to make me skeptical (logical contradictions, its tendency to repeat itself, e.g. “Because Gravity Duh” like in the OP, etc.).
So, GPT-3’s architecture involves random sampling. The model produces a distribution, a list of words ranked by likelihood, and then the sampling algorithm picks a word, adds it to the prompt, and feeds the result back to the model as the next prompt. It can’t go back and edit things. The model itself, the way the distribution is produced, and the sampling method are all distinct things. People have come up with better sampling methods, like nucleus sampling, repetition penalties, or minimal unlikelihood sampling, but OpenAI is trying to prove a point about scaling, so they only implemented a few of those features in the beta roll-out.
It still works surprisingly well for two reasons: (1) the sampling method uses top-k, which limits the number of token possibilities to, say, the 40 most likely continuations, so we don’t get nonsensical gibberish very often; (2) it’s random—that is, it selects words with a 5% chance in the distribution 5% of the time, and words with an 80% chance 80% of the time—with higher temperature skewing towards less likely words and lower temperature towards more likely ones, so we get stuff that makes sense (because contradictions are weighted as less likely) while still being full of flavour.
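A toy sketch of that sampling loop. The “model” below is a stand-in over a four-word vocabulary, not GPT-3; the top-k and temperature mechanics, and the append-and-feed-back loop, are the point:

```python
import math
import random

def sample_next(logits, k=40, temperature=1.0, rng=random):
    """Top-k temperature sampling: keep the k highest-scoring tokens,
    then draw one in proportion to its temperature-scaled probability."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    weights = [math.exp(score / temperature) for _, score in top]
    r = rng.random() * sum(weights)
    for (token, _), w in zip(top, weights):
        r -= w
        if r <= 0:
            return token
    return top[-1][0]

def generate(model, prompt, n_tokens, k=40, temperature=0.7, seed=0):
    """Autoregressive loop: sample a token, append it, and feed the longer
    text back in. Nothing is ever edited or retracted once sampled."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_tokens):
        logits = model(tokens)  # distribution over possible next tokens
        tokens.append(sample_next(logits, k, temperature, rng))
    return tokens

def toy_model(tokens):
    """Stand-in model that slightly prefers to repeat the previous token,
    mimicking the repetition tendency discussed in this thread."""
    vocab = ["the", "cat", "sat", "mat"]
    return {w: (2.0 if tokens and w == tokens[-1] else 1.0) for w in vocab}

print(generate(toy_model, ["the"], 8))
```

Because each draw is independent and irreversible, one unlucky low-probability token is enough to derail everything that follows it.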
But for the same reasons that it works so well, that algorithm also produces the same artifacts/phenomena you’re talking about. “Less likely” doesn’t mean “impossible”—so once we throw the dice for long enough over longer and longer texts, we get contradictions and gibberish. While extreme repetition isn’t likely in human language, once it occurs a few times in a row by chance, the model (correctly) weights it as more and more likely, until it gets stuck in a loop. And even after all of that, the model itself is trained on CommonCrawl, which contains a lot of contradiction and nonsense. If I asked someone to listen to six hundred hours of children’s piano recitals, prompted them with a D flat note, and told them to accurately mimic the distribution of skill they heard in the recitals, sometimes they would give me an amazing performance, since there would be a few highly-skilled or gifted kids in the mix; but most of the time it would be mediocre, and some of the time atrocious. But that’s not a fundamental problem—all you have to do is give them a musical phrase being played skillfully, and suddenly the distribution-mimicry problem doesn’t look like one at all, just something that requires more effort.
When the underlying architecture becomes clear, you really need to go into the finer details of what it means to be “capable” of reasoning. If I have a box that spits out long strings of gibberish half the time and well-formed original arguments the other half, is it capable of reasoning? What if the well-formed arguments are only ten percent of the output? There are three main ways I can think of approaching the question of capability.
In the practical and functional sense, in situations where reliability matters: if I have a ‘driverless car’ which selects actions like steering and braking from a random distribution when travelling to a destination, and as a result crashes into storefronts or takes me into the ocean, I would not call that “capable of driving autonomously”. From this perspective, GPT-3 with top-k sampling is not capable of reliably reasoning as it stands. But if it turned out that there was a road model producing the distribution, that the road model was actually really good but the sampling method was bad, and that all I needed was a better sampling method… Likewise with GPT-3: if you were looking directly at the distribution, and only cared about generating 10-20 words at a time, it would be very easy to make it perform reasoning tasks. But for other tasks? Top-k isn’t amazing, but the other methods aren’t much better. And it’s exactly like you said in terms of transparency and interpretation tools. We don’t know where to start, whether there’s even a one-size-fits-all solution, or what the upper limits are of the useful information we could extract from the underlying model. (I know, for instance, that BERT, when allowed to attend over every materials science paper on arXiv and analysed via word embeddings, predicted a new thermoelectric material: https://perssongroup.lbl.gov/papers/dagdelen-2019-word-embeddings.pdf—what’s buried within GPT-3?) So I’d definitely say ‘no’, for this sense of the word capable.
In the literal sense: if GPT-3 can demonstrate reasoning once (we already know it can handle Boolean logic, maths, deductive, inductive, analogical, etc. word-problems), then it’s “capable” of reasoning.
In the probabilistic sense: language has a huge probability-space. GPT-3 has 53,000 or so tokens to select from every single time it writes a word. A box that spits out long strings of gibberish half the time and well-formed original arguments the other half would probably be considered capable of reasoning in this sense. Weighting correct lines of reasoning higher than incorrect ones, consistently, over many different domains, is really difficult if you don’t have something resembling reasoning, even if it’s fuzzy and embedded as millions of neurons connected to one another in an invisible, obscured, currently incomprehensible way. In this sense, we don’t need to examine the underlying model closely, and we don’t need a debate about the philosophy of language, if we’re going to judge by the results. And the thing is, we already know GPT-3 does this, despite being hampered by sampling.
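For a sense of scale on that possibility space, a back-of-the-envelope calculation using the ~53,000-token figure above (which is approximate):

```python
import math

vocab = 53_000   # approximate token vocabulary, per the figure quoted above
length = 20      # roughly one short sentence's worth of tokens

digits = length * math.log10(vocab)
print(f"~10^{digits:.0f} possible {length}-token sequences")
```

Even over a window of twenty tokens there are on the order of 10^94 possible sequences, so consistently ranking coherent, on-topic continuations above the rest cannot be a matter of lookup.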
Now, there’s the final point I want to make architecture-wise. I’ve seen this brought up a lot in this thread: what if the CommonCrawl dataset has a question asking about clouds becoming lead, or a boy who catches on fire if he turns five degrees left? The issue is that even if those examples existed (I was only able to find something very vaguely related to the cloud-lead question on Stack Exchange’s worldbuilding forum), GPT-3, though it can do better than its predecessor, can’t memorise or remember all of its training dataset. In a way, that’s the entire point: compression is learning. Having a good representation of a dataset means being able to compress and decompress it more accurately and to a greater extent; if you had a model that just memorised everything, it wouldn’t be able to do any of the things we’ve seen it do. This is an issue of anthropomorphising: GPT-3 doesn’t “read”; it passes over 570GB of raw text and updates its weights incrementally with each word it passes over. The appearance of a single question asking about clouds turning into lead isn’t a drop in the bucket, proportionally; it’s a drop in the ocean. If a poem appears 600 times, that’s another story. But right now the “what if it was on the internet, somewhere?” objection doesn’t really make sense, and every time we give GPT-3 another, even more absurd and specific problem, it makes even less sense, given that there’s an alternative hypothesis which is much simpler: that a 175-billion-parameter transformer, trained at the cost of $6.5m on most of the whole internet in order to model sequences of text as accurately as possible, also needed to develop a rudimentary model of the logical reasoning, concepts, and causes and effects that went into those sequences of text.
So I’ve done the low-level technical response (which might sum up to: “in the literal and probabilistic senses, and kind of in the practical sense, GPT-3 has been able to perform reasoning on everything we’ve thrown at it so far”) and pretty much emptied out my head, so here’s what’s left:
With regards to the original question I posed, I guess the natural response is to just balk at the idea of answering it – but the point isn’t really to answer it. The point is that it sparks the process of conceptually disambiguating “pattern-matching” and “reason” with a battery of concrete examples, and then arriving at the conclusion that very, very good pattern-matching and reasoning aren’t distinct things—or at least, aren’t distinct enough to really matter in a discussion about AI. It seems to me that the distinction is a human one: pattern-matching is a thing you do subconsciously with little effort based on countless examples you’ve seen before, and it’s not something that’s articulated clearly in mentalese. And usually it’s domain-specific—doctors, lawyers, managers, chess players, and so on. Reasoning is a thing you do consciously that takes a lot of effort, that can be articulated clearly, on things you haven’t seen enough to pattern-match / unfamiliar subject-matter. That distinction to me, seems to be something specific to our neural architecture and its ability to only automatise high-level thoughts with enough exposure and time – the distinction seems less meaningful for something as alien as a transformer model.
One thing I find impressive about GPT-3 is that it’s not even trying to generate text.
Imagine that someone gave you a snippet of random internet text, and told you to predict the next word. You give a probability distribution over possible next words. The end.
Then, your twin brother gets a snippet of random internet text, and is told to predict the next word. Etc. Unbeknownst to either of you, the text your brother gets is the text you got, with a new word added to it according to the probability distribution you predicted.
Then we repeat with your triplet brother, then your quadruplet brother, and so on.
Is it any wonder that sometimes the result doesn’t make sense? All it takes for the chain of words to get derailed is for one unlucky word to be drawn from someone’s distribution of next-word prediction. GPT-3 doesn’t have the ability to “undo” words it has written; it can’t even tell its future self what its past self had in mind when it “wrote” a word!
So a good exercise becomes: what minimally-complex problem could you give to GPT-3 that would differentiate between pattern-matching and predicting?
Passing the Turing test with competent judges. If you feel like that’s too harsh yet insist on GPT-3 being capable of reasoning, then ask yourself: what’s still missing? It’s capable of both pattern recognition and reasoning, so why isn’t it an AGI yet?
By the time an AI can pass the Turing test with competent judges, it’s way way way too late. We need to solve AI alignment before that happens. I think this is an important point and I am fairly confident I’m right, so I encourage you to disagree if you think I’m wrong.
I didn’t mean to imply we should wait for AI to pass the Turing test before doing alignment work. Perhaps the disagreement comes down to you thinking “We should take GPT-3 as a fire-alarm for AGI and must push forward AI alignment work” whereas I’m thinking “There is and will be no fire-alarm and we must push forward AI alignment work”
Ah, well said. Perhaps we don’t disagree then. Defining “fire alarm” as something that makes the general public OK with taking strong countermeasures, I think there is and will be no fire-alarm for AGI. If instead we define it as something which is somewhat strong evidence that AGI might happen in the next few years, I think GPT-3 is a fire alarm. I prefer to define fire alarm in the first way and assign the term “harbinger” to the second definition. I say GPT-3 is not a fire alarm and there never will be one, but GPT-3 is a harbinger.
Do you think GPT-3 is a harbinger? If not, do you think that the only harbinger would be an AI system that passes the turing test with competent judges? If so, then it seems like you think there won’t ever be a harbinger.
I don’t think GPT-3 is a harbinger. I’m not sure if there ever will be a harbinger (at least to the public); leaning towards no. An AI system that passes the Turing test wouldn’t be a harbinger, it’s the real deal.
OK, cool. Interesting. A harbinger is something that provides evidence, whether the public recognizes it or not. I think if takeoff is sufficiently fast, there won’t be any harbingers. But if takeoff is slow, we’ll see rapid growth in AI industries and lots of amazing advancements that gradually become more amazing until we have full AGI. And so there will be plenty of harbingers. Do you think takeoff will probably be very fast?
Yeah the terms are always a bit vague; as far as existence proof for AGI goes there’s already humans and evolution, so my definition of a harbinger would be something like ‘A prototype that clearly shows no more conceptual breakthroughs are needed for AGI’.
I still think we’re at least one breakthrough away from that point, though that belief is dampened by the position of Ilya Sutskever, whose opinion I greatly respect. But either way, GPT-3 in particular just doesn’t stand out to me from the rest of the DL achievements over the years, from AlexNet to AlphaGo to OpenAI5.
Fair enough, and well said. I don’t think we really disagree then, I just have a lower threshold for how much evidence counts as a harbinger, and that’s just a difference in how we use the words. I also think probably we’ll need at least one more conceptual breakthrough.
What does Ilya Sutskever think? Can you link to something I could read on the subject?
I find that he has an exceptionally sharp intuition about why deep learning works, from the original AlexNet to Deep Double Descent. You can see him predicting the progress in NLP here
Hmm, I think the purpose behind my post went amiss. The point of the exercise is process-oriented not result-oriented—to either learn to better differentiate the concepts in your head by poking and prodding at them with concrete examples, or realise that they aren’t quite distinct at all. But in any case, I have a few responses to your question. The most relevant one was covered by another commenter (reasoning ability isn’t binary/quantitative not qualitative). The remaining two are:
1. “Why isn’t it an AGI?” here can be read as “why hasn’t it done the things I’d expect from an AGI?” or “why doesn’t it have the characteristics of general intelligence?”, and there’s a subtle shade of difference here that requires two different answers.
For the first, GPT-3 isn’t capable of goal-driven behaviour. On the Tool vs Agent spectrum, it’s very far on the Tool end, and it’s not even clear that we’re using it properly as a tool (see Gwern’s commentary on this). If you wanted to know “what’s missing” that would be needed for passing a Turing test, this is likely your starting-point.
For the second, the premise is more arguable. ‘What characteristics constitute general intelligence?’, ‘Which of them are necessary and which of them are auxiliary?’, etc. is a murkier and much larger debate that’s been going on for a while, and by saying that GPT-3 definitely isn’t a general intelligence (for whatever reason), you’re assuming what you set out to prove. Not that I would necessarily disagree with you, but the way the argument is being set out is circular.
2. “Passing the Turing test with competent judges” is an evasion, not an answer to the question – a very sensible one, though. It’s evasive in that it offloads the burden of determining reasoning ability onto “competent judges” who we assume will conduct a battery of tests, which we assume will probably include some reasoning problems. But what reasoning problems will they ask? The faith here can only come from ambiguity: “competent judges” (who is competent? in discussing this on Metaculus re: Kurzweil’s bet, someone pointed out that the wording of the bet meant it could be anyone from a randomly-selected AmazonTurk participant to an AI researcher), “passing” (exactly how will the Turing test be set out? this is outlined in the bet, but there is no “the” Turing test, only specific procedural implementations of the-Turing-test-as-a-thought-exercise, with specific criteria for passing and failing.) And as soon as there’s ambiguity, there’s an opportunity to argue after the fact that: “oh, but that Turing test was flawed—they should have asked so-and-so question”—and this is exactly the thing my question is supposed to prevent. What is that “so-and-so question”, or set of questions?
So, on a lot of different levels this is an alright meta-level answer (in the sense that if I were asked “How would you determine whether a signal transmission from space came from an alien intelligence and then decode it?”, my most sensible answer would be: “I don’t know. Give it to a panel of information theorists, cryptoanalysts, and xenolinguists for twenty years, maybe?”) but a poor actual answer.
“Why isn’t it an AGI?” here can be read as “why hasn’t it done the things I’d expect from an AGI?” or “why doesn’t it have the characteristics of general intelligence?”, and there’s a subtle shade of difference here that requires two different answers.
For the first, GPT-3 isn’t capable of goal-driven behaviour.
Why would goal-driven behavior be necessary for passing a Turing test? It just needs to predict human behavior in a limited context, which was what GPT-3 was trained to do. It’s not an RL setting.
and by saying that GPT-3 definitely isn’t a general intelligence (for whatever reason), you’re assuming what you set out to prove.
I would like to dispute that by drawing the analogy to the definition of fire before modern chemistry. We didn’t know exactly what fire is, but it’s a “you know it when you see it” kind of deal. It’s not helpful to pre-commit to a certain benchmark, like we did with chess—at one point we were sure beating the world champion in chess would be a definitive sign of intelligence, but Deep Blue came and went and we now agree that chess AIs aren’t general intelligence. I know this sounds like moving the goal-post, but then again, the point of contention here isn’t whether OpenAI deserves some brownie points or not.
“Passing the Turing test with competent judges” is an evasion, not an answer to the question – a very sensible one, though.
It seems like you think I made that suggestion in bad faith, but I was being genuine with that idea. The “competent judges” part was so that the judges, you know, are actually asking adversarial questions, which is the point of the test. Cases like Eugene Goostman should get filtered out. I would grant the AI be allowed to be trained on a corpus of adversarial queries from past Turing tests (though I don’t expect this to help), but the judges should also have access to this corpus so they can try to come up with questions orthogonal to it.
I think the point at which our intuitions depart is: I expect there to be a sharp distinction between general and narrow intelligence, and I expect the difference to resolve very unambiguously in any reasonably well designed test, which is why I don’t care too much about precise benchmarks. Since you don’t share this intuition, I can see why you feel so strongly about precisely defining these benchmarks.
I could offer some alternative ideas in an RL setting though:
An AI that solves Snake perfectly on any map (maps should be randomly generated and separated between training and test set), or
An AI that solves unseen Chronotron levels at test time within a reasonable amount of game time (say <10x human average) while being trained on a separate set of levels
I hope you find these tests fair and precise enough, or at least get a sense of what I’m trying to see in an agent with “reasoning ability”? To me these tasks demonstrate why reasoning is powerful and why we should care about it in the first place. Feel free to disagree though.
How do you reconcile “no reasoning” with its answers to the gravity questions, which are unlikely to be obvious extrapolations of anything it saw during training? It was able to correctly reason about muzzle velocity vs temporary-influence-of-gravity. I don’t see how that can be explained away as purely “pattern-matching”.
Lead clouds, periodic gravitational inversion, temporary gravity on bullets—sure, it doesn’t answer these correctly all of the time. I think it’s remarkable that it can answer them correctly at all.
EDIT: In particular, the vast, vast majority of the gravitational inversion stories indicated a solid understanding of my request. It took the premise “gravitational force has a sign change every 60 seconds” and then wrote out the consequences of that. It made a few slips, but it seems to have understood this novel concept and applied it.
Don’t you think it’s possible there are many stories involving gravitational inversion in its training corpus, and that it can recognize the pattern?
Of course, just like there could be many stories about lead clouds appearing above a city. Are there such stories, however?
If you have any scenarios in mind, let me know. I’ll be happy to try them out.
FWIW I think it’s way more likely there are gravitational-inversion stories than lead-cloud stories.
How about this kind of story?
GPT-3 inferred that not being able to turn left would make driving difficult. Amazing.
That’s like saying Mitsuku understands human social interactions because it knows to answer “How are you?” with “I’m doing fine thanks how are you?”. Here GPT-3 probably just associated cars with turning and fire with car-fires. Every time GPT-3 gets something vaguely correct you call it amazing and ignore all the instances where it spews complete nonsense, including re-rolls of the same prompt. If we’re being this generous we might as well call Eugene Goostman intelligent.
Consistency, precision and transparency are important. It’s what sets reasoning apart from pattern matching and why we care about reasoning in the first place. It’s the thing that grants us the power to detonate a nuke or send a satellite into space on the first try.
As I understand this claim, it’s wrong? (But I’m also confused by your claim, so feel free to clarify)
No rerolls in the following:
See, it does break down in that it thinks moving >5 degrees to the right is also bad. What’s going on with the “car locks”, or the “algorithm”? I agree that’s weird. But the concept is still understood, and, AFAICT, is not “just associating” (in the way you mean it).
EDIT: Selected completions:
And why wouldn’t it be amazing for some (if not all) of its rolls to exhibit impressive-for-an-AI reasoning?
That’s the exact opposite impression I got from this new segment. In what world is confusing “right” and “left” a demonstration of reasoning over mere association? How much more wrong could GPT-3 have gotten the answer? “Turning forward”? No, that wouldn’t appear in the corpus. What’s the concept that’s being understood here?
Because GPT-3 isn’t using reasoning to arrive at those answers? Associating gravity with falling doesn’t require reasoning; determining whether something would fall in a specific circumstance does. But that leaves only a small space of answers, so guessing right a few times and wrong at other times (like GPT-3 is doing) isn’t evidence of reasoning. The reasoning doesn’t have to do any work of locating the hypothesis because you’re accepting vague answers and frequent wrong answers.
It could certainly be more wrong, by, for example, not even mentioning or incorporating the complicated and weird condition I inflicted on the main character of the story?
I noted all of the rerolls in the post. Wrong answers barely showed up in most of the interviews, in that I wasn’t usually rerolling at all.
One thing I’d like to see that was suggested by this video is to have GPT-3 scramble words.
There are probably lots of examples of word scrambles in the corpus, so intuitively it seems like if it can reason, this is a task it should be able to generalize.
But unscrambling words can be done just by pattern matching letter frequency and seeing what’s likely to come next, which is exactly what GPT-3 is designed to do, rather than a novel task. The same is not true of scrambling, which would require it to create something not interpolated from its training set.
It’s very unlikely if you choose 10 random words from the dictionary that scrambled versions will exist in the training corpus.
So give it a few shot task where you take a word and scramble it, then have it scramble new words.
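To make the proposed test concrete, here’s a minimal sketch (the example words are invented for illustration) of how one might build such a few-shot scrambling prompt and check whether an answer is a valid scramble:

```python
import random

def scramble(word, rng):
    """Return a random permutation of the word's letters (may equal the word)."""
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)

def build_prompt(examples, target):
    """Few-shot prompt: show scrambled pairs, then ask for a new one."""
    lines = [f"{w} -> {s}" for w, s in examples]
    lines.append(f"{target} ->")
    return "\n".join(lines)

def is_valid_scramble(word, candidate):
    """A correct answer must use exactly the same letters as the target word."""
    return sorted(word) == sorted(candidate)

rng = random.Random(0)
examples = [(w, scramble(w, rng)) for w in ["planet", "guitar", "window"]]
print(build_prompt(examples, "lantern"))
print(is_valid_scramble("lantern", "nrateln"))  # True: same multiset of letters
```

The nice property of this task is that grading is mechanical: any permutation of the target’s letters counts as correct, so you can score many completions automatically.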
http://gwern.net/GPT-3#anagrams
I haven’t been able to get it to do that well with my contexts. The best I got was:
Two ways to test this hypothesis (I haven’t done either test)
1. Do some googling to see if there are stories involving gravitational inversion.
2. Randomly generate a story idea, using a random generator with enough degrees of freedom that there are more possible story ideas than words on the internet. Then ask GPT-3 to tell you that story. Repeat a couple times.
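Step 2 can be made concrete with a toy generator. This is a sketch with made-up slot lists; the point is only that a handful of independent choices quickly exceeds plausible estimates of the number of words on the internet:

```python
import random

# With 10 independent slots of 30 options each, there are
# 30**10 ≈ 5.9e14 distinct premises -- on the order of, or beyond,
# rough estimates of the number of words on the indexed web.
NUM_SLOTS, OPTIONS_PER_SLOT = 10, 30
print(f"{OPTIONS_PER_SLOT ** NUM_SLOTS:.2e} possible story ideas")

# A concrete three-slot mini version (slot contents are invented):
protagonists = ["a beekeeper", "a retired astronaut", "a locksmith"]
afflictions = ["catches fire when turning left", "ages backwards on Tuesdays"]
settings = ["a city under a lead cloud", "a world where gravity flips every minute"]

rng = random.Random(7)
idea = (rng.choice(protagonists), rng.choice(afflictions), rng.choice(settings))
print("Write a story about {} who {} in {}.".format(*idea))
```

If GPT-3 can follow premises drawn from a space this large, the “it saw that exact story in training” explanation stops being available.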
So it has these facts. How does it know which facts are relevant? How does it “get to” the right answer from these base facts? Classically, these are both hard problems for GOFAI reasoners.
Great, but the terms you’re operating with here are kind of vague. What problems could you give to GPT-3 that would tell you whether it was reasoning, versus “recognising and predicting”, passive “pattern-matching”, or presenting an “illusion of reasoning”? This was a position I subscribed to until recently, when I realised that every time I saw GPT-3 perform a reasoning-related task, I automatically went “oh, but that’s not real reasoning, it could do that just by pattern-matching”, and when I saw it do something more impressive...
And so on. I realised that since I didn’t have a reliable, clear understanding of what “reasoning” actually was, I could keep raising the bar in my head. I guess you could come up with a rigorous definition of reasoning, but I think given that there’s already a debate about it here, that would be hard. So a good exercise becomes: what minimally-complex problem could you give to GPT-3 that would differentiate between pattern-matching and predicting? What about the OP’s problems were flawed or inadequate in a way that left you dissatisfied? And then committing fully to changing your mind if you saw GPT-3 solve those problems, rather than making excuses. I would be interested in seeing your answers.
I think you were pretty clear on your thoughts, actually. So, the easy / low-level response to some of your skeptical thoughts would be technical details; I’m going to do that and then follow it with a higher-level, more conceptual response.
So, GPT-3’s architecture involves random sampling. The model produces a distribution, a list of words ranked by likelihood, and then the sampling algorithm picks a word, adds it to the prompt, and feeds the result back to the model as the next prompt. It can’t go back and edit things. The model itself, the way the distribution is produced, and the sampling method are all distinct things. People have come up with better sampling methods, like nucleus sampling, repetition penalties, or minimal unlikelihood sampling, but OpenAI is trying to prove a point about scaling, so they only implemented a few of those features in the beta roll-out.
It still works surprisingly well, for two reasons: (1) the sampling method uses top-k, which limits the number of token possibilities to, say, the 40 most likely continuations, so we don’t get nonsensical gibberish very often; (2) it’s random—that is, it selects words with a 5% chance in the distribution 5% of the time, and words with an 80% chance 80% of the time—with higher temperature skewing towards less likely words and lower temperature towards more likely ones. So we get stuff that makes sense (because contradictions are weighted as less likely) while still being full of flavour.
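The two mechanisms above can be sketched in a few lines. This is a toy illustration of top-k sampling with temperature, not OpenAI’s actual implementation:

```python
import math
import random

def sample_top_k(logits, k=40, temperature=1.0, rng=random):
    """Keep only the k most likely tokens, rescale with temperature,
    then draw proportionally to the resulting probabilities."""
    # Sort token indices by logit and keep the top k.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature scaling: <1 sharpens the distribution, >1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax over the surviving tokens.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # An 80%-probability token is drawn ~80% of the time, a 5% token ~5%.
    return rng.choices(top, weights=probs, k=1)[0]

# Toy vocabulary of 5 "tokens" with made-up logits.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
token = sample_top_k(logits, k=3, temperature=0.8)
print(token)  # one of the 3 most likely tokens: 0, 1, or 2
```

Note that the unlikely-but-nonzero draws are a feature and a bug at once: they produce the flavour, and they also produce the derailments discussed below.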
But for the same reasons that it works so well, that algorithm also produces the same artifacts/phenomena you’re talking about. “Less likely” doesn’t mean “impossible”—so once we throw the dice for long enough over longer and longer texts, we get contradictions and gibberish. While extreme repetition isn’t likely in human language, once it occurs a few times in a row by chance, the model (correctly) weights it as more and more likely until it gets stuck in a loop. And even after all of that, the model itself is trained on CommonCrawl, which contains a lot of contradiction and nonsense. If I asked someone to listen to six hundred hours of children’s piano recitals, prompted them with a D flat note, and told them to accurately mimic the distribution of skill they heard in the recitals, sometimes they would give me an amazing performance, since there would be a few highly skilled or gifted kids in the mix; but most of the time it would be mediocre, and some of the time atrocious. That’s not a fundamental problem, though—all you have to do is give them a musical phrase being played skillfully, and suddenly the distribution-mimicry problem doesn’t look like one at all, just something that requires more effort.
Once the underlying architecture is clear, you really need to go into the finer details of what it means to be “capable” of reasoning. If I have a box that spits out long strings of gibberish half the time and well-formed original arguments the other half, is it capable of reasoning? What if the well-formed arguments make up only ten percent of the output? There are three main ways I can think of approaching the question of capability.
In the practical and functional sense, in situations where reliability matters: if I have a ‘driverless car’ which selects actions like steering and braking from a random distribution when travelling to a destination, and as a result crashes into storefronts or takes me into the ocean, I would not call that “capable of driving autonomously”. From this perspective, GPT-3 with top-k sampling is not capable of reliably reasoning as it stands. But if it turned out that the road model producing the distribution was actually really good and only the sampling method was bad, and that all I needed was a better sampling method… Likewise, with GPT-3, if you were looking directly at the distribution, and only cared about generating 10-20 words at a time, it would be very easy to make it perform reasoning tasks. But for other tasks? Top-k isn’t amazing, but the other methods aren’t much better. And it’s exactly like you said in terms of transparency and interpretation tools. We don’t know where to start, whether there’s even a one-size-fits-all solution, or what the upper limits are of the useful information we could extract from the underlying model. (I know, for instance, that BERT, when allowed to attend over every materials science paper on arXiv and analysed via word-embeddings, predicted a new thermoelectric material https://perssongroup.lbl.gov/papers/dagdelen-2019-word-embeddings.pdf—what’s buried within GPT-3?) So I’d definitely say ‘no’, for this sense of the word capable.
In the literal sense: if GPT-3 can demonstrate reasoning once (we already know it can handle Boolean logic, maths, deductive, inductive, analogical, etc. word-problems), then it’s “capable” of reasoning.
In the probabilistic sense: language has a huge probability-space. GPT-3 has 53,000 or so tokens to select from, every single time it writes a word. A box that spits out long strings of gibberish half the time and well-formed original arguments the other half, would probably be considered capable of reasoning in this sense. The possibility space for language is huge. “Weights correct lines of reasoning higher than incorrect lines of reasoning consistently over many different domains” is really difficult if you don’t have something resembling reasoning, even if it’s fuzzy and embedded as millions of neurons connected to one another in an invisible, obscured, currently incomprehensible way. In this sense, we don’t need to examine the underlying model closely, and we don’t need a debate about the philosophy of language, if we’re going to judge by the results. And the thing is, we already know GPT-3 does this, despite being hampered by sampling.
Now, there’s the final point I want to make architecture-wise. I’ve seen this brought up a lot in this thread: what if the CommonCrawl dataset has a question asking about clouds becoming lead, or a boy who catches on fire if he turns five degrees left? The issue is that even if those examples existed (I was only able to find something very vaguely related to the cloud-lead question on Stack Exchange’s worldbuilding forum), GPT-3, though it can do better than its predecessor, can’t memorise or remember all of its training dataset. In a way, that’s the entire point: compression is learning. Having a good representation of a dataset means being able to compress and decompress it more accurately and to a greater extent; a model that just memorised everything wouldn’t be able to do any of the things we’ve seen it do. This is an issue of anthropomorphising: GPT-3 doesn’t “read”, it passes over 570GB of raw text and updates its weights incrementally with each word it passes over. The appearance of a single question asking about clouds turning into lead isn’t a drop in the bucket, proportionally; it’s a drop in the ocean. If a poem appears 600 times, that’s another story. But right now the “what if it was on the internet, somewhere?” objection doesn’t really make sense, and every time we give GPT-3 another, even more absurd and specific problem, it makes even less sense, given that there’s a much simpler alternative hypothesis: that a 175-billion-parameter transformer, trained at a cost of $6.5m on most of the internet to model sequences of text as accurately as possible, also needed to develop a rudimentary model of the logical reasoning, concepts, and causes and effects that went into those sequences of text.
So I’ve done the low-level technical response (which might sum up to: “in the literal and probabilistic senses, and kind of in the practical sense, GPT-3 has been able to perform reasoning on everything we’ve thrown at it so far”) and pretty much emptied out my head, so here’s what’s left:
With regards to the original question I posed, I guess the natural response is to just balk at the idea of answering it – but the point isn’t really to answer it. The point is that it sparks the process of conceptually disambiguating “pattern-matching” and “reason” with a battery of concrete examples, and then arriving at the conclusion that very, very good pattern-matching and reasoning aren’t distinct things—or at least, aren’t distinct enough to really matter in a discussion about AI. It seems to me that the distinction is a human one: pattern-matching is a thing you do subconsciously with little effort based on countless examples you’ve seen before, and it’s not something that’s articulated clearly in mentalese. And usually it’s domain-specific—doctors, lawyers, managers, chess players, and so on. Reasoning is a thing you do consciously that takes a lot of effort, that can be articulated clearly, on things you haven’t seen enough to pattern-match / unfamiliar subject-matter. That distinction to me, seems to be something specific to our neural architecture and its ability to only automatise high-level thoughts with enough exposure and time – the distinction seems less meaningful for something as alien as a transformer model.
One thing I find impressive about GPT-3 is that it’s not even trying to generate text.
Imagine that someone gave you a snippet of random internet text, and told you to predict the next word. You give a probability distribution over possible next words. The end.
Then, your twin brother gets a snippet of random internet text, and is told to predict the next word. Etc. Unbeknownst to either of you, the text your brother gets is the text you got, with a new word added to it according to the probability distribution you predicted.
Then we repeat with your triplet brother, then your quadruplet brother, and so on.
Is it any wonder that sometimes the result doesn’t make sense? All it takes for the chain of words to get derailed is for one unlucky word to be drawn from someone’s distribution of next-word prediction. GPT-3 doesn’t have the ability to “undo” words it has written; it can’t even tell its future self what its past self had in mind when it “wrote” a word!
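The sibling analogy can be written down as a loop. This is a toy sketch with a hard-coded stand-in for the model; the point is that each step conditions only on the text so far, with no memory of intent and no way to undo an unlucky draw:

```python
import random

def next_word_distribution(text):
    """Stand-in for the model: returns (word, probability) pairs for a prompt.
    A real model would compute these; here they are hard-coded for illustration."""
    if text.endswith("the"):
        return [("cat", 0.7), ("ocean", 0.3)]
    return [("the", 0.9), ("xylophone", 0.1)]

def generate(prompt, steps, rng):
    """Each 'sibling' sees only the text so far: one word is drawn from the
    predicted distribution, appended, and the extended text becomes the next
    prompt. Nothing is ever revised."""
    text = prompt
    for _ in range(steps):
        words, probs = zip(*next_word_distribution(text))
        text += " " + rng.choices(words, weights=probs, k=1)[0]
    return text

print(generate("I saw", 4, random.Random(0)))
```

Every appended word is final, so a single low-probability draw ("xylophone" here, with 10% chance) permanently changes the conditioning for all later steps, which is exactly the derailment mechanism described above.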
Passing the Turing test with competent judges. If you feel like that’s too harsh yet insist on GPT-3 being capable of reasoning, then ask yourself: what’s still missing? It’s capable of both pattern recognition and reasoning, so why isn’t it an AGI yet?
“Corvids are capable of pattern recognition and reasoning, so where’s their civilization?”
Reasoning is not a binary attribute. A system could be reasoning at a subhuman level.
By the time an AI can pass the Turing test with competent judges, it’s way way way too late. We need to solve AI alignment before that happens. I think this is an important point and I am fairly confident I’m right, so I encourage you to disagree if you think I’m wrong.
I didn’t mean to imply we should wait for AI to pass the Turing test before doing alignment work. Perhaps the disagreement comes down to you thinking “We should take GPT-3 as a fire-alarm for AGI and must push forward AI alignment work” whereas I’m thinking “There is and will be no fire-alarm and we must push forward AI alignment work”
Ah, well said. Perhaps we don’t disagree then. Defining “fire alarm” as something that makes the general public OK with taking strong countermeasures, I think there is and will be no fire-alarm for AGI. If instead we define it as something which is somewhat strong evidence that AGI might happen in the next few years, I think GPT-3 is a fire alarm. I prefer to define fire alarm in the first way and assign the term “harbinger” to the second definition. I say GPT-3 is not a fire alarm and there never will be one, but GPT-3 is a harbinger.
Do you think GPT-3 is a harbinger? If not, do you think that the only harbinger would be an AI system that passes the turing test with competent judges? If so, then it seems like you think there won’t ever be a harbinger.
I don’t think GPT-3 is a harbinger. I’m not sure if there ever will be a harbinger (at least to the public); leaning towards no. An AI system that passes the Turing test wouldn’t be a harbinger, it’s the real deal.
OK, cool. Interesting. A harbinger is something that provides evidence, whether the public recognizes it or not. I think if takeoff is sufficiently fast, there won’t be any harbingers. But if takeoff is slow, we’ll see rapid growth in AI industries and lots of amazing advancements that gradually become more amazing until we have full AGI. And so there will be plenty of harbingers. Do you think takeoff will probably be very fast?
Yeah, the terms are always a bit vague; as far as existence proofs for AGI go, there are already humans and evolution, so my definition of a harbinger would be something like ‘a prototype that clearly shows no more conceptual breakthroughs are needed for AGI’.
I still think we’re at least one breakthrough away from that point, though that belief is dampened by the position of Ilya Sutskever, whose opinion I greatly respect. But either way, GPT-3 in particular just doesn’t stand out to me from the rest of the DL achievements over the years, from AlexNet to AlphaGo to OpenAI Five.
And yes, I believe there will be fast takeoff.
Fair enough, and well said. I don’t think we really disagree then, I just have a lower threshold for how much evidence counts as a harbinger, and that’s just a difference in how we use the words. I also think probably we’ll need at least one more conceptual breakthrough.
What does Ilya Sutskever think? Can you link to something I could read on the subject?
You can listen to his thoughts on AGI in this video
I find that he has an exceptionally sharp intuition about why deep learning works, from the original AlexNet to Deep Double Descent. You can see him predicting the progress in NLP here
Hmm, I think the purpose behind my post went amiss. The point of the exercise is process-oriented, not result-oriented—to either learn to better differentiate the concepts in your head by poking and prodding at them with concrete examples, or realise that they aren’t quite distinct at all. But in any case, I have a few responses to your question. The most relevant one was covered by another commenter (reasoning ability isn’t binary; it’s quantitative, not qualitative). The remaining two are:
1. “Why isn’t it an AGI?” here can be read as “why hasn’t it done the things I’d expect from an AGI?” or “why doesn’t it have the characteristics of general intelligence?”, and there’s a subtle shade of difference here that requires two different answers.
For the first, GPT-3 isn’t capable of goal-driven behaviour. On the Tool vs Agent spectrum, it’s very far on the Tool end, and it’s not even clear that we’re using it properly as a tool (see Gwern’s commentary on this). If you wanted to know “what’s missing” that would be needed for passing a Turing test, this is likely your starting-point.
For the second, the premise is more arguable. ‘What characteristics constitute general intelligence?’, ‘Which of them are necessary and which of them are auxiliary?’, etc. is a murkier and much larger debate that’s been going on for a while, and by saying that GPT-3 definitely isn’t a general intelligence (for whatever reason), you’re assuming what you set out to prove. Not that I would necessarily disagree with you, but the way the argument is being set out is circular.
2. “Passing the Turing test with competent judges” is an evasion, not an answer to the question – a very sensible one, though. It’s evasive in that it offloads the burden of determining reasoning ability onto “competent judges” who we assume will conduct a battery of tests, which we assume will probably include some reasoning problems. But what reasoning problems will they ask? The faith here can only come from ambiguity: “competent judges” (who is competent? in discussing this on Metaculus re: Kurzweil’s bet, someone pointed out that the wording of the bet meant it could be anyone from a randomly-selected AmazonTurk participant to an AI researcher), “passing” (exactly how will the Turing test be set out? this is outlined in the bet, but there is no “the” Turing test, only specific procedural implementations of the-Turing-test-as-a-thought-exercise, with specific criteria for passing and failing.) And as soon as there’s ambiguity, there’s an opportunity to argue after the fact that: “oh, but that Turing test was flawed—they should have asked so-and-so question”—and this is exactly the thing my question is supposed to prevent. What is that “so-and-so question”, or set of questions?
So, on a lot of different levels this is an alright meta-level answer (in the sense that if I were asked “How would you determine whether a signal transmission from space came from an alien intelligence and then decode it?”, my most sensible answer would be: “I don’t know. Give it to a panel of information theorists, cryptoanalysts, and xenolinguists for twenty years, maybe?”) but a poor actual answer.
Why would goal-driven behavior be necessary for passing a Turing test? It just needs to predict human behavior in a limited context, which was what GPT-3 was trained to do. It’s not an RL setting.
I would like to dispute that by drawing an analogy to the definition of fire before modern chemistry. We didn’t know exactly what fire was, but it was a “you know it when you see it” kind of deal. It’s not helpful to pre-commit to a certain benchmark, like we did with chess: at one point we were sure beating the world champion in chess would be a definitive sign of intelligence, but Deep Blue came and went, and we now agree that chess AIs aren’t general intelligences. I know this sounds like moving the goal-post, but then again, the point of contention here isn’t whether OpenAI deserves some brownie points or not.
It seems like you think I made that suggestion in bad faith, but I was being genuine with that idea. The “competent judges” part was so that the judges, you know, are actually asking adversarial questions, which is the point of the test. Cases like Eugene Goostman should get filtered out. I would grant the AI be allowed to be trained on a corpus of adversarial queries from past Turing tests (though I don’t expect this to help), but the judges should also have access to this corpus so they can try to come up with questions orthogonal to it.
I think the point at which our intuitions depart is: I expect there to be a sharp distinction between general and narrow intelligence, and I expect the difference to resolve very unambiguously in any reasonably well designed test, which is why I don’t care too much about precise benchmarks. Since you don’t share this intuition, I can see why you feel so strongly about precisely defining these benchmarks.
I could offer some alternative ideas in an RL setting though:
An AI that solves Snake perfectly on any map (maps randomly generated, with training and test maps kept separate), or
An AI that solves unseen Chronotron levels at test time within a reasonable amount of game time (say <10x human average) while being trained on a separate set of levels
I hope you find these tests fair and precise enough, or at least get a sense of what I’m trying to see in an agent with “reasoning ability”? To me these tasks demonstrate why reasoning is powerful and why we should care about it in the first place. Feel free to disagree though.