So a good exercise becomes: what minimally complex problem could you give to GPT-3 that would differentiate between pattern-matching and reasoning?
Passing the Turing test with competent judges. If you feel like that’s too harsh yet insist on GPT-3 being capable of reasoning, then ask yourself: what’s still missing? It’s capable of both pattern recognition and reasoning, so why isn’t it an AGI yet?
“Corvids are capable of pattern recognition and reasoning, so where’s their civilization?”
Reasoning is not a binary attribute. A system could be reasoning at a subhuman level.
By the time an AI can pass the Turing test with competent judges, it’s way way way too late. We need to solve AI alignment before that happens. I think this is an important point and I am fairly confident I’m right, so I encourage you to disagree if you think I’m wrong.
I didn’t mean to imply we should wait for AI to pass the Turing test before doing alignment work. Perhaps the disagreement comes down to you thinking “We should take GPT-3 as a fire alarm for AGI and must push forward AI alignment work”, whereas I’m thinking “There is and will be no fire alarm and we must push forward AI alignment work”.
Ah, well said. Perhaps we don’t disagree then. Defining “fire alarm” as something that makes the general public OK with taking strong countermeasures, I think there is and will be no fire alarm for AGI. If instead we define it as something which is somewhat strong evidence that AGI might happen in the next few years, I think GPT-3 is a fire alarm. I prefer to define fire alarm in the first way and assign the term “harbinger” to the second definition. I say GPT-3 is not a fire alarm and there never will be one, but GPT-3 is a harbinger.
Do you think GPT-3 is a harbinger? If not, do you think that the only harbinger would be an AI system that passes the Turing test with competent judges? If so, then it seems like you think there won’t ever be a harbinger.
I don’t think GPT-3 is a harbinger. I’m not sure if there ever will be a harbinger (at least to the public); leaning towards no. An AI system that passes the Turing test wouldn’t be a harbinger; it would be the real deal.
OK, cool. Interesting. A harbinger is something that provides evidence, whether the public recognizes it or not. I think if takeoff is sufficiently fast, there won’t be any harbingers. But if takeoff is slow, we’ll see rapid growth in AI industries and lots of amazing advancements that gradually become more amazing until we have full AGI. And so there will be plenty of harbingers. Do you think takeoff will probably be very fast?
Yeah, the terms are always a bit vague; as far as existence proofs for AGI go, there are already humans and evolution, so my definition of a harbinger would be something like ‘A prototype that clearly shows no more conceptual breakthroughs are needed for AGI’.
I still think we’re at least one breakthrough away from that point, though that belief is dampened by the position of Ilya Sutskever, whose opinion I greatly respect. But either way, GPT-3 in particular just doesn’t stand out to me from the rest of the DL achievements over the years, from AlexNet to AlphaGo to OpenAI5.
And yes, I believe there will be fast takeoff.
Fair enough, and well said. I don’t think we really disagree then, I just have a lower threshold for how much evidence counts as a harbinger, and that’s just a difference in how we use the words. I also think probably we’ll need at least one more conceptual breakthrough.
What does Ilya Sutskever think? Can you link to something I could read on the subject?
You can listen to his thoughts on AGI in this video.
I find that he has an exceptionally sharp intuition about why deep learning works, from the original AlexNet to Deep Double Descent. You can see him predicting the progress in NLP here.
Hmm, I think the purpose behind my post went amiss. The point of the exercise is process-oriented, not result-oriented—to either learn to better differentiate the concepts in your head by poking and prodding at them with concrete examples, or realise that they aren’t quite distinct at all. But in any case, I have a few responses to your question. The most relevant one was covered by another commenter (reasoning ability isn’t binary; it’s quantitative, not qualitative). The remaining two are:
1. “Why isn’t it an AGI?” here can be read as “why hasn’t it done the things I’d expect from an AGI?” or “why doesn’t it have the characteristics of general intelligence?”, and there’s a subtle shade of difference here that requires two different answers.
For the first, GPT-3 isn’t capable of goal-driven behaviour. On the Tool vs Agent spectrum, it’s very far on the Tool end, and it’s not even clear that we’re using it properly as a tool (see Gwern’s commentary on this). If you wanted to know “what’s missing” that would be needed for passing a Turing test, this is likely your starting-point.
For the second, the premise is more arguable. ‘What characteristics constitute general intelligence?’, ‘Which of them are necessary and which of them are auxiliary?’, etc. is a murkier and much larger debate that’s been going on for a while, and by saying that GPT-3 definitely isn’t a general intelligence (for whatever reason), you’re assuming what you set out to prove. Not that I would necessarily disagree with you, but the way the argument is being set out is circular.
2. “Passing the Turing test with competent judges” is an evasion, not an answer to the question – a very sensible one, though. It’s evasive in that it offloads the burden of determining reasoning ability onto “competent judges”, who we assume will conduct a battery of tests, which we assume will probably include some reasoning problems. But what reasoning problems will they ask? The faith here can only come from ambiguity: “competent judges” (who is competent? In discussing this on Metaculus re: Kurzweil’s bet, someone pointed out that the wording of the bet meant it could be anyone from a randomly selected Amazon Mechanical Turk participant to an AI researcher), “passing” (exactly how will the Turing test be set out? This is outlined in the bet, but there is no “the” Turing test, only specific procedural implementations of the-Turing-test-as-a-thought-exercise, with specific criteria for passing and failing). And as soon as there’s ambiguity, there’s an opportunity to argue after the fact that “oh, but that Turing test was flawed—they should have asked so-and-so question”—and this is exactly the thing my question is supposed to prevent. What is that “so-and-so question”, or set of questions?
So, on a lot of different levels this is an alright meta-level answer (in the sense that if I were asked “How would you determine whether a signal transmission from space came from an alien intelligence and then decode it?”, my most sensible answer would be: “I don’t know. Give it to a panel of information theorists, cryptanalysts, and xenolinguists for twenty years, maybe?”) but a poor actual answer.
For the first, GPT-3 isn’t capable of goal-driven behaviour.
Why would goal-driven behavior be necessary for passing a Turing test? It just needs to predict human behavior in a limited context, which is what GPT-3 was trained to do. It’s not an RL setting.
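To spell out what I mean by “predict human behavior in a limited context”, here is a rough sketch of the loop, assuming nothing more than a language model that exposes next-token probabilities (the next_token_probs interface and gpt3_like_model name below are hypothetical stand-ins, not the actual API):

```python
import random

# Hypothetical interface: lm.next_token_probs(text) returns a dict mapping each
# candidate next token to its probability under the model. That conditional
# distribution is the whole contract a GPT-style model satisfies; there is no
# reward, no goal, and no environment being acted on.

def sample_reply(lm, transcript, max_tokens=200):
    """Produce the judge-facing reply purely by repeated next-token prediction."""
    prompt = transcript + "\nAI:"
    reply = ""
    for _ in range(max_tokens):
        probs = lm.next_token_probs(prompt + reply)   # p(token | text so far)
        token = random.choices(list(probs), weights=list(probs.values()))[0]
        if token == "\n":                             # end of this reply turn
            break
        reply += token
    return reply

# Usage (gpt3_like_model is a stand-in for any such predictor):
# transcript = "Judge: Which is heavier, a kilogram of feathers or a kilogram of lead?"
# print(sample_reply(gpt3_like_model, transcript))
```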
and by saying that GPT-3 definitely isn’t a general intelligence (for whatever reason), you’re assuming what you set out to prove.
I would like to dispute that by drawing an analogy to the definition of fire before modern chemistry. We didn’t know exactly what fire was, but it was a “you know it when you see it” kind of deal. It’s not helpful to pre-commit to a certain benchmark, like we did with chess—at one point we were sure beating the world champion in chess would be a definitive sign of intelligence, but Deep Blue came and went and we now agree that chess AIs aren’t general intelligence. I know this sounds like moving the goalposts, but then again, the point of contention here isn’t whether OpenAI deserves some brownie points or not.
“Passing the Turing test with competent judges” is an evasion, not an answer to the question – a very sensible one, though.
It seems like you think I made that suggestion in bad faith, but I was being genuine with that idea. The “competent judges” part was so that the judges, you know, are actually asking adversarial questions, which is the point of the test. Cases like Eugene Goostman should get filtered out. I would grant that the AI be allowed to train on a corpus of adversarial queries from past Turing tests (though I don’t expect this to help), but the judges should also have access to this corpus so they can try to come up with questions orthogonal to it.
I think the point at which our intuitions depart is: I expect there to be a sharp distinction between general and narrow intelligence, and I expect the difference to resolve very unambiguously in any reasonably well designed test, which is why I don’t care too much about precise benchmarks. Since you don’t share this intuition, I can see why you feel so strongly about precisely defining these benchmarks.
I could offer some alternative ideas in an RL setting though:
An AI that solves Snake perfectly on any map (maps should be randomly generated and split between training and test sets; a rough sketch of this protocol is given below), or
An AI that solves unseen Chronotron levels at test time within a reasonable amount of game time (say <10x human average) while being trained on a separate set of levels
I hope you find these tests fair and precise enough, or at least get a sense of what I’m trying to see in an agent with “reasoning ability”? To me these tasks demonstrate why reasoning is powerful and why we should care about it in the first place. Feel free to disagree though.
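To make the Snake test a bit more concrete, here is a minimal sketch of the train/test protocol I have in mind; the grid size, wall count, and seed counts are arbitrary placeholder numbers, and the Snake environment itself is left out entirely, so this is only the map-splitting part, not an implementation of the benchmark:

```python
import random

# Sketch of the evaluation protocol only: maps are procedurally generated from
# seeds, and the seeds used for training are strictly disjoint from the ones
# used at test time, so an agent cannot have memorised the evaluation layouts.
# All numbers below are arbitrary placeholders.

def generate_map(seed, width=20, height=20, n_walls=15):
    """Deterministically generate a wall layout for a Snake grid from a seed."""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(width) for y in range(height)]
    return frozenset(rng.sample(cells, n_walls))

def train_test_seeds(n_train=10_000, n_test=1_000, split_seed=0):
    """Disjoint seed pools: the agent trains on one and is scored only on the other."""
    seeds = list(range(n_train + n_test))
    random.Random(split_seed).shuffle(seeds)
    return seeds[:n_train], seeds[n_train:]

if __name__ == "__main__":
    train, test = train_test_seeds()
    assert not set(train) & set(test)          # no seed leakage between splits
    train_maps = {generate_map(s) for s in train}
    # "Solving Snake perfectly" would then mean a perfect score on every map built
    # from the held-out seeds, none of which the agent saw during training.
    assert all(generate_map(s) not in train_maps for s in test[:100])
```

The Chronotron version would be the same idea, just with held-out levels in place of held-out maps.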