“Why isn’t it an AGI?” here can be read as “why hasn’t it done the things I’d expect from an AGI?” or “why doesn’t it have the characteristics of general intelligence?”, and there’s a subtle shade of difference here that requires two different answers.
For the first, GPT-3 isn’t capable of goal-driven behaviour.
Why would goal-driven behavior be necessary for passing a Turing test? It just needs to predict human behavior in a limited context, which is what GPT-3 was trained to do. It’s not an RL setting.
and by saying that GPT-3 definitely isn’t a general intelligence (for whatever reason), you’re assuming what you set out to prove.
I would like to dispute that by drawing an analogy to the definition of fire before modern chemistry. We didn’t know exactly what fire was, but it was a “you know it when you see it” kind of deal. It’s not helpful to pre-commit to a certain benchmark, as we did with chess: at one point we were sure that beating the world champion in chess would be a definitive sign of intelligence, but Deep Blue came and went, and we now agree that chess AIs aren’t general intelligence. I know this sounds like moving the goalposts, but then again, the point of contention here isn’t whether OpenAI deserves some brownie points or not.
“Passing the Turing test with competent judges” is an evasion, not an answer to the question – a very sensible one, though.
It seems like you think I made that suggestion in bad faith, but I was being genuine with that idea. The “competent judges” part was so that the judges, you know, would actually ask adversarial questions, which is the point of the test; cases like Eugene Goostman should get filtered out. I would grant that the AI be allowed to train on a corpus of adversarial queries from past Turing tests (though I don’t expect this to help), but the judges should also have access to this corpus so they can try to come up with questions orthogonal to it.
I think the point at which our intuitions diverge is this: I expect there to be a sharp distinction between general and narrow intelligence, and I expect the difference to resolve unambiguously in any reasonably well-designed test, which is why I don’t care too much about precise benchmarks. Since you don’t share this intuition, I can see why you feel so strongly about precisely defining these benchmarks.
I could offer some alternative ideas in an RL setting though:
An AI that solves Snake perfectly on any map (maps should be randomly generated, with the training and test maps kept separate), or
An AI that solves unseen Chronotron levels at test time within a reasonable amount of game time (say, under 10x the human average) while being trained on a separate set of levels
I hope you find these tests fair and precise enough, or that they at least give a sense of what I’m trying to see in an agent with “reasoning ability”. To me these tasks demonstrate why reasoning is powerful and why we should care about it in the first place. Feel free to disagree, though.
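To make the Snake proposal a bit more concrete, here is a minimal Python sketch of what I mean by “randomly generated maps with the training and test sets kept separate”. Every name in it (the map generator, the plays_perfectly check) is hypothetical and only illustrates the protocol: train and test maps are drawn from disjoint seed pools, so the agent cannot have memorized the held-out layouts, and “solving” means clearing every test map.

    import random

    # Hypothetical sketch, not an existing benchmark or library.
    GRID = 12                      # board side length
    N_TRAIN, N_TEST = 10_000, 1_000

    def make_snake_map(seed: int, n_walls: int = 10) -> set[tuple[int, int]]:
        """Generate a random set of wall cells for one Snake map."""
        rng = random.Random(seed)
        cells = [(x, y) for x in range(GRID) for y in range(GRID)]
        return set(rng.sample(cells, n_walls))

    # Disjoint seed ranges guarantee no test map ever appears in training.
    train_maps = [make_snake_map(s) for s in range(N_TRAIN)]
    test_maps = [make_snake_map(s) for s in range(N_TRAIN, N_TRAIN + N_TEST)]

    def evaluate(agent, maps) -> float:
        """Fraction of held-out maps the agent clears perfectly.

        `agent.plays_perfectly(walls)` is an assumed interface: it should
        return True iff the agent fills the board on that map.
        """
        solved = sum(agent.plays_perfectly(walls) for walls in maps)
        return solved / len(maps)

The same seed-splitting idea carries over to the Chronotron version: train on one pool of levels, evaluate on a disjoint pool, and count a level as solved only within the stated time budget.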