I really don’t care about IQ tests; ChatGPT does not perform at a human level. I’ve spent hours with it. Sometimes it does come off like a human with an IQ of about 83, all concentrated in verbal skills. Sometimes it sounds like a human with a much higher IQ than that (and a bunch of naive prejudices). But if you take it out of its comfort zone and try to get it to think, it sounds more like a human with profound brain damage. You can take it step by step through a chain of simple inferences, and still have it give an obviously wrong, pattern-matched answer at the end. I wish I’d saved what it told me about cooking and neutrons. Let’s just say it became clear that it was not using an actual model of the physical world to generate its answers.
Other examples are cherry picked. Having prompted DALL-E and Stable Diffusion quite a bit, I’m pretty convinced those drawings are heavily cherry picked; normally you get a few that match your prompt, plus a bunch of stuff that doesn’t really meet the specs, not to mention a bit of eldritch horror. That doesn’t happen if you ask a human to draw something, not even if it’s a small child. And you don’t have to iterate on the prompt so much with a human, either.
Competitive coding is a cherry-picked problem, as easy as a coding challenge gets… the tasks are tightly bounded, described in terms that almost amount to code themselves, and come with comprehensive test cases. On the other hand, “coding assistants” are out there annoying people by throwing really dumb bugs into their output (which is just close enough to right that you might miss those bugs on a quick glance and really get yourself into trouble).
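To illustrate the failure mode I mean, here’s a made-up example (not taken from any real assistant’s output): the buggy version looks almost identical to the correct one on a quick glance.

```python
# Hypothetical illustration: two versions of the same helper. The second is
# the kind of "almost right" code that slips past a quick review.

def average_of_last(values, window):
    """Correct: mean of the last `window` values."""
    return sum(values[-window:]) / window

def average_of_last_buggy(values, window):
    # Subtle off-by-one: the slice silently drops the most recent value,
    # while the divisor still pretends there are `window` of them.
    return sum(values[-window:-1]) / window

print(average_of_last([1, 2, 3, 4], 2))        # 3.5
print(average_of_last_buggy([1, 2, 3, 4], 2))  # 1.5, wrong and quietly so
```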
Self-driving cars bog down under any really unusual driving conditions in ways that humans do not… which is why they’re being run in ultra-heavily-mapped urban cores with human help nearby, and even then mostly for publicity.
The protein thing is getting along toward generating enzymes, but I don’t think it’s really there yet. The Diplomacy bot is indeed scary, but it still operates in a very limited domain.
… and none of them have the agency to decide why they should generate this or that, or to systematically generate things in pursuit of any actual goal in a novel or nearly unrestricted domain, or to adapt flexibly to the unexpected. That’s what intelligence is really about.
I’m not making any claim about when somebody will patch together an AI with a human-like level of “general” performance. Maybe it will be soon. Again, the game-playing stuff is especially concerning. And there’s a disturbingly large amount of hardware available. Maybe we’ll see true AGI even in 2023 (although I still doubt it a lot).
But it did not happen in 2022, not even approximately, not even “in a sense”. Those things don’t have human-like performance in domains even as wide as “drawing” or “computer programming” or “driving”. They have flashes of human-level, or superhuman, performance, in parts of those domains… along with frequent abject failures.
While I generally agree with you here, I strongly encourage you to find an IQ-83-ish human and put them through the same conversational tests that you put ChatGPT through, and then report back your findings. I’m quite curious to hear how it goes.
ChatGPT does not perform at a human level. I’ve spent hours with it. Sometimes it does come off like a human with an IQ of about 83, all concentrated in verbal skills. Sometimes it sounds like a human with a much higher IQ than that (and a bunch of naive prejudices). But if you take it out of its comfort zone and try to get it to think, it sounds more like a human with profound brain damage. You can take it step by step through a chain of simple inferences, and still have it give an obviously wrong, pattern-matched answer at the end. I wish I’d saved what it told me about cooking and neutrons. Let’s just say it became clear that it was not using an actual model of the physical world to generate its answers.
It’s not just a low-IQ human, though. I know some low-IQ people, and their primary deficiency is in applying information to new situations, where a “new situation” might just be the same situation as before but on a different day, or with a slightly different setup. For example, a task requires A, B1, B2, and C. Usually all four things need to be done, but sometimes B1 is already done and you just need to do A, B2, and C. A medium- or high-IQ person would notice and just do the three things (a medium one might also just do B1 anyway). Low-IQ folx tend to do A, and then not know what to do next, because it doesn’t match the pattern.
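In code terms, the difference is roughly this toy sketch (the step names are made up to match the example above): the flexible approach checks what’s already done before acting, while the rote one replays a memorized sequence and gets stuck when the state doesn’t match it.

```python
# Toy sketch of the adaptation failure described above (made-up step names).

def do_task_flexibly(state):
    # Check each precondition and skip steps that are already done.
    for step in ["A", "B1", "B2", "C"]:
        if not state.get(step, False):
            print(f"doing {step}")
            state[step] = True

def do_task_by_rote(state, memorized=("A", "B1", "B2", "C")):
    # Replay the memorized sequence; give up when reality doesn't match it.
    for step in memorized:
        if state.get(step, False):
            print(f"{step} is already done, which breaks the pattern; stopping")
            return
        print(f"doing {step}")
        state[step] = True

do_task_flexibly({"B1": True})  # does A, B2, C
do_task_by_rote({"B1": True})   # does A, then stops at B1
```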
The one test I ran with ChatGPT was asking three questions (with a prompt saying that you are giving instructions to a silly dad who will do exactly what you say, but misinterpret anything that isn’t specific):
Explain how to make a Peanut Butter Sandwich (it gave a good result for this common request)
Explain how to add 237 and 465 (word salad with errors on every attempt; see the sketch after this list for what a fully explicit answer would look like)
Explain how to start a load of laundry (it was completely unable to give specific steps; it only had general platitudes like “place the sorted laundry into the washing machine.” Nowhere did it specify to place only one sort of sorted laundry in at a time, nor did it ever include “open the washing machine” in the list.)
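For reference, this is the sort of fully spelled-out answer the addition question was fishing for, written as a small sketch (the function is mine, not ChatGPT’s): digit-by-digit column addition with the carries made explicit.

```python
# Column addition of 237 and 465 with every step and carry spelled out.

def explain_column_addition(a, b):
    digits_a = [int(d) for d in str(a)][::-1]  # least-significant digit first
    digits_b = [int(d) for d in str(b)][::-1]
    carry, result = 0, []
    for i in range(max(len(digits_a), len(digits_b))):
        da = digits_a[i] if i < len(digits_a) else 0
        db = digits_b[i] if i < len(digits_b) else 0
        total = da + db + carry
        print(f"column {i + 1}: {da} + {db} + carry {carry} = {total}, "
              f"write {total % 10}, carry {total // 10}")
        result.append(total % 10)
        carry = total // 10
    if carry:
        result.append(carry)
    return int("".join(str(d) for d in reversed(result)))

print(explain_column_addition(237, 465))  # three columns of steps, then 702
```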
I would love it if you ran that exact test with those people you know and reported back what happened. I’m not saying you are wrong; I’m just genuinely curious. Ideally you shouldn’t do it verbally, you should do it via text chat, and then just share the transcript. No worries if you are too busy, of course.
I agree that several of OP’s examples weren’t the best, and that AGI did not arrive in 2022. However, one of the big lessons we’ve learned from the last few years of LLM advancement is that there is a rapidly growing variety of tasks that, as it turned out, modern ML can nail just by scaling up and observing extremely large datasets of behavior by sentient minds. It is not clear how much low-hanging fruit remains, only that it keeps coming.
Furthermore, when using products like Stable Diffusion, DALL-E, and especially ChatGPT, it’s important to note that these are products packaged to minimize processing power per use, for a public that will rarely notice the difference. In certain cases, the provider might even be dumbing down the AI to reduce how often it disturbs average users. They can count as confirming evidence of danger, but they aren’t reliable as disconfirming evidence of danger; there are plenty of other sources for that.
Having prompted DALL-E and Stable Diffusion quite a bit, I’m pretty convinced those drawings are heavily cherry picked
But those use a different algorithm: diffusion vs. feedforward. It’s possible that those examples are cherry-picked, but I wouldn’t be so sure until the full model comes out.
Both DALL-E and Stable Diffusion use diffusion, which is a many-step feedforward algorithm.
Oh, 😳. Well, my misunderstanding aside, they are both built using diffusion, while those pics (from Google MUSE) are from a model that doesn’t use diffusion. https://metaphysic.ai/muse-googles-super-fast-text-to-image-model-abandons-latent-diffusion-for-transformers/
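To make the “many-step feedforward” point concrete, here is a minimal sketch (the denoiser callable and the update rule are simplified stand-ins, not any particular library’s API): diffusion sampling is just one feedforward denoising network applied repeatedly in a loop.

```python
import torch

def sample(denoiser, steps=50, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        predicted_noise = denoiser(x, t)  # one feedforward pass per step
        # Crude update toward less-noisy images; real samplers (DDPM/DDIM)
        # use a carefully derived schedule here.
        x = x - predicted_noise / steps
    return x

# Usage with a stand-in denoiser (a real one would be a trained U-Net):
fake_denoiser = lambda x, t: 0.1 * x
print(sample(fake_denoiser, steps=10).shape)  # torch.Size([1, 3, 64, 64])
```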