I am not sure how you came to the conclusion that current models are superhuman. I can visualize complex scenes in 3D, for example. Especially under some drugs :)
And I don’t even think I have an especially good imagination.
In general, it is very hard to compare mental imagery with Stable Diffusion. For example, it is hard to imagine something with many different details in different parts of the image, but that is perhaps just a matter of representation. An analogy: our perception is like a low-resolution display, but I can easily zoom in on any area and see the details.
I wouldn’t say that current models are superhuman, although I wouldn’t claim humans are better either; it is just very unclear how to compare the two properly, and there are probably a lot of potential pitfalls.
So 1) has a large role here.
On 2), CNNs are not a great example (as you mentioned yourself): vision transformers demonstrate similar performance. It seems that this inductive bias is relatively easy for neural networks to learn. I would guess it’s similar for human brains too, although I don’t know much about neurobiology.
3) doesn’t seem like a good reason to me. There are modern GANs that demonstrate performance similar to diffusion models, and there are approaches that make diffusion work in a very small number of steps; even one step showed decent results, IIRC. Also, even ImageGPT worked pretty well back in the day.
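For concreteness, here is roughly what single-step diffusion sampling looks like in practice. This is only a hedged sketch assuming the Hugging Face diffusers library and the SDXL-Turbo checkpoint (a distilled model built for 1–4 step generation); the prompt and settings are purely illustrative:

```python
# Minimal sketch: one-step text-to-image sampling with a distilled diffusion model.
# Assumes the Hugging Face `diffusers` library and the SDXL-Turbo checkpoint.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Turbo-style distilled models are trained to sample without classifier-free
# guidance, so guidance_scale=0.0 and a single denoising step suffice.
image = pipe(
    prompt="a photo of four zombies standing in a field",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("one_step_sample.png")
```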
4) Similarly to the initial claim, I don’t think much can be confidently said about LLM language abilities in comparison to humans. I do not know what exactly that would mean or how to measure it. We can run benchmarks, yes, but do they tell us anything deep? I don’t think so. LLMs are a very different kind of intelligence: they can do many things humans can’t, and vice versa.
At the same time, visual models don’t strike me as much more capable given the same size or the same amount of compute. They are quite stupid: they can’t count, and they can’t handle simple compositionality.
5) It is possible we will have much more efficient language models, but again, I don’t think current ones are much less efficient than visual models.
My two main reasons for the perceived efficiency difference:
1) It is super hard to compare with humans, and we may be doing it completely wrong. I think we should aspire to avoid such comparisons unless absolutely necessary.
2) “Language ability” depends much more on understanding and having a complicated world model than “visual ability” does. We are not terribly disappointed when Stable Diffusion consistently draws three zombies when we ask for four, and we mostly forgive the weird four-fingered hands sometimes growing from the wrong places. But when LLMs produce similar nonsense, it is much more evident and hurts performance a lot (both on benchmarks and in the real world). LLMs can imitate style well and have decent grammar. Larger ones like GPT-4 can even count decently well and probably do some reasoning. So the hard part (at least for our current deep learning methods) is the world model. Pattern matching is easy and not really important in the grand scheme of things, but it still looks kinda impressive when visual models do it.