Possible. Possible. But I don’t see how that is more likely than that Alibaba just made something better. Or they made something with lots of contamination. I think this should make us update toward not underestimating them. The Kling thing is a whole other issue. If it is confirmed text-to-video and not something else, then we are in big trouble, because the chip limits have failed.
For what it’s worth, Yann LeCun argues that video diffusion models like Sora, or any models that predict pixels, are useless for building an AGI world model. So this might be a dead end. The reason, according to LeCun, is that pixel data is very high-dimensional and redundant compared to text (LLM vocabularies are only on the order of 65,000 tokens), which makes exact prediction less useful. In his 2022 outline of a proposed path to AGI, he instead proposes JEPA, an architecture that predicts embeddings rather than exact pixels.
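To make the pixel-vs-embedding distinction concrete, here is a minimal toy sketch (not LeCun's actual JEPA implementation; the random-projection encoder and identity predictor are illustrative assumptions) contrasting a pixel-space reconstruction loss with a JEPA-style loss computed in a low-dimensional embedding space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": two adjacent patches, flattened to 64-pixel vectors.
# The target patch is a noisy, correlated neighbour of the context patch.
context_patch = rng.normal(size=64)
target_patch = context_patch + rng.normal(scale=0.1, size=64)

# Shared encoder: a fixed random projection into an 8-dim embedding space
# (stand-in for a learned encoder; an assumption of this sketch).
W_enc = rng.normal(size=(8, 64)) / np.sqrt(64)

def encode(x):
    return W_enc @ x

# Pixel-space objective (what a pixel-prediction model optimises):
# reproduce every pixel of the target exactly, redundancy and all.
pixel_loss = np.mean((context_patch - target_patch) ** 2)

# JEPA-style objective: predict only the target's *embedding* from the
# context's embedding (predictor = identity here, for simplicity).
predicted_embedding = encode(context_patch)
embedding_loss = np.mean((predicted_embedding - encode(target_patch)) ** 2)

print(f"pixel-space loss (64 dims):     {pixel_loss:.4f}")
print(f"embedding-space loss (8 dims):  {embedding_loss:.4f}")
```

The point of the contrast: the embedding objective only has to get 8 abstract numbers right instead of 64 redundant pixels, which is the gist of the "predict embeddings, not pixels" argument.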
Wait till you find out that Qwen 2 is probably just Llama 3 with a few changes and some training on benchmarks to inflate performance a bit.