For what it’s worth, Yann LeCun argues that video diffusion models like Sora, or any models which predict pixels, are useless for creating an AGI world model. So this might be a dead end. The reason, according to LeCun, is that pixel data is very high dimensional and redundant compared to text (LLMs only use something like 65.000 tokens), which makes exact prediction less useful. In his 2022 outline of his proposed AGI framework, JEPA, he instead proposes an architecture which predicts embeddings rather than exact pixels.
For what it’s worth, Yann LeCun argues that video diffusion models like Sora, or any models which predict pixels, are useless for creating an AGI world model. So this might be a dead end. The reason, according to LeCun, is that pixel data is very high dimensional and redundant compared to text (LLMs only use something like 65.000 tokens), which makes exact prediction less useful. In his 2022 outline of his proposed AGI framework, JEPA, he instead proposes an architecture which predicts embeddings rather than exact pixels.