How does it work? There is a technical report. Mostly it seems like OpenAI did standard OpenAI things, meaning they fed in tons of data, used lots of compute, and pressed the scaling button super hard. The innovations they are willing to talk about seem to be things like ‘do not crop the videos into a standard size.’
That does not mean there are no other important innovations. I presume there are. They simply are not talking about them.
Actually, the most important thing they do disclose is that Sora is a Diffusion Transformer.
This architecture was introduced by William Peebles and Saining Xie in ‘Scalable Diffusion Models with Transformers’ (https://arxiv.org/abs/2212.09748).
The first author is now a co-lead of Sora: https://www.wpeebles.com/ and https://twitter.com/billpeeb.
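To give a flavor of what a Diffusion Transformer block looks like, here is a minimal sketch loosely following the adaLN-Zero design from the Peebles & Xie paper: a standard transformer block whose layer norms are modulated by scale/shift/gate parameters regressed from the conditioning signal (e.g. the diffusion timestep). All names and sizes are illustrative; Sora’s actual implementation is undisclosed.

```python
# Minimal sketch of one Diffusion Transformer (DiT) block, loosely following
# the adaLN-Zero design from Peebles & Xie (2022). Illustrative only.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Conditioning (e.g. a timestep embedding) is mapped to per-block
        # scale/shift/gate parameters: adaptive layer norm.
        self.ada = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)  # "zero" init: block starts as identity
        nn.init.zeros_(self.ada.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) noised latent patches; cond: (batch, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = \
            self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h)
        x = x + gate1.unsqueeze(1) * attn_out
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x

# Usage: denoise a sequence of latent patches conditioned on a timestep embedding.
block = DiTBlock(dim=256, num_heads=8)
patches = torch.randn(2, 64, 256)   # (batch, tokens, dim)
t_embed = torch.randn(2, 256)       # timestep/conditioning embedding
out = block(patches, t_embed)       # -> torch.Size([2, 64, 256])
```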
My take on Sora is that its performance really does seem to validate the central claim of their 2022 paper: that Transformer-based diffusion models should work better than diffusion models built on older backbones such as U-Nets. This might have implications well beyond video generation. Intuitively, Transformers + Diffusion feels like an attractive combination, and the success of Sora might motivate people to try it more widely.
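To make the pairing concrete, here is a sketch of one diffusion training step with the DiTBlock above slotted in as the denoiser. The DDPM-style noise schedule and sinusoidal timestep embedding are standard placeholders I am assuming, not anything Sora discloses.

```python
# Sketch of one diffusion training step using the DiTBlock above.
# Schedule and embedding are standard DDPM-style placeholders.
import math
import torch
import torch.nn.functional as F

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    # Standard sinusoidal embedding of the diffusion timestep.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

model = DiTBlock(dim=256, num_heads=8)  # in practice: a deep stack of blocks
x0 = torch.randn(2, 64, 256)            # "clean" latent patches

t = torch.randint(0, T, (2,))
a = alphas_cumprod[t].view(-1, 1, 1)
noise = torch.randn_like(x0)
xt = a.sqrt() * x0 + (1 - a).sqrt() * noise    # forward (noising) process
pred = model(xt, timestep_embedding(t, 256))   # transformer denoises
loss = F.mse_loss(pred, noise)                 # learn to predict the noise
loss.backward()
```

The point of the sketch is that the transformer is a drop-in replacement for the denoising backbone; the diffusion machinery around it is unchanged.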