I think I mildly disagree, but probably we’re looking at the same examples.
I think the most impressive (in terms of realism) videos are under “Sora is able to generate complex scenes with multiple characters, …”. (Includes the white SUV video and the Tokyo suburbs video.)
I think all of these videos other than the octopus and paper planes are “at-a-glance” photorealistic to me.
Overall, I think Sora can do “at-a-glance” photorealistic videos and can model, to some extent, how things move in the real world. I don’t think it can do both complex motion and photorealism in the same video. As in, the videos which are photorealistic don’t really involve complex motion, and the videos which involve complex motion aren’t photorealistic.
(So probably some amount of hype, but also pretty real?)
Hmm, I don’t buy it. These two scenes seem very much not like the kind of thing a video game engine could produce:
Look at this frame! I think there is something very slightly off about that face, but the cat hitting the person’s face and the person’s reaction seem very realistic to me, and IMO qualify as “complex motion and photorealism in the same video”.
Were these supposed to embed as videos? I just see stills, and don’t know where they came from.
These are stills from some of the videos I was referencing.
TBC, I wasn’t claiming anything about video game engines.
I wouldn’t have called the cat one “complex motion”, but I can see where you’re coming from.
Yeah, I mean I guess it depends on what you mean by photorealistic. That cat has three front legs.
Yeah, this is the example I’ve been using to convince people that game engines are almost certainly generating training data but are probably not involved at sampling time. I can’t come up with any sort of hybrid architecture, like an NN controlling a game engine through an API, where you get that third front leg. One of the biggest benefits of a game engine would be ensuring exactly that sort of thing couldn’t happen: body parts becoming detached, floating in mid-air, failing to be conserved. If you had a game engine with a hyper-realistic cat body model in it which something external was manipulating, you wouldn’t get that sort of common-sense physics problem. (Meanwhile, the errors do look like past generative modeling of cats. Remember the ProGAN interpolation videos of CATS? Hilarious, but also an apt demonstration of how extremely hard cats are to model. They’re worse than hands.)
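To make the conservation point concrete, here’s a minimal, purely illustrative Python sketch (all names hypothetical, not any real engine’s API): a rigged model exposes only pose parameters over a fixed skeleton, so any external controller, NN or otherwise, can bend legs but can never add one. A pixel-space generator has no such structural constraint, which is why a third front leg can appear.

```python
# Hypothetical sketch: why an engine-in-the-loop would prevent extra limbs.
# A rigged model is posed via joint parameters over a FIXED skeleton; a
# controller can only choose angles, never change the skeleton's topology.

from dataclasses import dataclass, field

@dataclass
class Joint:
    name: str
    angle: float = 0.0  # the only degree of freedom a controller can touch

@dataclass
class CatRig:
    # Skeleton topology is fixed at construction: exactly four legs.
    joints: list[Joint] = field(default_factory=lambda: [
        Joint(f"{side}_{end}_leg")
        for side in ("left", "right")
        for end in ("front", "hind")
    ])

    def pose(self, angles: list[float]) -> None:
        # A controller supplies one angle per joint; it cannot add joints.
        assert len(angles) == len(self.joints)
        for joint, angle in zip(self.joints, angles):
            joint.angle = angle

rig = CatRig()
rig.pose([0.1, -0.2, 0.3, 0.0])      # valid: four legs, four angles
print([j.name for j in rig.joints])  # always exactly four leg joints
```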
In addition, you see plenty of classic NN tells throughout—note the people driving a ‘Dandrover’...
Yeah, those were exactly the two videos which most made me think that the model was mostly trained on video game animation. In the Tokyo one, the woman’s facial muscles never move at all, even when the camera zooms in on her. And in the SUV one, the dust cloud isn’t realistic, but even covering that up, the SUV has a Grand Theft Auto look to its motion.
“Can’t do both complex motion and photorealism in the same video” is a good hypothesis to track, thanks for putting that one on my radar.
(Note that I was talking about the one with the train going through Tokyo suburbs.)