The thing about the simulation capability that worries me most isn’t plugging it in as-is, but probing the model, finding where the simulator pieces are and extracting them. This is obviously complicated, but for example something as simple as a linear probe identifying which entire layers are most involved and initializing a new model for training with those layers integrated, a model which doesn’t have to output video (obviously your data/task/loss metric would have to ensure it gets used/updated/not overwritten, but choosing things where it would be useful should be enough)… I’m neither qualified to elaborate further nor inclined to do so, but the broad concern here is more efficient application of the simulation capability / integrating it into more diverse models.
Why “selection” could be a capacity which would generalize: albeit to a (highly-lossy) first approximation, most of the most successful models have been based on increasingly-general types of gamification of tasks. The more general models have more general tasks. Video can capture sufficient information to describe almost any action which humans do or would wish to take along with numerous phenomena which are impossible to directly experience in low-dimensional physical space, so if you can simulate a video, you can operate or orchestrate reality. Why selection couldn’t generalize: I can watch someone skiing but that doesn’t mean that I can ski. I can watch a speedrun of a video game and, even though the key presses are clearly visible, fail to replicate it. I could also hack together a fake speedrun. I suspect that Sora will be more useful for more-convincingly-faking speedrun content than for actually beating human players or becoming the TAS tool to end all TAS tools (aside from novel glitch discovery). This is primarily because there’s not a strong reason to believe that the model can trained to achieve extremely high-fidelity or high-precision tasks.
The thing about the simulation capability that worries me most isn’t plugging it in as-is, but probing the model, finding where the simulator pieces are and extracting them. This is obviously complicated, but for example something as simple as a linear probe identifying which entire layers are most involved and initializing a new model for training with those layers integrated, a model which doesn’t have to output video (obviously your data/task/loss metric would have to ensure it gets used/updated/not overwritten, but choosing things where it would be useful should be enough)… I’m neither qualified to elaborate further nor inclined to do so, but the broad concern here is more efficient application of the simulation capability / integrating it into more diverse models.
Why “selection” could be a capacity which would generalize: albeit to a (highly-lossy) first approximation, most of the most successful models have been based on increasingly-general types of gamification of tasks. The more general models have more general tasks. Video can capture sufficient information to describe almost any action which humans do or would wish to take along with numerous phenomena which are impossible to directly experience in low-dimensional physical space, so if you can simulate a video, you can operate or orchestrate reality.
Why selection couldn’t generalize: I can watch someone skiing but that doesn’t mean that I can ski. I can watch a speedrun of a video game and, even though the key presses are clearly visible, fail to replicate it. I could also hack together a fake speedrun. I suspect that Sora will be more useful for more-convincingly-faking speedrun content than for actually beating human players or becoming the TAS tool to end all TAS tools (aside from novel glitch discovery). This is primarily because there’s not a strong reason to believe that the model can trained to achieve extremely high-fidelity or high-precision tasks.