EDIT: as Ryan helpfully points out in the replies, the patent I refer to is actually about OpenAI’s earlier work, and thus shouldn’t be much of an update for anything.
Note that OpenAI has applied for a patent which, to my understanding, is about using a video generation model as a backbone for an agent that can interact with a computer. They describe their training pipeline as something roughly like:
Start with labeled video data (“receiving labeled digital video data;”)
Train an ML model to label the video data (“training a first machine learning model including an inverse dynamics model (IDM) using the labeled digital video data”)
Then, train a new model to generate video (“further training the first machine learning model or a second machine learning model using the pseudo-labeled digital video data to generate at least one additional pseudo-label for the unlabeled digital video.”)
Then, train the video generation model to predict actions (keyboard/mouse clicks) a user is taking from video of a PC (“2. The method of claim 1, wherein the IDM or machine learning model is trained to generate one or more predicted actions to be performed via a user interface without human intervention. [...] 4. The method of claim 2, wherein the one or more predicted actions generated include at least one of a key press, a button press, a touchscreen input, a joystick movement, a mouse click, a scroll wheel movement, or a mouse movement.”)
Now you have a model which can predict what actions to take given a recording of a computer monitor!
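To make my reading concrete, here is a minimal toy sketch of what a pipeline like this could look like (PyTorch; all names, shapes, and the random stand-in data are mine, not anything from the patent):

```python
# A minimal sketch of the pipeline as I read it (hypothetical names/shapes,
# not OpenAI's code). Stage 1: train an IDM on a small labeled set; Stage 2:
# use the IDM to pseudo-label a large unlabeled video corpus; Stage 3: train
# a downstream model on the pseudo-labeled data.
import torch
import torch.nn as nn

N_ACTIONS = 16   # toy action vocabulary (key presses, clicks, ...)
FRAME_DIM = 64   # toy per-frame feature size

class IDM(nn.Module):
    """Inverse dynamics model: predicts the action taken between two frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * FRAME_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, N_ACTIONS))

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=-1))

def train_idm(idm, labeled_pairs, epochs=3):
    """Stage 1: supervised training on the small labeled dataset."""
    opt = torch.optim.Adam(idm.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for f_t, f_t1, action in labeled_pairs:
            loss = loss_fn(idm(f_t, f_t1), action)
            opt.zero_grad()
            loss.backward()
            opt.step()

@torch.no_grad()
def pseudo_label(idm, unlabeled_pairs):
    """Stage 2: run the trained IDM over unlabeled video to infer actions."""
    return [(f_t, f_t1, idm(f_t, f_t1).argmax(dim=-1))
            for f_t, f_t1 in unlabeled_pairs]

# Toy stand-in data: batches of (frame_t, frame_t+1[, action]).
labeled = [(torch.randn(8, FRAME_DIM), torch.randn(8, FRAME_DIM),
            torch.randint(0, N_ACTIONS, (8,))) for _ in range(10)]
unlabeled = [(torch.randn(8, FRAME_DIM), torch.randn(8, FRAME_DIM))
             for _ in range(100)]

idm = IDM()
train_idm(idm, labeled)
pseudo_labeled = pseudo_label(idm, unlabeled)
# Stage 3 would train the downstream model (video model or agent, depending
# on your reading) on `pseudo_labeled` in the same supervised fashion.
```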
They even specifically mention the keyboard overlay setup you describe:
11. The method of claim 1, wherein the labeled digital video data comprises timestep data paired with user interface action data.
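For concreteness, here's my guess at what a single record of that kind of data might look like (purely hypothetical field names, not taken from the patent):

```python
# Hypothetical sketch of claim 11's "timestep data paired with user
# interface action data" — my guess at the shape, not the patent's format.
labeled_frame = {
    "timestep": 1042,            # index of the video frame
    "action": {
        "type": "key_press",     # one of the action kinds from claim 4
        "key": "w",
    },
}
```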
If you haven’t seen the patent (to my knowledge, basically no-one on LessWrong has?), then you get lots of Bayes points for predicting this setup!
I might be reading too much into the patent, but it seems to me that Sora is exactly the first half of the training setup described in that patent. So I would assume they’ll soon start working on the second half, which is the actual agent (if they haven’t already).
I think Sora is probably (the precursor of) a foundation model for an agent with a world model. I actually noticed this patent a few hours before Sora was announced, and I had the rough thought of “Oh wow, if OpenAI releases a video model, I’d probably think that agents were coming soon”. And a few hours later Sora comes out.
Interestingly, the patent contains information about hardware for running agents. I’m not sure how patents work and how much this actually implies OpenAI wants to build hardware, but sure is interesting that this is in there:
13. A system comprising:
at least one memory storing instructions;
at least one processor configured to execute the instructions to perform operations for training a machine learning model to perform automated actions,
Interestingly, the patent contains information about hardware for running agents. I’m not sure how patents work and how much this actually implies OpenAI wants to build hardware, but sure is interesting that this is in there:
I think the hardware description in the patent is just bullshit patent-ese. The patent lawyers maybe want to see things that look like other patents, and I think the patent system doesn’t really understand or handle software well. The hardware description is just a totally normal description of a setup for running a DNN.
I’ve read the patent a bit and I don’t think it’s about video generation, just about adding additional labels to unlabeled video.
Then, train a new model to generate video (“further training the first machine learning model or a second machine learning model using the pseudo-labeled digital video data to generate at least one additional pseudo-label for the unlabeled digital video.”)
This is just generating pseudo-labels for existing unlabeled video data. See the video pretraining work that this patent references.
AFAICT, this is very similar to the exact process used for OpenAI’s earlier Minecraft video pretraining work.
Edit: yep, this patent is about this video pretraining work.
Thanks a lot for the correction! Edited my comment.