Currently, there isn’t much modeling of video data at anything like ImageNet scale. Static-image models are far more advanced (after years of supervised learning, e.g. from CAPTCHA-style labeling) than what we currently have for video classification.
Still, our best ML systems are very far from matching human reliability in real-world tasks such as driving, even after being fed enormous amounts of supervisory data from human experts, going through millions of reinforcement learning trials in virtual environments, and having hundreds of behaviors hardwired into them by engineers.
I imagine a large part of the world model described in the paper will come from video classification.
If you look at some examples of video classification problems, they label an entity along with an action, where the action comes from the additional temporal information that videos carry. I’m not sure whether this temporal information will feed directly from such a model into the paper’s world model, or whether the world model will instead rely on classification models for static images and store action information separately from the entities classified in the videos.
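To make the contrast concrete, here is a minimal sketch of the two representational options; it is my own illustration, not anything from the paper, and all class and field names (VideoLabel, FrameLabel, ActionRecord, WorldModelState) are hypothetical.

```python
# Sketch (not from the paper) of two ways a world model might ingest
# recognition outputs: (a) joint entity+action labels from a video
# classifier, or (b) per-frame entity labels plus separately stored actions.
from dataclasses import dataclass
from typing import List

@dataclass
class VideoLabel:
    """Joint output of a video classifier: entity plus action over a time span."""
    entity: str          # e.g. "person"
    action: str          # e.g. "opening a door"
    start_frame: int
    end_frame: int

@dataclass
class FrameLabel:
    """Output of a static-image classifier on a single frame."""
    entity: str
    frame_index: int

@dataclass
class ActionRecord:
    """Action information kept separately from entity classification."""
    action: str
    entity: str
    start_frame: int
    end_frame: int

class WorldModelState:
    """Toy container showing how either representation could populate a world model."""
    def __init__(self) -> None:
        self.events: List[ActionRecord] = []

    def ingest_video_labels(self, labels: List[VideoLabel]) -> None:
        # Option (a): temporal information comes directly from the video classifier.
        for lab in labels:
            self.events.append(
                ActionRecord(lab.action, lab.entity, lab.start_frame, lab.end_frame))

    def ingest_frames_and_actions(self, frames: List[FrameLabel],
                                  actions: List[ActionRecord]) -> None:
        # Option (b): entities come per-frame from a static-image classifier;
        # actions are stored separately and linked back to the entities
        # actually observed inside each action's time span.
        for act in actions:
            seen = {f.entity for f in frames
                    if act.start_frame <= f.frame_index <= act.end_frame}
            if act.entity in seen:
                self.events.append(act)
```

Either ingestion path ends up with the same event records; the difference is whether the temporal binding between entity and action is produced by the classifier itself or reconstructed afterwards.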
Of course, this example alone shows how much work needs to be done even to begin the preliminary phase of developing the models described in this paper, but I think the autonomy aspects presented in the paper can be investigated independently, before video classification models mature.
The autonomous parts seem to be largely based on human brain emulation, at least at the architectural level. The current state of the art in ML autonomy has been unsupervised learning with deep neural networks, as in DeepMind’s work. That is like using a single model, or an ensemble, for one set of tasks. This paper is more like using ensemble learning across different types of tasks that model different parts of the human brain, where the modules’ overall interactions produce the autonomous behavior and the context in which that autonomy operates.
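As a rough illustration of that contrast, here is a small sketch of several specialized modules interacting in one control loop rather than a single network trained for one task set. This is my own simplification under stated assumptions, not the paper’s architecture; the module names (PerceptionModule, WorldModelModule, CostModule, ActorModule) and their interfaces are hypothetical.

```python
# Sketch: multiple task-specific modules cooperating in one loop.
import random
from typing import Any, Dict, Iterable, Iterator, Tuple

class PerceptionModule:
    def observe(self, raw_input: Any) -> Dict[str, Any]:
        # Turn raw sensory input into a structured state estimate.
        return {"state": raw_input}

class WorldModelModule:
    def predict(self, state: Dict[str, Any], action: str) -> Dict[str, Any]:
        # Predict how the state would change under a candidate action.
        return {"state": (state["state"], action)}

class CostModule:
    def score(self, predicted: Dict[str, Any]) -> float:
        # Assign a cost to a predicted outcome (random stand-in here).
        return random.random()

class ActorModule:
    def act(self, state: Dict[str, Any], world_model: WorldModelModule,
            cost: CostModule, candidates: Tuple[str, ...]) -> str:
        # Pick the candidate action whose predicted outcome has the lowest cost.
        return min(candidates,
                   key=lambda a: cost.score(world_model.predict(state, a)))

def control_loop(raw_inputs: Iterable[Any],
                 candidates: Tuple[str, ...] = ("left", "right", "wait")) -> Iterator[str]:
    perception = PerceptionModule()
    world_model = WorldModelModule()
    cost = CostModule()
    actor = ActorModule()
    for raw in raw_inputs:
        state = perception.observe(raw)
        yield actor.act(state, world_model, cost, candidates)
```

The point of the sketch is only the shape of the system: the "autonomy" lives in how the modules are wired together, not in any single model.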