I think your thinking makes sense, and, if for instance on every timestep you presented a different images in a stereotypically defined sequence, or with a certain correlation structure, you would indeed get information about those correlations in the weights. However, this model was designed to be used in the restricted to settings where you show a single still image for many timesteps until convergence. In that setting, weights give you image features for static images (in a heirarchical manner), and priors for low level features will feed back from activations in higher level areas.
There are extensions to this model that deal with video, where there are explicit spatiotemporal expectations built into the network. you can see one of those networks in this paper: https://arxiv.org/abs/2112.10048
Thanks!
I think your thinking makes sense, and, if for instance on every timestep you presented a different images in a stereotypically defined sequence, or with a certain correlation structure, you would indeed get information about those correlations in the weights. However, this model was designed to be used in the restricted to settings where you show a single still image for many timesteps until convergence. In that setting, weights give you image features for static images (in a heirarchical manner), and priors for low level features will feed back from activations in higher level areas.
There are extensions to this model that deal with video, where there are explicit spatiotemporal expectations built into the network. you can see one of those networks in this paper: https://arxiv.org/abs/2112.10048
But I’ve never implemented such a network myself.