Great explanation, thanks! Although I experienced deja vu (“didn’t you already tell me this?”) somewhere in the middle and skipped to comparisons to deep learning :)
One thing I didn’t see is a discussion of the setting of these “prior activations” that are hiding in the deeper layers of the network.
If you have dynamics where activations change faster than data, and data changes faster than weights, then the weights are slowly being trained to get low loss on images averaged out over time. So the weights will start to encode priors: if data changes continuously, the priors will be about continuous changes; if you’re suddenly flashing between different still frames, the priors will be about still frames (even if you’re resetting the activations in between).
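Concretely, the picture in my head is something like this toy loop (everything in it is my own invention, just to make the timescale separation explicit, not anything from your post):

```python
import numpy as np

# Toy illustration of the three timescales (all names and the linear model are
# made up by me, not taken from the post): activations move every step, the
# image changes every T_data steps, and the weights only move every T_weight
# steps, so each weight update effectively averages the loss over whatever
# images were shown in between -- that average is where the "prior" ends up.

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(16, 16))   # slow variable: weights
a = np.zeros(16)                      # fast variable: activations
x = rng.normal(size=16)               # current "image" (placeholder data)

T_data, T_weight = 50, 500            # data changes 10x faster than weights
grad_accum = np.zeros_like(W)

for t in range(5000):
    if t % T_data == 0:
        x = rng.normal(size=16)       # new image; swap this for a continuous
        a = np.zeros(16)              # stream vs. flashed stills to change
                                      # what statistics the weights absorb
    err = x - W @ a                   # prediction error under current weights
    a += 0.1 * (W.T @ err)            # fast: activations relax to explain x
    grad_accum += np.outer(err, a)    # weight gradient, accumulated over time
    if (t + 1) % T_weight == 0:
        W += 1e-3 * grad_accum / T_weight   # slow: weights see the time-average
        grad_accum[:] = 0
```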
Right?
Thanks!
I think your reasoning makes sense, and if, for instance, on every timestep you presented a different image in a stereotyped sequence, or with a certain correlation structure, you would indeed get information about those correlations into the weights. However, this model was designed to be used in the restricted setting where you show a single still image for many timesteps until convergence. In that setting, the weights give you image features for static images (in a hierarchical manner), and priors for low-level features feed back from activations in higher-level areas.
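For concreteness, here is a minimal toy sketch of that regime (a two-layer linear version I’m making up for illustration, not the actual model): relax the activations on each still image until they settle, then take one weight step, with the higher layer’s activations supplying the prior on the lower layer.

```python
import numpy as np

# Minimal two-layer linear sketch of the intended regime (my own toy version,
# not the actual model from the post): one still image per trial, relax the
# activations until they settle, then take a single weight step. The top-down
# prediction W2 @ a2 acts as the prior on the layer-1 activations.

rng = np.random.default_rng(0)
W1 = 0.1 * rng.normal(size=(64, 32))  # generative weights: image <- layer 1
W2 = 0.1 * rng.normal(size=(32, 16))  # generative weights: layer 1 <- layer 2

def run_trial(image, n_relax=200, lr_a=0.05, lr_w=1e-3):
    """Present one still image, relax activations, then update weights once."""
    global W1, W2
    a1, a2 = np.zeros(32), np.zeros(16)
    for _ in range(n_relax):              # fast loop: activations only
        e0 = image - W1 @ a1              # error at the image layer
        e1 = a1 - W2 @ a2                 # error at layer 1 (vs. top-down prior)
        a1 += lr_a * (W1.T @ e0 - e1)     # bottom-up drive minus prior pull
        a2 += lr_a * (W2.T @ e1)
    # slow step: weights updated once the activations have (roughly) converged
    W1 += lr_w * np.outer(image - W1 @ a1, a1)
    W2 += lr_w * np.outer(a1 - W2 @ a2, a2)

for _ in range(100):
    run_trial(rng.normal(size=64))        # each trial: a fresh static "image"
```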
There are extensions to this model that deal with video, with explicit spatiotemporal expectations built into the network. You can see one of those networks in this paper: https://arxiv.org/abs/2112.10048
But I’ve never implemented such a network myself.