“we don’t currently know how to differentiably vary the size of the NN being run. We can certainly imagine NNs being rolled-out a fixed number of times (like RNNs), where the number of rollouts is controllable via a learned parameter, but this parameter won’t be updateable via a standard gradient.”
Is this really true? I can think of a way to do this in a standard gradient type way.
Also, it looks like there is a paper by someone who works in ML where they do this (Graves' Adaptive Computation Time, 2016): https://arxiv.org/abs/1603.08983
TL;DR: at each rollout, have a neuron that represents the halting probability, then make the result of the rollout the sum of the output vectors at each step, weighted by the probability that the network halted at that step.
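The trick described above (essentially Graves' Adaptive Computation Time) can be sketched in a few lines of NumPy. Everything here other than the weighted-sum-by-halting-probability idea itself is an illustrative assumption of mine: the toy dimensions, the random weights, and forcing a halt on the final step so the halting probabilities sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and random weights (illustrative assumptions only).
HIDDEN, OUT, MAX_STEPS = 8, 4, 5
W_h = rng.normal(scale=0.5, size=(HIDDEN, HIDDEN))  # recurrent weights, shared across rollouts
W_o = rng.normal(scale=0.5, size=(HIDDEN, OUT))     # readout weights
w_halt = rng.normal(scale=0.5, size=HIDDEN)         # the "halting neuron" weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rollout(h):
    """Run the shared cell up to MAX_STEPS, mixing outputs by halting probability."""
    still_running = 1.0           # probability the network has NOT halted yet
    mixed_output = np.zeros(OUT)
    for t in range(MAX_STEPS):
        h = np.tanh(h @ W_h)
        p_halt = sigmoid(h @ w_halt)      # halting probability at this step
        if t == MAX_STEPS - 1:
            p_halt = 1.0                  # force a halt on the last step (assumption)
        weight = still_running * p_halt   # P(halt exactly at step t)
        mixed_output += weight * (h @ W_o)  # weighted sum of per-step outputs
        still_running *= (1.0 - p_halt)
    return mixed_output

out = rollout(rng.normal(size=HIDDEN))
print(out.shape)  # (4,)
```

Because the per-step halting weights enter the output as ordinary multiplications, gradients flow through them, so the "how long to think" behaviour is trainable with a standard gradient, which is the point being made.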
Interesting! I think this might not actually enforce a prior though, in the sense that the later stages of the network can just scale up their output magnitudes to compensate for the probability-based dampening.
Getting massively out of my depth here, but is that an easy thing to do, given that the later stages will have to share weights with the early stages?
I’m not sure, but I could imagine that an activation representing a counter of “how many steps have I been thinking for” would be a useful feature encoded in many such networks.
Just want to point to a more recent (2021) paper implementing adaptive computation by some DeepMind researchers (PonderNet) that I found interesting when I was looking into this:
https://arxiv.org/pdf/2107.05407.pdf