“we don’t currently know how to differentiably vary the size of the NN being run. We can certainly imagine NNs being rolled-out a fixed number of times (like RNNs), where the number of rollouts is controllable via a learned parameter, but this parameter won’t be updateable via a standard gradient.”
Is this really true? I can think of a way to do this in a standard gradient type way.
Also, it looks like there is a paper by someone who works in ML where they do this (Graves' Adaptive Computation Time, 2016): https://arxiv.org/abs/1603.08983
TL;DR: at each rollout, have a neuron that represents the halting probability, then make the result of the rollout the sum of the output vectors at each step, weighted by the probability that the network halted at that step.
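The trick described above (essentially Graves' Adaptive Computation Time) can be sketched in a few lines of NumPy. Everything here other than the weighted-sum-by-halting-probability idea itself is an illustrative assumption of mine: the toy dimensions, the random weights, and forcing a halt on the final step so the halting probabilities sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and random weights (illustrative assumptions only).
HIDDEN, OUT, MAX_STEPS = 8, 4, 5
W_h = rng.normal(scale=0.5, size=(HIDDEN, HIDDEN))  # recurrent weights, shared across rollouts
W_o = rng.normal(scale=0.5, size=(HIDDEN, OUT))     # readout weights
w_halt = rng.normal(scale=0.5, size=HIDDEN)         # the "halting neuron" weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rollout(h):
    """Run the shared cell up to MAX_STEPS, mixing outputs by halting probability."""
    still_running = 1.0           # probability the network has NOT halted yet
    mixed_output = np.zeros(OUT)
    for t in range(MAX_STEPS):
        h = np.tanh(h @ W_h)
        p_halt = sigmoid(h @ w_halt)      # halting probability at this step
        if t == MAX_STEPS - 1:
            p_halt = 1.0                  # force a halt on the last step (assumption)
        weight = still_running * p_halt   # P(halt exactly at step t)
        mixed_output += weight * (h @ W_o)  # weighted sum of per-step outputs
        still_running *= (1.0 - p_halt)
    return mixed_output

out = rollout(rng.normal(size=HIDDEN))
print(out.shape)  # (4,)
```

Because the per-step halting weights enter the output as ordinary multiplications, gradients flow through them, so the "how long to think" behaviour is trainable with a standard gradient, which is the point being made.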
Interesting! I think this might not actually enforce a prior though, in the sense that the later stages of the network can just scale up their output magnitudes to compensate for the probability-based dampening.
Getting massively out of my depth here, but is that an easy thing to do, given that the later stages will have to share weights with the early stages?
I’m not sure, but I could imagine that an activation representing a counter of “how many steps have I been thinking for” would be a useful feature encoded in many such networks.
Just want to point to a more recent (2021) paper implementing adaptive computation by some DeepMind researchers (PonderNet) that I found interesting when I was looking into this:
https://arxiv.org/pdf/2107.05407.pdf