Interesting! I think this might not actually enforce a prior though, in the sense that the later-stages of the network can just scale up their output magnitudes to compensate for the probability-based dampening.
I’m not sure, but I could imagine an activation representing a counter of “how many steps have I been thinking for” is a useful feature encoded in many such networks.
Interesting! I think this might not actually enforce a prior though, in the sense that the later-stages of the network can just scale up their output magnitudes to compensate for the probability-based dampening.
Getting massively out of my depth here, but is that an easy thing to do given the later stages will have to share weights with early stages?
I’m not sure, but I could imagine an activation representing a counter of “how many steps have I been thinking for” is a useful feature encoded in many such networks.