So that’s likely why it works at all, and why larger and deeper networks are required to discover good generalities. You would think a larger and deeper network would have more weights to just memorize the answers but apparently you need it to explore multiple hypotheses in parallel.
At a high level this creates many redundant circuits that are using the same strategy, though, I wonder if there is a way to identify this and randomize the duplicates, causing the network to explore a more diverse set of hypotheses.
So that’s likely why it works at all, and why larger and deeper networks are required to discover good generalities. You would think a larger and deeper network would have more weights to just memorize the answers but apparently you need it to explore multiple hypotheses in parallel.
At a high level this creates many redundant circuits that are using the same strategy, though, I wonder if there is a way to identify this and randomize the duplicates, causing the network to explore a more diverse set of hypotheses.