Gerald Monroe comments on Visible loss landscape basins don’t correspond to distinct algorithms

Gerald Monroe 28 Jul 2023 17:12 UTC
2 points
So that’s likely why it works at all, and why larger and deeper networks are required to discover good generalizations. You would think a larger and deeper network would just use its extra weights to memorize the answers, but apparently it needs that capacity to explore multiple hypotheses in parallel.

At a high level, though, this creates many redundant circuits that use the same strategy. I wonder if there is a way to identify these and randomize the duplicates, causing the network to explore a more diverse set of hypotheses.
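
To make the idea concrete, here is a rough sketch of what "identify and randomize the duplicates" might look like for a single hidden layer. Everything in it is my own assumption rather than an established method: the function names `find_duplicate_units` and `randomize_units` are made up, activation correlation is only a crude proxy for "using the same strategy", and the 0.95 threshold and 0.1 init scale are arbitrary.

```python
import numpy as np

def find_duplicate_units(activations, threshold=0.95):
    """Return indices of units whose activation pattern nearly matches
    (|correlation| >= threshold) that of an earlier, already-kept unit.

    activations: array of shape (n_samples, n_units).
    Assumption: highly correlated activations mean a redundant circuit.
    """
    centered = activations - activations.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=0)
    norms[norms == 0] = 1.0  # constant (dead) units: avoid dividing by zero
    unit = centered / norms
    sim = np.abs(unit.T @ unit)  # (n_units, n_units) absolute correlations
    kept, duplicates = [], []
    for j in range(sim.shape[0]):
        if any(sim[j, k] >= threshold for k in kept):
            duplicates.append(j)
        else:
            kept.append(j)
    return duplicates

def randomize_units(w_in, duplicates, rng, scale=0.1):
    """Return a copy of w_in (n_inputs, n_units) with the incoming
    weights of the duplicate units re-drawn from N(0, scale^2)."""
    w_new = w_in.copy()
    w_new[:, duplicates] = rng.normal(
        0.0, scale, size=(w_in.shape[0], len(duplicates))
    )
    return w_new

# Toy layer: units 0-3 each read one input; unit 4 is an exact copy of unit 0.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))
w = np.concatenate([np.eye(4), np.eye(4)[:, :1]], axis=1)  # shape (4, 5)
h = np.tanh(x @ w)

dups = find_duplicate_units(h)
w_new = randomize_units(w, dups, rng)
print("duplicate units:", dups)
```

In a real network you would presumably also have to deal with the outgoing weights of the re-drawn units and repeat this during training, and it is an open question whether the randomized units would find a different hypothesis or just drift back to the same one.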