For a programmer who is not into symbolic math, would you say that the following summary is accurate enough, or did I miss some intuition here:
if an overparametrized network has linear dependencies between its parameters, it can behave as if it were an underparametrized network
but the trick is that a flat basin is easier for SGD or similar optimization processes to reach than a small target would be
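
To make the first point concrete, here is a toy sketch of what I mean, not a real network. It assumes a made-up model y_hat = (a + b) * x, where the two parameters a and b are linearly dependent in their effect, so only their sum matters. All names (loss, hessian, loss_shifted, and so on) are mine, chosen for illustration. The Hessian of the loss then has a zero eigenvalue, which is the "flat" direction: sliding along it does not change the loss at all.

```python
import numpy as np

# Hypothetical toy model: y_hat = (a + b) * x.
# Two parameters, but only their sum matters, so there is one effective parameter.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x  # data generated with an effective slope of 3


def loss(a, b):
    """Mean squared error of the toy model for parameters (a, b)."""
    return float(np.mean(((a + b) * x - y) ** 2))


# Analytic Hessian of the MSE with respect to (a, b):
# every second derivative equals 2 * mean(x^2).
hessian = 2.0 * np.mean(x ** 2) * np.ones((2, 2))
eigenvalues = np.linalg.eigvalsh(hessian)  # returned in ascending order

# Any (a, b) with a + b == 3 is a minimum: a whole line of solutions.
loss_at_solution = loss(1.0, 2.0)
loss_shifted = loss(-5.0, 8.0)  # moved far along the flat direction (1, -1)

print("Hessian eigenvalues:", eigenvalues)
print("loss at (1, 2):", loss_at_solution)
print("loss at (-5, 8):", loss_shifted)
```

In this toy case the set of minima is a line rather than a single point, which is the kind of "bigger target" the summary refers to. Whether this picture carries over to real networks is exactly the part I am unsure about.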