I don’t see what part of the graphs would lead to that conclusion. As the paper says, there are memorization, circuit formation, and cleanup phases. Everywhere along these curves, in all three phases, the network is building up or removing pieces of internal circuitry. Every time an elementary piece of circuitry is added or removed, that corresponds to moving into a different basin (convex subset?).
Points in the same basin are related by internal symmetries. They correspond to the same algorithm, not just in the sense of having the same input-output behavior on the training data (all points on the zero-loss manifold have that in common), but also in sharing common intermediate representations. If one solution has a piece of circuitry another doesn’t, they can’t be part of the same basin, because you can’t transform them into each other through internal symmetries.
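To make “related by internal symmetries” concrete, here is a minimal numpy sketch of the most familiar example: permuting the hidden units of a two-layer ReLU network (and the matching rows/columns of its weight matrices) gives a different point in parameter space that computes exactly the same function, intermediate representations included up to relabeling. This is only an illustration of one symmetry class, not a claim about which symmetries the paper has in mind; all names are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network: input dim 3, hidden dim 4, output dim 2.
W1 = rng.normal(size=(4, 3))   # hidden x input
b1 = rng.normal(size=4)
W2 = rng.normal(size=(2, 4))   # output x hidden

def forward(x, W1, b1, W2):
    h = np.maximum(W1 @ x + b1, 0.0)  # ReLU hidden layer
    return W2 @ h

# Internal symmetry: relabel the hidden units with a permutation.
# Permute rows of W1 and entries of b1, and the matching columns of W2.
perm = rng.permutation(4)
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

# The permuted parameters are a different point in weight space,
# but the function computed is identical on every input.
x = rng.normal(size=3)
assert np.allclose(forward(x, W1, b1, W2), forward(x, W1p, b1p, W2p))
```

By contrast, a solution that has grown an extra piece of circuitry can’t be reached from one that lacks it by any such relabeling, which is the sense in which they sit in different basins.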
So the network is moving through different basins all along those graphs.
… No?