I don’t yet understand this proposal. In what way do we decompose this parameter tangent space into “lottery tickets”? Are the lottery tickets the cross product of subnetworks and points in the parameter tangent space? The subnetworks alone? If the latter then how does this differ from the original lottery ticket hypothesis?
The tangent space version is meant to be a fairly large departure from the original LTH; subnetworks are no longer particularly relevant at all.
We can imagine a space of generalized-lottery-ticket-hypotheses, in which the common theme is that we have some space of models chosen at initialization (the “tickets”), and SGD mostly “chooses a winning ticket”—i.e. upweights one of those models and downweights others, as opposed to constructing a model incrementally. The key idea is that it’s a selection process rather than a construction process—in particular, that’s the property mainly relevant to inner alignment concerns.
This space of generalized-LTH models would include, for instance, Solomonoff induction: the space of “tickets” is the space of programs; SI upweights programs which perfectly predict the data and downweights the rest. Inner agency problems are then particularly tricky because deceptive optimizers are “there from the start”. It’s not like e.g. biological evolution, where it took a long time for anything very agenty to be built up.
In the tangent space version of the hypothesis, the “tickets” are exactly the models within the tangent space of the full NN model at initialization: each value of Δθ yields a corresponding function-of-the-NN-inputs F(X) (via linear approximation w.r.t. θ), and each such function is a “ticket”.
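To make the "tickets" concrete, here is a small sketch (my own toy example, with made-up numbers, not anything from the discussion above): with a tiny model f(x, θ), each choice of Δθ picks out the linearized function F(X) = f(X, θ₀) + J(X)·Δθ, where J is the Jacobian of f with respect to θ at initialization. Each such F is one "ticket".

```python
import numpy as np

def f(x, theta):
    """Toy 'network': a one-hidden-unit model f(x) = theta[1] * tanh(theta[0] * x)."""
    return theta[1] * np.tanh(theta[0] * x)

def jacobian(x, theta, eps=1e-6):
    """Numerical Jacobian of f w.r.t. theta at a single input x (central differences)."""
    grads = np.zeros_like(theta)
    for i in range(len(theta)):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[i] += eps
        t_minus[i] -= eps
        grads[i] = (f(x, t_plus) - f(x, t_minus)) / (2 * eps)
    return grads

theta0 = np.array([0.5, 1.2])   # parameters at initialization (arbitrary values)
dtheta = np.array([0.1, -0.3])  # one "ticket": a direction in parameter space

def ticket(x):
    """The linearized model F(X) = f(X, theta0) + J(X) . dtheta."""
    return f(x, theta0) + jacobian(x, theta0) @ dtheta

x = 2.0
print(ticket(x))              # ≈ 0.7862: value of this "ticket" at x
print(f(x, theta0 + dtheta))  # ≈ 0.7503: the actual nearby network — close, not identical
```

The last two prints illustrate the point below about extrapolating the linear approximation: the ticket agrees with the real network only approximately, and the agreement degrades as Δθ grows.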
I am interested in making this understood by more people, so please let me know if some part of it does not make sense to you.
I confess I don’t really understand what a tangent space is, even after reading the wiki article on the subject. It sounds like it’s something like this: take a particular neural network, then consider the “space” of possible neural networks that are extremely similar to it, i.e. they have the same architecture but weights that differ slightly, for some definition of “slightly.” That’s the tangent space. Is this correct? What am I missing?
Picture a linear approximation, like this:
The tangent space at point a is that whole line labelled “tangent”.
The main difference between the tangent space and the space of neural-networks-for-which-the-weights-are-very-close is that the tangent space extrapolates the linear approximation indefinitely; it’s not just limited to the region near the original point. (In practice, though, that difference does not actually matter much, at least for the problem at hand—we do stay close to the original point.)
The reason we want to talk about “the tangent space” is that it lets us precisely state things like Newton’s method in terms of search: Newton’s method finds a point at which f(x) is approximately 0 by finding the point where the tangent space hits zero (i.e. where the line in the picture above hits the x-axis). So, the tangent space effectively specifies the “search objective” or “optimization objective” for one step of Newton’s method.
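Spelled out in code (a standard textbook construction, not specific to this discussion): the tangent line at x₀ is t(x) = f(x₀) + f′(x₀)(x − x₀), and one Newton step jumps to the x where that line crosses zero.

```python
def newton_step(f, df, x0):
    """One Newton step: the x where the tangent line at x0 hits zero.
    Tangent line: t(x) = f(x0) + df(x0) * (x - x0); solve t(x) = 0 for x."""
    return x0 - f(x0) / df(x0)

# Example: find sqrt(2) as a root of f(x) = x^2 - 2.
f = lambda x: x**2 - 2
df = lambda x: 2 * x

x = 1.0
for _ in range(5):
    x = newton_step(f, df, x)
print(x)  # converges to sqrt(2) ≈ 1.414213562
```

Each iteration builds a fresh tangent line at the current point and "searches" for its zero, which is exactly the sense in which the tangent space specifies the per-step objective.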
In the NTK/GP model, neural net training is functionally identical to one step of Newton’s method (though it’s Newton’s method in many dimensions, rather than one).
The tangent space at a point a is tangent to what manifold?
I recommend just reading the math here. Leave a comment if it’s unclear after that.