UML XII: Dimensionality Reduction
(This is the twelfth post in a sequence on Machine Learning based on this book. Click here for part I.)
This post will be more on the mathy side (all of it linear algebra). It contains orthonormal matrices, eigenvalues and eigenvectors, and singular value decomposition.
I
Almost everything I've written about in this series thus far has been about what is called supervised learning, which means learning from labeled training data (each training point comes with the label we would like to predict). This even includes most of the theoretical work from chapters I-III, such as the PAC learning model. On the other side, there is also unsupervised learning, which is learning from data that carries no labels.
Dimensionality reduction is about taking data that is represented by vectors in some high-dimensional space and converting them to vectors in a lower-dimensional space, in a way that preserves meaningful properties of the data. One use case is to overcome computational hurdles in supervised learning – many algorithms have runtimes that increase exponentially with the dimension. In that case, we would essentially be doing the opposite of what we’ve covered in the second half of post VIII, which was about increasing the dimensionality of a data set to make it more expressive. However, dimensionality reduction can also be applied to unsupervised learning – or even to tasks that are outside of machine learning altogether. I once took a class on Information Visualization, and they covered the technique we’re going to look at in this post (although “covered” did not include anyone understanding what happens on a mathematical level). It is very difficult to visualize high-dimensional data; a reasonable approach to deal with this problem is to map it into 2d space, or perhaps 3d space, and then see whether there are any interesting patterns.
The aforementioned technique is called principal component analysis. I like it for two reasons: one, linear algebra plays a crucial role in actually finding the solution (rather than just proving some convergence bound); and two, even though the formalization of our problem will look quite natural, the solution turns out to have highly elegant structure. It’s an example of advanced concepts naturally falling out of studying a practical problem.
II
Here is the outline of how we’ll proceed:
Define the setting
Specify a formal optimization problem that captures the objective “project the data into a lower-dimensional space while losing as little information as possible”
Prove that the solution has important structure
Prove that it has even more important structure
Repeat steps #3 and #4
At that point, finding an algorithm that solves the problem will be trivial.
III
Let n,d∈N+ with d>n and let (x1,...,xm) be some data set, where xi∈Rd for all i∈[m]. We don’t assume the data has labels this time since we’ll have no use for them. (As mentioned, dimensionality reduction isn’t restricted to supervised learning.)
We now wish to find a linear map ϕ:Rd→Rn that will project our data points into the smaller space Rn while somehow not losing much information and another linear map ψ:Rn→Rd that will approximately recover it.
The requirement that ϕ be linear is a pretty natural one. There exists a bijection between the sets Rd and Rn (they have the same cardinality), so if arbitrary functions were allowed, it would be possible to define ϕ in such a way that no information is lost. However, if you've ever seen a bijection between R and R2 written out, you'll know that this is a complete non-starter (such functions do not preserve any intuitive notion of structure).
Another way to motivate the requirement is by noting how much is gained by prescribing linearity. The space of all functions between two sets is unfathomably vast, and linearity is a strong and well-understood property. If we assume a linear map, we can make much faster progress.
IV
Since we are assuming linearity, we can represent our linear maps as matrices. Thus, we are looking for a pair (A,B), where A is an n×d matrix and B a d×n matrix.
How much information the mapping loses can be measured by how well we can recover the original data. Thus, we assume our data points are first mapped to Rn according to A, then mapped back to Rd according to B, and we measure how much the points have changed. (Of course, we also square every distance because mathematicians love to square distances.) This leads to the objective
$$\operatorname*{argmin}_{A \in \mathrm{Mat}_{n,d},\; B \in \mathrm{Mat}_{d,n}} \; \sum_{i=1}^{m} \lVert x_i - BAx_i \rVert^2$$
We say that the pair (A,B) is a solution to our problem iff it is in the argmin above.
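To make the objective concrete, here is a minimal NumPy sketch (the helper name `reconstruction_error` and the toy data are mine, not from the book) that evaluates the squared reconstruction error for a given pair (A,B):

```python
import numpy as np

def reconstruction_error(A, B, xs):
    """Sum of squared distances ||x_i - B A x_i||^2 over the data set."""
    return sum(np.linalg.norm(x - B @ (A @ x)) ** 2 for x in xs)

# Toy example: m = 5 points in R^3, compressed to R^2 and back by random maps.
rng = np.random.default_rng(0)
xs = [rng.standard_normal(3) for _ in range(5)]
A = rng.standard_normal((2, 3))   # n x d
B = rng.standard_normal((3, 2))   # d x n
print(reconstruction_error(A, B, xs))
```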
V
Before we go about establishing the first structural property of our solution, let’s talk about matrices and matrix multiplication.
There are (at least) two different views one can take on matrices and matrix multiplication. The common one is what I hereby call the computational view because it emphasizes how matrix-vector and matrix-matrix multiplication is realized – namely, like this:
In this view, matrices and linear maps are sort of the same thing, at least in the sense that both can be applied to vectors.
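As a quick illustration of the computational view (a minimal sketch with an arbitrary example matrix of my own choosing), the i-th entry of Mv is the dot product of the i-th row of M with v:

```python
import numpy as np

M = np.array([[2, 0, 1],
              [1, 3, 0]])
v = np.array([1, 2, 3])

# Computational view: entry i of Mv is the dot product of row i of M with v.
by_hand = np.array([M[i, :] @ v for i in range(M.shape[0])])
print(by_hand)   # [5 7]
print(M @ v)     # the same result
```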
Since reading Linear Algebra Done Right, I’ve become increasingly convinced that emphasizing the computational view is a bad idea. The alternative perspective is what I hereby call the formal view because it is closer to the formal definition. Under this view, a matrix is a rectangular grid of numbers:
$$\begin{bmatrix} a_{1,1} & \cdots & a_{1,m} \\ \vdots & & \vdots \\ a_{n,1} & \cdots & a_{n,m} \end{bmatrix}$$
This grid (along with the bases of the domain space and target space) contains all the information about how a function works, but it is not itself a function. The crucial part here is that the first column of the matrix tells us how to write the image of the first basis vector of the domain space as a linear combination of the basis of the target space, i.e., ϕ(b1)=∑ni=1ai,1b′i if the two bases are (b1,...,bm) and (b′1,...,b′n). As far as this post is concerned, we only have two spaces, Rd and Rn, and they both come with their respective standard basis. Thus, if we assume n=2 and d=3 for the sake of an example, then the d×n matrix
$$\begin{bmatrix} 1 & 4 \\ 2 & 2 \\ 5 & 7 \end{bmatrix}$$
tells us that the vector (1,0) is mapped onto (1,2,5) and the vector (0,1) onto (4,2,7). And that is all; this defines the linear map which corresponds to the matrix.
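A small sketch of the formal view, using the example matrix above: each standard basis vector is sent to the corresponding column.

```python
import numpy as np

# The 3x2 example matrix; its columns are the images of the standard basis vectors.
M = np.array([[1, 4],
              [2, 2],
              [5, 7]])

print(M @ np.array([1, 0]))   # [1 2 5], the first column
print(M @ np.array([0, 1]))   # [4 2 7], the second column
```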
Another consequence of this view is that I exclusively draw matrices that have rectangular brackets rather than round brackets. But I’m not devoted to the formal view – we won’t exclusively use it; instead, we will use whichever one is more useful for any given problem. And I won’t actually go through the trouble of cleanly differentiating matrices and maps; we’ll usually pretend that matrices can send vectors to places themselves.
With that out of the way, the first thing we will prove about our solution is that it takes the form (∙,V), where V is a pseudo-orthonormal matrix. (“It takes the form” means “there is a pair in the argmin of our minimization problem that has this form.”) A square matrix is orthonormal if its columns (or equivalently, its rows), read as vectors, form an orthonormal set: they all have inner product 0 with each other and have norm 1. However, this property has a variant that applies to non-square matrices, and since there is no standard terminology for such a matrix, I call it pseudo-orthonormal. In our case, the matrix V will do this:
I.e., it will go from a small space to a larger space. Thus, it looks like this:
$$\begin{bmatrix} v_{1,1} & \cdots & v_{1,n} \\ v_{2,1} & \cdots & v_{2,n} \\ \vdots & & \vdots \\ v_{d,1} & \cdots & v_{d,n} \end{bmatrix}$$
Or alternatively, this:
$$\begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix}$$
where we write the entries as column vectors, i.e., $v_i = (v_{1,i}, \ldots, v_{d,i})^T$.
For a proper orthonormal matrix, both the column vectors and the row vectors are orthonormal. For a d×n pseudo-orthonormal matrix, just the vi (i.e., the column vectors) are orthonormal. The row vectors can’t be (unless many of them are zero vectors) because they’re linearly dependent; there are d many, and they live in n-dimensional space.
For a square matrix, the requirement that it be orthonormal is equivalent to the equations VTV=I and VVT=I, or simply VT=V−1. A d×n pseudo-orthonormal matrix isn't square and doesn't have an inverse, but the first of the two equations still holds (and is equivalent to the requirement that the matrix be pseudo-orthonormal). To see why, consider the matrix product in the computational view. Instead of flipping the first matrix, you can just imagine both copies side by side, going vertical in both directions, i.e.:
This shows that (VTV)k,ℓ=⟨vk,vℓ⟩, and this is 1 iff k=ℓ and 0 otherwise because the vi are orthonormal. Thus, VTV=In.
Now recall that we wish to construct a pseudo-orthonormal matrix V and prove that there exists a solution with V, i.e., a solution of the form (∙,V).
To construct V, note that (1,0,...,0) is mapped to the first column vector of any matrix (formal view), and analogously for every other basis vector. It follows that the image of a matrix is just the span of its column vectors. Thus, im(B) is a subspace of Rd of dimension at most n, and we can choose an orthonormal basis (v1,...,vn) of an n-dimensional subspace containing it. Then we define V:=[v1⋯vn] as in the example above (only now the vi are the specific vectors from this orthonormal basis). Note that V is indeed a d×n matrix, and that im(B)⊆im(V) by construction (formal view).
We will also have use for VT (quite a lot, in fact). Let’s add it into the picture:
(Note that, even though the arrow for VT starts from the small circle, it can take vectors from all of Rd.)
We've established that VTV is the identity matrix, but VVT isn't. This can also be seen from the picture: VVT is a map from Rd to Rd, but it maps all vectors into a subspace of Rd. Therefore, all vectors outside of the subspace have to be changed. But what about the vectors that are in the subspace? Could it be that it leaves those unchanged? The answer is affirmative; here's the proof:
Let y∈im(V). Then ∃z∈Rn such that Vz=y, and therefore, VVTy=VVT(Vz)=V(VTV)z=Vz=y.
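Both facts are easy to check numerically. Here is a sketch in which I build a pseudo-orthonormal V by orthonormalizing a random d×n matrix with a QR decomposition (this construction is mine, chosen purely for convenience):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 2

# Orthonormalize the columns of a random d x n matrix to get a pseudo-orthonormal V.
V, _ = np.linalg.qr(rng.standard_normal((d, n)))

print(np.allclose(V.T @ V, np.eye(n)))   # True: V^T V = I_n
print(np.allclose(V @ V.T, np.eye(d)))   # False: V V^T is not the identity on R^d

# ...but V V^T leaves vectors inside im(V) unchanged:
z = rng.standard_normal(n)
y = V @ z                                # y lies in im(V)
print(np.allclose(V @ V.T @ y, y))       # True
```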
This fact is going to be important in just a second. Recall that we were here:
and we wanted to prove that there’s a solution of the form (∙,V). Here’s the proof:
Let (A,B) be a solution. Then BA is our optimal back-and-forth mapping. However, BA=VVTBA=V(VTBA), and, therefore, the pair (VTBA,V) realizes the same back-and-forth mapping and thus is also a solution.
Note that BA=VVTBA holds only because the matrix BA sends all vectors into the subspace im(B), which is contained in im(V), and therefore VVT doesn't change anything.
VI
Next, we prove that there is a solution of the form (VT,V). We will do this by writing down a term that describes the distance we're trying to minimize and setting its gradient to zero. Thus, we assume x∈Rd and y∈Rn and want to figure out how to choose y such that the term ||x−Vy|| is minimal. Using that ||a−b||2=⟨a,a⟩−2⟨a,b⟩+⟨b,b⟩ and that ⟨a,b⟩=aTb=bTa, we have:
$$\lVert x - Vy \rVert^2 = x^Tx - 2y^TV^Tx + y^TV^TVy = \lVert x \rVert^2 - 2y^TV^Tx + \lVert y \rVert^2$$
The gradient of this term with respect to y is −2VTx+2y. Setting it to 0 yields y=VTx. Thus, for each point x∈Rd, the vector y in Rn that will make the distance ||x−Vy|| smallest has the form VTx. In particular, this is true for all of our training points xi. Since we have shown that there exists a solution of the form (∙,V), this implies that (VT,V) is such a solution.
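As a sanity check (a sketch, with the same QR-based construction of V as before), the least-squares solution of Vy ≈ x coincides with VTx:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 2
V, _ = np.linalg.qr(rng.standard_normal((d, n)))   # pseudo-orthonormal V
x = rng.standard_normal(d)

y_formula = V.T @ x                                # the solution from the gradient
y_lstsq, *_ = np.linalg.lstsq(V, x, rcond=None)    # argmin_y ||x - V y||
print(np.allclose(y_formula, y_lstsq))             # True
```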
VII
Before we prove that our solution has even more structure, let's reflect on what it means that it has the form (VT,V) with V pseudo-orthonormal. Most importantly, it means that the column vectors of V, i.e., the directions onto which we project, are all orthonormal to each other (formal view). We call those vectors the principal components of our data set, and they give the technique its name. You can think of them as indicating the directions that are most meaningful for capturing what is going on in our data set. The more principal components we look at (i.e., the larger an n we choose), the more information we preserve. Conversely, if a few directions already capture most of the variance in the data, then (VT,V) may be meaningful even though n is small.
For more on this perspective, I recommend this video (on at least 1.5 speed).
VIII
Based on what we have established, we can write our optimization problem as
$$\operatorname*{argmin}_{V \in \mathrm{Mat}_{d,n},\ V \text{ orthonormal}} \; \sum_{i=1}^{m} \lVert x_i - VV^Tx_i \rVert^2$$
Looking at just one of our xi, we can rewrite the distance term like so:
$$\lVert x_i - VV^Tx_i \rVert^2 = \lVert x_i \rVert^2 - 2x_i^TVV^Tx_i + x_i^TVV^TVV^Tx_i = \lVert x_i \rVert^2 - x_i^TVV^Tx_i$$
then, since ||xi||2 does not depend on V, we can further rewrite our problem as
$$\operatorname*{argmax}_{V \in \mathrm{Mat}_{d,n},\ V \text{ orthonormal}} \; \sum_{i=1}^{m} x_i^TVV^Tx_i$$
Now we perform a magic trick. The term xiTVVTxi may just be considered a number, but it can also be considered a 1×1 matrix. In that case, it is equal to its trace. (The trace of a square matrix is the sum of its diagonal entries). The trace has the property that one can “rotate” the matrix product it is applied to, so
$$\operatorname{trace}(M_1M_2\cdots M_k) = \operatorname{trace}(M_2\cdots M_kM_1)$$
(This follows from the fact that trace(AB)=trace(BA), which is easy to prove.) It also has the property that, if ATB is square (but A and B are not), then trace(ATB)=trace(ABT); this will be important later. If we apply rotation to our case, we get
$$x_i^TVV^Tx_i = \operatorname{trace}(x_i^TVV^Tx_i) = \operatorname{trace}(VV^Tx_ix_i^T) = \operatorname{trace}(V^Tx_ix_i^TV)$$
and, since the trace is linear,
$$\sum_{i=1}^{m} \operatorname{trace}(V^Tx_ix_i^TV) = \operatorname{trace}\Bigl(V^T\Bigl(\sum_{i=1}^{m} x_ix_i^T\Bigr)V\Bigr)$$
The key component here is the matrix X:=∑mi=1xixiT. Note that X does not depend on V. We will finish deriving our technique by showing that the columns of the optimal V are precisely the n eigenvectors of X corresponding to its n largest eigenvalues.
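The rewriting of the objective as a trace can also be checked numerically; here is a sketch (the variable names are mine) comparing both sides on random data:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, m = 5, 2, 20
V, _ = np.linalg.qr(rng.standard_normal((d, n)))
xs = [rng.standard_normal(d) for _ in range(m)]

# Objective as a plain sum ...
lhs = sum(x @ V @ V.T @ x for x in xs)

# ... and as trace(V^T X V) with X = sum_i x_i x_i^T.
X = sum(np.outer(x, x) for x in xs)
rhs = np.trace(V.T @ X @ V)

print(np.isclose(lhs, rhs))   # True
```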
IX
Many matrices have something called an eigendecomposition (for our symmetric, positive semi-definite X this coincides with the singular value decomposition) – this is a decomposition X=SDS−1 such that S consists of the eigenvectors of X (as column vectors) and D is diagonal, i.e.:
$$S = \begin{bmatrix} w_1 & \cdots & w_d \end{bmatrix} \qquad D = \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_d \end{bmatrix}$$
The matrix D can be considered to represent the same function as X but with regard to a different basis, namely the basis of the eigenvectors of X. Thus, the decomposition exists iff there exists a basis that consists only of eigenvectors of X.
It is possible to check that the matrix S which realizes this decomposition is, in fact, the matrix with the eigenvectors of X as column vectors. Consider the first eigenpair (w1,λ1) of X. We have Xw1=λ1w1. On the other hand, S sends the vector (1,0,...,0) to w1 (formal view), and therefore, S−1 sends w1 to (1,0,...,0). Then D sends (1,0,...,0) to (λ1,0,...,0) because it's a diagonal matrix, and S sends (λ1,0,...,0) to λ1S(1,0,...,0), which is λ1w1 (formal view). In summary,
$$SDS^{-1}w_1 = SD(1,0,\ldots,0) = S(\lambda_1,0,\ldots,0) = \lambda_1w_1 = Xw_1$$
This shows that SDS−1 does the right things on the vectors w1,...,wd, which is already a proper proof that the decomposition works since (w1,...,wd) is, by assumption, a basis of Rd.
In our case, X is symmetric; this implies that there exists a decomposition such that S is orthonormal. (We won’t show this.) Note that, unlike V, the matrix S is a square d×d matrix, hence ST=S−1 and the decomposition can be written as X=SDST.
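For our symmetric X, NumPy's `eigh` returns exactly such a decomposition: an orthonormal S and the eigenvalues that sit on the diagonal of D. A short sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 4, 10
xs = [rng.standard_normal(d) for _ in range(m)]
X = sum(np.outer(x, x) for x in xs)      # symmetric (and positive semi-definite)

eigvals, S = np.linalg.eigh(X)           # eigenvalues ascending, eigenvectors as columns of S
D = np.diag(eigvals)

print(np.allclose(S.T @ S, np.eye(d)))   # True: S is orthonormal
print(np.allclose(S @ D @ S.T, X))       # True: X = S D S^T
```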
Recall that the term we want to maximize is trace(VTXV). We can rewrite this as VTXV=VTSDSTV=[STV]TD[STV]. The matrix [STV] is a d×n matrix, so not square, but it is pseudo-orthonormal, since [STV]T[STV]=VTSSTV=VTV=I. Using the previously mentioned trace rules (which allow us to take the transpose of both [STV]TD and [STV]), we can write
$$\operatorname{trace}(V^TXV) = \operatorname{trace}\bigl([S^TV]^TD[S^TV]\bigr) = \operatorname{trace}\bigl(D[S^TV][S^TV]^T\bigr)$$
Multiplying any matrix by a diagonal matrix D from the left multiplies every entry in row i by Di,i, and in particular multiplies the diagonal element ([STV][STV]T)i,i by Di,i. Therefore, the above equals ∑di=1Di,i([STV][STV]T)i,i. Now recall that [STV] is pseudo-orthonormal and looks like this:
$$\begin{bmatrix} z_{1,1} & \cdots & z_{1,n} \\ z_{2,1} & \cdots & z_{2,n} \\ \vdots & & \vdots \\ z_{d,1} & \cdots & z_{d,n} \end{bmatrix}$$
The column vectors are properly orthonormal; they all have norm 1. The row vectors are not orthonormal, but if the matrix were extended to a full orthonormal d×d matrix, they would become orthonormal vectors; thus, their norm is at most 1. Moreover, the sum of all squared norms of row vectors equals the sum of all squared norms of column vectors (both are just the sum of every matrix entry squared), which is just n. Since ([STV][STV]T)i,i is precisely the squared norm of the i-th row vector, it follows that each ([STV][STV]T)i,i is at most 1 and that the sum ∑di=1([STV][STV]T)i,i is at most n. Now consider the term we wish to maximize:
$$\sum_{i=1}^{d} D_{i,i}\,\bigl([S^TV][S^TV]^T\bigr)_{i,i}$$
Given the above constraints, the way to make it largest is to weight the n largest Di,i with 1 and the rest with 0. It's not obvious that this can be achieved, but it is clear that the term cannot get any larger. Thus,
$$\sum_{i=1}^{d} D_{i,i}\bigl([S^TV][S^TV]^T\bigr)_{i,i} \le \sum_{i=1}^{n} D^*_{i,i} = \sum_{i=1}^{n} \lambda^*_i,$$
where D∗ is D reordered such that the largest n eigenvalues, which we call λ∗1,...,λ∗n, come first.
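The two properties used in the bound (the diagonal entries of [STV][STV]T are at most 1, and they sum to n) can be verified numerically as well; here is a sketch with a randomly chosen pseudo-orthonormal V:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, m = 4, 2, 10
xs = [rng.standard_normal(d) for _ in range(m)]
X = sum(np.outer(x, x) for x in xs)
eigvals, S = np.linalg.eigh(X)                     # ascending eigenvalues

V, _ = np.linalg.qr(rng.standard_normal((d, n)))   # an arbitrary pseudo-orthonormal V
diag = np.diag(S.T @ V @ (S.T @ V).T)              # diagonal of [S^T V][S^T V]^T

print(np.all(diag <= 1 + 1e-9))                    # each entry is at most 1
print(np.isclose(diag.sum(), n))                   # the entries sum to n
print(eigvals @ diag <= eigvals[-n:].sum() + 1e-9) # the bound from the text holds
```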
Now recall that we have previously written our term as trace(VT(∑mi=1xixiT)V), or equivalently trace(VTXV). If we choose V=[w∗1⋯w∗n], where w∗1,...,w∗n are the eigenvectors of X corresponding to λ∗1,...,λ∗n, we get XV=X[w∗1⋯w∗n]=[λ∗1w∗1⋯λ∗nw∗n] and VTXV=VT[λ∗1w∗1⋯λ∗nw∗n]. The i-th entry on the diagonal of this matrix is w∗iTλ∗iw∗i=λ∗i||w∗i||2=λ∗i, so trace(VTXV)=∑ni=1λ∗i. Since we have established that this is an upper bound, this choice of V is a solution to our original problem, namely argminV∈Matd,n,V orthonormal∑mi=1||xi−VVTxi||2.
X
As promised, the algorithm is now trivial – it fits into three lines:
Given x1,...,xm∈Rd, compute the matrix X:=∑mi=1xixiT, then compute the eigenvalues λ1,...,λd of X, with eigenvectors w1,...,wd. Take the n eigenvectors with the largest eigenvalues and put them into a matrix V. Output (VT,V).
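And here is the whole procedure as a short NumPy function (a sketch following the three lines above; the function name `pca` is mine, and, matching the text, the data is not centered first):

```python
import numpy as np

def pca(xs, n):
    """Return (A, B) = (V^T, V), where the columns of V are the n eigenvectors
    of X = sum_i x_i x_i^T with the largest eigenvalues."""
    X = sum(np.outer(x, x) for x in xs)
    eigvals, eigvecs = np.linalg.eigh(X)   # eigenvalues in ascending order
    V = eigvecs[:, -n:]                    # columns for the n largest eigenvalues
    return V.T, V

# Example: compress 3-dimensional points to 2 dimensions and reconstruct them.
rng = np.random.default_rng(5)
xs = [rng.standard_normal(3) for _ in range(50)]
A, B = pca(xs, 2)
print(sum(np.linalg.norm(x - B @ A @ x) ** 2 for x in xs))
```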