Linear Algebra Done Right, Axler
Epistemic status: A brisk walkthrough of (what I take to be) the highlights of this book’s contents.
The big one for mathematically understanding ML!
The idea responsible for getting me excited about linear algebra is:
Linear maps are the homomorphisms between vector spaces.
Linear algebra is about the tripartite relationship between (1) homomorphisms[1] between vector spaces, (2) sets of equations, and (3) grids of numbers.
However, grids of numbers (‘matrices’), the usual star of the show in a presentation of linear algebra, aren’t foregrounded in this book. Instead, this is a book chiefly treating the homomorphisms (‘linear maps’) themselves, directly.
Contents and Notes
1. Vector Spaces
Vector spaces are fairly substantial mathematical structures, if you’re pivoting out of thinking about set theory! Intuitively, a vector space is a space for which (1) ray addition and (2) scaling rays (emanating from the origin out to points)[2] are both nicely defined.
Precisely, a vector space is a set defined over a field [3] in which
V is closed under vector addition, and vector addition is commutative, associative, there is an additive identity , and there is an additive inverse for every vector ;
V is closed under scalar multiplication, scalar multiplication is associative, and there is a multiplicative identity ;
and vector addition and scalar multiplication are connected by distribution such that, for all and ,[4]
A subspace of a vector space is any subset that is still itself a vector space, under the same two operations of . Vector spaces can be decomposed into their subspaces, where you think of adding vectors drawn the different subspace via their common addition operation.
2. Finite-Dimensional Vector Spaces
You live at the origin of , and your tools are the vectors that emanate out from your home. Because we have both vector addition and scalar multiplication, we have two ways of extending (or shortening) any single vector out from the origin arbitrarily far. If we’re interested in reaching points in , one immediate way to get to points we didn’t have a vector directly to… is by extending a too-short vector pointed in the right direction! Furthermore, because we can always multiply a vector by to reverse its direction, both the exactly right and exactly wrong directions will suffice to reach out and touch a point in .
We can also use vector addition to add two vectors pointing off in differing directions (directions which aren’t exact opposites). If we have vectors , , and ,[5] we have all the tools we need to produce any vector in ! The awkward lengths of all the vectors are irrelevant, because we can scale all of them arbitrarily. We use some amount of vertical, horizonal, and -dimensional[6] displacement to get to anywhere via addition and multiplication! More formally, we say that the set spans .
Intuitively, a minimal spanning set is called a basis for a vector space. is a basis for the vector space , because none of the vectors are “redundant”: you could not produce every vector in without all three elements in . If you added any further vector to that spanning set, though, the set would now have a redundant vector, as is already spanned. The set would no longer be a minimal spanning set in this sense, and so would cease to be a basis for .
Every finite-dimensional, nonzero[7] vector space containing infinitely many vectors has infinitely many bases (pp. 29-32). Each basis for an -dimensional vector space is a set containing vectors, where each vector is an ordered set containing numbers drawn from (p. 32).
3. Linear Maps
Intuitively, a linear map is a function that translates addition and multiplication between two vector spaces.
Formally, a linear map is a function from a vector space to a vector space (taking vectors and returning vectors) such that
for all ; all ; and all . Note that both are homomorphism properties: one for addition across vector spaces and one for multiplication across vector spaces! We’ll call the former relationship additivity, the latter, homogeneity.
The symbol stands for the set of all the linear maps from to .[8]
Some example linear maps (pp. 38-9) include:
When the vector spaces are specifically the set of all real-valued polynomials :[9]
translating between and .
As linear maps are functions, they can be composed when they have matching domains and co-domains, giving us our notion of products between linear maps.
The kernel of a linear map is the subset containing all and only the vectors that maps to . Note that linear maps can only “get rid” of vectors by shrinking them down all the way, i.e., by sending them to . If a function between vector spaces simply sent everything to a nonzero vector, it would violate the linear map axioms! All kernels are subspaces of (p. 42). A linear map is injective whenever (p. 43).
The image of is the subset of covered by some . All images are subspaces of (p. 44). A linear map is obviously surjective whenever .
The Matrix of a Linear Map
A matrix is an array of numbers, with rows and columns:
(Matrices are a generalization of vectors into the horizontal dimension, and vectors can be thought of as skinny -by- matrices.)
Let . Suppose that is a basis of and is a basis of . For each , we can write uniquely as a linear combination of the ’s:
where for . The scalars completely determine the linear map because a linear map is determined by its values on a basis. The -by- matrix formed by the ’s is called the matrix of with respect to the bases and ; we denote it by
…
If you think of elements of as columns of numbers, then you can think of the th column of as applied to the th basis vector (pp. 48-9; notation converted to our own.)
The vector , with matrix multiplication on the right side of the equation (pp. 53-4).
4. Polynomials
5. Eigenvalues and Eigenvectors
We now begin our study of operator theory!
Any vector space we discuss from here on out will be neither the zero vector space nor an infinite-dimensional vector space.
Operators are linear maps from to itself. Notationally, .
We call a subspace invariant under if, for all ,
Let now specifically be a one-dimensional subspace of such that, fixing any nonzero ,
Then is a one-dimensional subspace of , and every one-dimensional subspace of is of this form. If and the subspace defined by is invariant under , then must be in , and hence there must be a scalar such that . Conversely, if is a nonzero vector in such that for some , then the subspace S defined by is a one-dimensional subspace of invariant under (p. 77; notation converted).
In the above equation , the scalar is called an eigenvalue of , and the corresponding vector is called an eigenvector of .
Because is equivalent to ,[10] we see that the set of eigenvectors of corresponding to equals . In particular, the set of eigenvectors of corresponding to is a subspace of .
[For example,] if , then has only one eigenvalue, namely, , and every vector is an eigenvector for this eigenvalue (p. 77-8; notation converted).
Polynomials Applied to Operators
The main reason that a richer theory exists for operators… than for linear maps is that operators can be raised to powers (p. 80).
An operator raised to a power is just that operator composed with itself times.
Because we have a notion of functional products, functional sums, and now operators raised to powers, we can now construct arbitrary polynomials with operators as the variables!
Upper-Triangular Matrices
A square matrix is an matrix.
An upper-triangular matrix is a square matrix for which all entries under the principal diagonal equal .
Every operator on a finite-dimensional, nonzero, complex vector space has an eigenvalue (p. 81).
…
Suppose is a complex vector space and . Then has an upper-triangular matrix with respect to some basis of (p. 84; notation converted).
…
Suppose has an upper-triangular matrix with respect to some basis of . Then the eigenvalues of consist precisely of the entries on the diagonal of that upper-triangular matrix (p. 86; notation converted).
Diagonal Matrices
A diagonal matrix is a square matrix for which all entries off the principal diagonal equal .
If has [11] distinct eigenvalues, then has a diagonal matrix with respect to some basis of (p. 88; notation converted).
6. Inner-Product Spaces
For , the dot product of and , denoted by , is defined by
where is the th entry in , and similarly for and (p. 98; notation converted).
Inner products are just a generalization of dot products to arbitrary vector spaces . (With some finagling, both dot products and inner products generally can be interpreted as linear maps.) An inner-product space is an ordered set containing a vector space and an inner product on it.
Intuitively, the norm of a vector is the length of that vector, interpreted as a ray, from the origin to its tip. More formally, the norm of a vector in an inner-product space is defined to be the square root of the inner product of that vector with itself:
Note that this looks just like , the Pythagorean theorem for the sides of a right triangle in Euclidian space. That’s because other inner products on other vector spaces are meant to allow for a generalization of the Pythagorean theorem in those vector spaces!
Intuitively, two vectors are orthogonal when they’re perpendicular. Formally, two vectors are called orthogonal when their inner product is . With the opposite and adjacent sides of the right unit triangle in the vector space ,
“It’s all just right triangles, dude.”
7. Operators on Inner-Product Spaces
Complex Spectral Theorem: Suppose that is a complex inner-product space and . Then has an orthonormal[12] basis consisting of eigenvectors of if and only if is normal[13] (p. 133; notation converted).
…
Real Spectral Theorem: Suppose that is a real inner-product space and . Then has an orthonormal basis consisting of eigenvectors of if and only if is self-adjoint[14] (p. 136; notation converted).
…
In other words, to multiply together two block diagonal matrices[15] (with the same size blocks), just multiply together the corresponding entries on the diagonal, as with diagonal matrices.
A diagonal matrix is a special case of a block diagonal matrix where each block has size -by-. At the other extreme, every square matrix is a block diagonal matrix because we can take the first (and only) block to be the entire matrix… The smaller the blocks, the nicer the operator (in the vague sense that the matrix then contains more ’s). The nicest situation is to have an orthonormal basis that gives a diagonal matrix (p. 143).
The singular values of are the eigenvalues of , where each eigenvalue is repeated times (p. 155).
Every operator on has a diagonal matrix with respect to some orthonormal bases of , provided that we are permitted to use two different bases rather than a single basis as customary when working with operators (p. 157).
8. Operators on Complex Vector Spaces
9. Operators on Real Vector Spaces
We have defined eigenvalues of operators; now we need to extend that notion to square matrices. Suppose is an -by- matrix with entries in . A number is called an eigenvalue of if there exists a nonzero -by- matrix such that
For example, is an eigenvalue of because
(p. 194).
…
Suppose and is the matrix of with respect to some basis of . Then the eigenvalues of are the same as the eigenvalues of (p. 194; notation converted).
…
Cayley-Hamilton Theorem: Suppose is a real vector space and . Let denote the characteristic polynomial[16] of . Then (p. 207; notation converted).
The Cayley-Hamilton theorem also holds on complex vector spaces generally (p. 173).
10. Trace and Determinant
The matrix of an operator depends on a choice of basis of . Two different bases of may give different matrices of (p. 214; notation converted).
Intuitively, the determinant of an operator is the change in volume effects. The determinant is negative when the operator flips all the vectors “inverts the volume” it works on.
If is a complex vector space, then equals the product of the eigenvalues of , counting multiplicity… Recall that if is a complex vector space, then there is a basis of with respect to which has an upper-triangular matrix… thus equals the product of the diagonal entries of the matrix (p. 222; notation converted).
- ^
Intuitively, a a homomorphism is a function showing how the operation of vector addition can be translated from one vector space into another and back.
More precisely, a homomorphism is a function (here, from a vector space to a vector space ) such that
with and .
The vector addition symbol on the left side of the equality, inside the function, is defined in , and the addition symbol on the right side of the equality, between the function values, is defined in .
- ^
Vectors can be interpreted geometrically as rays from the origin out to points in a space. Vectors can also be understood algebraically as ordered sets of numbers (with each number representing a coordinate over in the ray interpretation).
As far as notation goes, we’ll use variables with arrows for vectors, lowercase variables for numbers, and capital variables for other larger mathematical structures, such as vector spaces.
- ^
In this book, that field will be either the reals or the complexes .
- ^
Take note of how homomorphism-ish the below distributive relationships are!
- ^
Vectors are conventionally written vertically. But each vector has a transpose , where the vector is written out horizontally instead.
So we’ll use vector transposes to stay in line with conventional notation while not writing out those giant vertical vectors everywhere.
- ^
One deep idea out of mathematics is that the dimensionality of a system is just the number of variables in that system that can vary independently of every other variable. You live in -dimensional space because you can vary your horizontal, vertical, and -dimensional position without necessarily changing your position in the other two spatial dimensions by doing so.
- ^
Note that the set , where is a vector containing only any number of times, satisfies the vector space axioms!
establishes closure under addition, existence of an additive identity, existence of an additive inverse for all vectors, additive commutativity, and additive associativity. Letting the field be the reals with
establishes closure under multiplication, multiplicative associativity, and the existence of a multiplicative identity. Finally,
establishes distributivity.
Any such vector space has just one basis, . Intuitively, since you live at the origin, the origin is already spanned by no vectors at all—i.e., the empty set of vectors. Any additional vector would be redundant, so no other sets constitute bases for .
- ^
In math, the bigger and/or fancier the symbol, the bigger the set or class that symbol usually stands for.
- ^
A vector can stand for a polynomial by containing all the coefficients in the polynomial, coefficients ordered by the degree of each coefficient’s monomial.
- ^
This is addition of functions, , on the left side of the equation. is the identity function.
- ^
is the dimension of , formalized as the number of vectors in any basis of .
- ^
Intuitively, orthonormal sets are nice sets of vectors like , where each vector has length one and is pointing out in a separate dimension.
More precisely, a set of vectors is called orthonormal when its elements are pairwise orthogonal and each vector has a norm of . We will especially care about orthonormal bases, like the set above with respect to .
- ^
The adjoint of a linear map is a linear map such that the inner product of and equals the inner product of and for all and .
Remember that inner products aren’t generally commutative, so the order of arguments matters. Adjoints feel very anticommutative.
An operator on an inner-product space is called normal when
- ^
An operator is self-adjoint when .
- ^
A block diagonal matrix is a square matrix of the form
where are square matrices lying along the diagonal and all the other entries of the matrix equal (p. 142).
- ^
Suppose is a complex vector space and . Let denote the distinct eigenvalues of . Let denote the multiplicity of as an eigenvalue of . The polynomial
is called the characteristic polynomial of . Note that the degree of the characteristic polynomial of equals … the roots of the characteristic polynomial of equal the eigenvalues of (p. 172; notation converted).
Characteristic polynomials can also be defined for real vector spaces, though the reals are a little less well behaved as vector spaces than the complexes.
Suppose is a real vector space and . With respect to some basis of , has a block upper-triangular matrix [any entries acceptable above ] of the form
where each is a -by- or a -by- matrix with no eigenvalues. We define the characteristic polynomial of to be the product of the characteristic polynomials of . Explicitly, for each , define by
Then the characteristic polynomial of is
Clearly the characteristic polynomial of has degree … The characteristic polynomial of depends only on and not on the choice of a particular basis (p. 206; notation converted).
I think this is a typo and should be “eigenvalues” instead of “eigenvectors”?
This could be misleading. E.g. the operator f(v) := -v that literally just flips all vectors has determinant (-1)^n, where n is the dimension of the space it’s working on. The sign of the determinant tells you whether an operator flips the orientation of volumes, it can’t tell you anything about what it does to individual vectors.
(Regarding “orientation of volumes”: in the 2D case, think of R^2 as a sheet of paper, then f(v) := -v is just a 180 degree rotation, so the same side stays up, and the determinant is positive. In contrast, flipping along an axis requires turning over the paper, so negative determinant. Unfortunately this can’t really be visualized the same way in 3D, so then you have to think about ordered bases.)
Thanks—right on both counts! Post amended.
This is great! I’ll thread a few nits under this comment
should read
f(→s)∈S
I think such S can not exist in most cases, and it should instead read ‘… for some →v∈V…’
The expression for S is describing the span of the vector →v, so certainly if V is more than one-dimensional, if some subspace S has this property for all →v∈V then it has this property for linearly independent vectors in V, which is a contradiction.
The definition of matrix (“the basis maps to:”) ought to come after the “uniquely determines the linear map” that justifies it.
For interpreting v as a slim matrix, I would use bra-ket notation: |v> for the function of type V ← R, <v| for the function whose type is the dual R ← V. Then <v|v> has type R ← R (and corresponds to multiplication by a scalar) and |v><v| has type V ← V.
An inner product just maps |v> to <v|. (Though I don’t quite see what the symmetry is for.)
Mapping a point cloud through a linear map thins it by a factor of the determinant; this generalizes to smooth maps, since they are locally linear.