How to pick your categories
Note: this is intended to be a friendly math post, so apologies to anyone for whom this is all old hat. I’m deliberately staying elementary for the benefit of people who are new to the ideas. There are no proofs: this is long enough as it is.
Related: Where to Draw the Boundary, The Cluster Structure of Thingspace, Disguised Queries.
Here’s a rather deep problem in philosophy: how do we come up with categories? What’s the difference between a horror movie and a science fiction movie? Or the difference between a bird and a mammal? Are there such things as “natural kinds,” or are all such ideas arbitrary?
We can frame this in a slightly more mathematical way as follows. Objects in real life (animals, moving pictures, etc.) are enormously complicated and have many features and properties. You can think of this as a very high-dimensional space, one dimension for each property, and each object having a value corresponding to each property. A grayscale picture, for example, has a brightness value for each pixel. A text document has a count for every word (the word “flamingo” might have been used 7 times, for instance). A multiple-choice questionnaire has an answer for each question. Each object is a point in a high-dimensional featurespace. To decide which objects are similar to each other, we want to measure how close points are in featurespace. For example, two pictures that only differ at one pixel should turn out to be similar.
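To make the featurespace idea concrete, here is a minimal sketch in Python (with a made-up three-document corpus) of turning text documents into word-count vectors and checking which points lie close together:

```python
from collections import Counter
import math

# Toy corpus: each document becomes a point in featurespace,
# with one dimension per word in the shared vocabulary.
docs = [
    "the flamingo stood in the water",
    "the flamingo ate in the water",
    "stock prices fell sharply today",
]

counts = [Counter(doc.split()) for doc in docs]
vocab = sorted(set(word for c in counts for word in c))

# Represent each document as a vector of word counts.
vectors = [[c[word] for word in vocab] for c in counts]

def euclidean(u, v):
    """Straight-line distance between two points in featurespace."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# The two flamingo documents come out much closer to each other
# than either is to the stock-market document.
print(euclidean(vectors[0], vectors[1]))  # small
print(euclidean(vectors[0], vectors[2]))  # larger
```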
We could then start to form categories if the objects form empirical clusters in featurespace. If some animals have wings and hollow bones and feathers, and some animals have none of those things but give milk and bear live young, it makes sense to distinguish birds from mammals. If empirical clusters actually exist, then there’s nothing arbitrary about the choice of categories—the categories are appropriate to the data!
There are a number of mathematical techniques for assigning categories; all of them are basically attacking the same problem, and in principle should all agree with each other and identify the “right” categories. But in practice they have different strengths and weaknesses, in computational efficiency, robustness to noise, and ability to classify accurately. This field is incredibly useful—this is how computers do image and speech recognition, this is how natural language processing works, this is how they sequence your DNA. It also, I hope, will yield insights into how people think and perceive.
Clustering techniques
These techniques attempt to directly find clusters in observations. A common example is the K-means algorithm. The goal here is, given a set of observations x1...xn, to partition them into k sets so as to minimize the within-cluster sum of squared differences:
argmin_S ∑_{i=1}^{k} ∑_{x ∈ S_i} ||x − m_i||²

where m_i is the mean of the points in cluster S_i.
The standard algorithm is to pick k means randomly, anywhere, assign points to the cluster with the closest mean, and then assign the new mean of each cluster to be the centroid (average) of the points in the cluster. Then we iterate again, possibly assigning different points to different clusters with each iterative step. This is usually very fast but can be slow in the worst-case scenario.
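For illustration, here is a minimal, unoptimized sketch of that standard algorithm (often called Lloyd’s algorithm) in Python with NumPy; the data and the choice of k are made up for the example:

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """Minimal Lloyd's algorithm: alternately assign points to the nearest
    mean, then recompute each mean as the centroid of its cluster."""
    rng = np.random.default_rng(seed)
    # Start with k means chosen randomly from among the data points.
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster with the closest mean.
        distances = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each mean becomes the centroid of its cluster.
        new_means = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
            for i in range(k)
        ])
        if np.allclose(new_means, means):
            break  # assignments have stabilized
        means = new_means
    return labels, means

# Two made-up blobs in the plane; k = 2 should separate them cleanly.
rng = np.random.default_rng(1)
data = np.vstack([rng.standard_normal((50, 2)), rng.standard_normal((50, 2)) + 5])
labels, means = k_means(data, k=2)
```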
There are many, many clustering algorithms. They vary in the choice of distance metric (it doesn’t have to be Euclidean, we could take taxicab distances or Hamming distances or something else). There’s also something called hierarchical clustering, which outputs a tree of clusters.
Linear dimensionality reduction techniques
Here’s another way to think about this problem: perhaps, of all the possible features that could distinguish two objects, most of the variation is in only a few features. You have a high-dimensional feature space, but in fact all the points are lying in a much lower-dimensional space. (Maybe, for instance, once you’ve identified what color a flower is, how big it is, and how many petals it has, you’ve almost completely identified the flower.) We’d like to know which coordinate axes explain the data well.
There are a number of methods for doing this—I’ll mention a classic one, singular value decomposition (SVD). For any m x n matrix M, we have a factorization of the form
M = U Σ V*
where U and V are orthogonal and Σ is diagonal. The columns of U are the eigenvectors of MM*, the columns of V are the eigenvectors of M*M, and the elements of Σ (called singular values) are the square roots of the nonzero eigenvalues of M*M (equivalently, of MM*, which has the same nonzero eigenvalues).
Now, if we want a low-rank approximation of M (that is, every point in the approximation lies on a low-dimensional hyperplane), all we have to do is chop off Σ, keeping only the largest k singular values and zeroing out the rest to get Σ′. Intuitively, these are the dimensions that account for most of the variation in the data matrix M. The approximate matrix M′ = U Σ′ V* can be shown to be the closest possible rank-k approximation to M (in the Frobenius norm, for example).
In other words, if you have high-dimensional data, and you suspect that only two coordinates explain all the data—maybe you have a questionnaire, and you think that age and sex explain all the patterns in answering—something like SVD can identify which those coordinates are. In a sense, dimensionality reduction can identify the factors that are worth looking at.
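As a sketch of how this looks in practice (in Python with NumPy, on made-up questionnaire-style data that secretly depends on only two factors), you can compute a rank-k approximation directly from the SVD:

```python
import numpy as np

def low_rank_approximation(M, k):
    """Keep only the k largest singular values of M and rebuild the matrix.
    By the Eckart-Young theorem this is the closest rank-k matrix to M."""
    U, singular_values, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * singular_values[:k]) @ Vt[:k, :]

# Made-up "questionnaire" data that secretly depends on only two factors.
rng = np.random.default_rng(0)
factors = rng.standard_normal((100, 2))      # e.g. something like age and sex
loadings = rng.standard_normal((2, 20))      # how each question depends on them
answers = factors @ loadings + 0.01 * rng.standard_normal((100, 20))

approx = low_rank_approximation(answers, k=2)
print(np.linalg.norm(answers - approx))      # small: rank 2 explains almost everything
```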
Nonlinear dimensionality reduction techniques
An algorithm like SVD is linear—the approximation it spits out lies entirely on a vector subspace of your original vector space (a line, a plane, a hyperplane of some dimension.) Sometimes, though, that’s a bad idea. What if you have a cloud of data that actually lies on a circle? Or some other curvy shape? SVD will get it wrong.
One interesting tweak on this process is manifold learning—if we suspect that the data lies on a low-dimensional but possibly curvy shape, we try to identify the manifold, just as in SVD we tried to identify the subspace. There are a lot of algorithms for doing this. One of my favorites is the Laplacian eigenmap.
Here, we look at each data point as a node on a graph; if we have lots of data (and we usually do) the graph is a sort of mesh approximation for the smooth manifold it lies on. We construct a sparse, weighted adjacency matrix: it’s N x N, where N is the number of data points. Matrix elements are zero if they correspond to two points that are far apart, but if they correspond to nearby points we put the heat kernel weight e^(−||x−y||²/t). From this weight matrix we form the graph Laplacian and look at its eigenvectors, using the eigenvectors with the smallest nonzero eigenvalues as coordinates to embed into Euclidean space. The reason this works, roughly, is that we’re approximating the Laplace operator on the manifold: two points are close together if a diffusion moving along the graph would travel between them quickly. It’s a way of mapping the graph into a lower-dimensional space such that points that are close on the graph are close in the embedding.
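Here is a rough sketch of that procedure in Python with NumPy/SciPy; the neighborhood size and heat-kernel parameter are arbitrary choices for the example, and a real implementation (like Belkin and Niyogi’s) would use sparse matrices:

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(points, n_components=2, t=1.0, n_neighbors=10):
    """Rough sketch of a Laplacian eigenmap: heat-kernel weights between
    nearby points, graph Laplacian, then the eigenvectors with the
    smallest nonzero eigenvalues become the embedding coordinates."""
    n = len(points)
    # Pairwise squared distances.
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    # Heat-kernel weights, kept only for each point's nearest neighbors.
    W = np.exp(-sq_dists / t)
    np.fill_diagonal(W, 0.0)
    for i in range(n):
        far = np.argsort(sq_dists[i])[n_neighbors + 1:]
        W[i, far] = 0.0
    W = np.maximum(W, W.T)          # keep the graph symmetric
    D = np.diag(W.sum(axis=1))      # degree matrix
    L = D - W                       # graph Laplacian
    # Generalized eigenproblem L v = lambda D v; skip the constant eigenvector.
    eigenvalues, eigenvectors = eigh(L, D)
    return eigenvectors[:, 1:n_components + 1]

# Made-up example: 200 points on a circle; points that are neighbors on the
# circle stay close together in the embedding.
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
embedding = laplacian_eigenmap(circle, n_components=2)
```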
Good nonlinear dimensionality reduction techniques can identify data that lies on curvy shapes, like circles and spirals and the “swiss roll” (a rolled-up plane) much better than linear dimensionality reduction techniques.
What’s the moral?
Once upon a time, the search engine Yahoo tried to categorize all the sites on the web according to a pre-set classification system. Science, sports, entertainment, and so on. It was phenomenally unpopular. The content that grew online often didn’t fit the categories. People didn’t want to surf the internet with the equivalent of the Dewey Decimal System.
These days, to some degree, we know better. Amazon.com doesn’t recommend books based on pre-set categories, giving you a horror book if you liked a horror book before. It recommends a book based on the choices of other customers who liked the same books you like. Evidently, they have a big adjacency matrix somewhere, one column for every customer and one row for every purchase, and quite possibly they’re running some sort of a graph diffusion on it. They let the categories emerge organically from the data. If a new genre is emerging, they don’t have to scurry around trying to add a new label for it; it’ll show up automatically.
This suggests a sort of rationalist discipline: categories should always be organic. Humans like to categorize, and categories can be very useful. But not every set of categories is efficient at describing the variety of actual observations. Biologists used to have a kingdom called Monera, consisting of the prokaryotes (single-celled organisms without a nucleus); it was a horrible grab bag, because its members turned out to be very different from each other genetically. After analyzing genomes, they decided there were actually three domains, Bacteria, Archaea, and Eukarya (the only one that includes multicellular life). In a real, non-arbitrary way, this is a better kind of categorization, and Monera was a bad category.
Sometimes it seems that researchers don’t always pay attention to the problem of choosing categories and axes well. For example, I once saw a study of autism that did the following: created a questionnaire that rated the user’s “empathizing” and “systematizing” qualities, found that autistics were less “empathizing” and more “systematizing” than non-autistics, and concluded that autism was defined by more systematizing and less empathizing. This is a classic example of privileging one axis over others—what if autistics and non-autistics also differ in some other way? How do you know that you’ve chosen an efficient way to define that category? Wouldn’t you have to go look?
If you say “There are two kinds of people in the world,” but if you look around and lots of people don’t fit your binary, then maybe your binary is bad. If you want to know whether your set of categories is good, go look—see if the data actually clusters that way. There’s still a lot of debate about which mathematical techniques are best for defining categories, but it is a field where science has actually made progress. It’s not all arbitrary. When you play Twenty Questions with the universe, some questions are more useful than others.
References:
Wikipedia is generally very good on this subject, and the wiki page on singular value decomposition contains the proof that it actually works.
This paper by Mikhail Belkin and Partha Niyogi does a much better job of explaining Laplacian eigenmaps. Some other nonlinear dimensionality reduction techniques: Isomap, Locally Linear Embedding, and diffusion maps.
The descriptive math part was very good, thanks—and that’s why I resisted downvoting the post. My problem is that the conclusion omits the hugely important factor that categories are useful for specific goals, and the kind of techniques you are suggesting (essentially unsupervised techniques) are context-free.
E.g. is a dead cow more similar to a dead (fixed from ‘live’) horse or to a live cow? (It clearly depends what you want to do with it)
That’s a good point.
I tend to find techniques attractive when they’re generalizable, and context-free, unsupervised techniques fit the bill. You can automate them. You can apply them to a range of projects. But you’re right—sometimes specific knowledge about a specific application matters, and you can’t generalize it away.
The two aren’t mutually exclusive, of course. You can use specific knowledge about a particular problem to make your machine learning methods work better, sometimes.
I read a paper the other day about predicting the targets of a particular type of small nucleolar RNA, which are an important part of the machinery that regulates gene expression. One of the methods they used was to run an SVM classifier on a number of features about the RNA in question. SVM classifiers are one of those nice general-purpose easily-automated methods, but the authors used their knowledge of the specific problem to pick out what features it would use for its classification. Things like the length of particular parts of the RNA—stuff that would occur to molecular biologists, but could be prohibitively expensive for a purely automatic machine learning algorithm to discover if you just gave it all the relevant data.
(More bio nerdery: they combined this with a fast approximation of the electrostatic forces at work, and ended up getting remarkably good accuracy and speed. The paper is here, if anyone’s interested.)
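For anyone unfamiliar with that kind of pipeline, here’s a purely illustrative sketch with scikit-learn; the features and labels are invented stand-ins, not the ones from the paper:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Invented, hand-picked features (stand-ins for things like the length
# of a particular RNA region) rather than raw sequence data.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 3))            # e.g. region lengths, GC content, ...
labels = (features[:, 0] + 0.5 * features[:, 1] > 0).astype(int)  # fake "is a target" label

# A general-purpose SVM classifier on top of the expert-chosen features.
classifier = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print(cross_val_score(classifier, features, labels, cv=5).mean())
```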
Belatedly, I remembered a relevant tidbit of wisdom I once got from a math professor.
When a theorist comes up with a new algorithm, it’s not going to outperform existing algorithms used commercially in the “real world.” Not even if, in principle, the new algorithm is more elegant or faster or whatever. Why? Because in the real world, you don’t just take a general-purpose algorithm off the page, you optimize the hell out of it. Engineers who work with airplanes will jimmy their algorithms to accommodate all the practical “common knowledge” about airplanes. A mere mathematician who doesn’t know anything about airplanes can’t compete with that.
If you’re a theorist trying to come up with a better general method, your goal is to give evidence that your algorithm will do better than the existing one after you optimize the hell out of both of them equally.
Your example doesn’t quite make sense to me. Did you mean “is a dead cow more similar to a dead horse or to a live cow”, or …?
Very quotable
And it brings to mind decision trees, which are essentially an automated way of playing Twenty Questions with the universe. In order to avoid over-fitting your training data, once you’ve constructed a complete decision tree, you go back and prune it, removing questions that are below a certain threshold of usefulness.
The usual way you do this is, you look at the expected reduction in entropy from asking a particular question. If it doesn’t reduce the entropy much, don’t bother asking. If you know that an animal is a bird, you don’t gain much by asking “Is it an Emperor penguin?”. You would reduce the entropy in your pool of possible birds more by asking if it’s a songbird, or if its average adult wingspan is more than 10 cm.
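To make “expected reduction in entropy” (information gain) concrete, here’s a small sketch in Python with a made-up pool of birds; the penguin question turns out to gain less than the songbird question:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, answers):
    """Expected reduction in entropy from asking a yes/no question,
    where answers[i] is the question's answer for example i."""
    total = len(labels)
    split = {}
    for label, answer in zip(labels, answers):
        split.setdefault(answer, []).append(label)
    remaining = sum(len(group) / total * entropy(group) for group in split.values())
    return entropy(labels) - remaining

# Made-up pool of birds: mostly songbirds and hawks, one emperor penguin.
birds = ["songbird"] * 5 + ["hawk"] * 3 + ["penguin"] * 1 + ["duck"] * 3

# "Is it an emperor penguin?" splits off one bird; "is it a songbird?" splits off five.
is_penguin = [b == "penguin" for b in birds]
is_songbird = [b == "songbird" for b in birds]
print(information_gain(birds, is_penguin))   # small gain
print(information_gain(birds, is_songbird))  # larger gain
```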
SarahC’s quote is not only clever, but also supported by solid math and practical application.
Overall you did a great job explaining the mathematics of unsupervised categorization but you missed one point in your end-matter.
The initial Monera classification was not a bad category at the time because when it was created there wasn’t enough data to split out different subcategories. All the researchers had were a bunch of fuzzy spots wiggling under a microscope. You touched on this in the Amazon.com example. Just because the categories you have now are good for your current data doesn’t mean that they will remain the same with further data.
Fair enough.
This was Simon Baron-Cohen’s EQ-SQ research. I don’t remember exactly what they concluded.
Yep, that’s it. I’m a little nervous about dissing a famous researcher, but I did read the paper and it didn’t seem right. It was definitely phrased as if the correlation between autism and the results of various questionnaires vindicated modeling autism as an empathizing-systematizing spectrum. I’m not saying I disagree with that model personally (how would I know?), but that it’s not good enough justification to do it that way.
Of course, to be fair to Baron-Cohen, I’ve read one paper of his, not his entire body of work; if he fills in the gap elsewhere, more power to him. In that case, my example of “bad research” would be fictional (but still, I believe, bad).
Another related problem that might be worth talking about is overfitting. Deciding what is noise and what isn’t noise and the related problem of how finely grained categories should be can be tough.
Yes! Whole new issue! Multi-scale methods can help (I like this example a lot) but I think there’s still a little ad-hoc stuff hidden within the algorithm.
Very interesting points. I’m still trying to learn the math behind categorization myself.
Regarding the autism categorizations—good points. It’s also quite possible that autistics might score lower on the systematizing quotient than non-autistics in a different country/world. How could that happen? The questions on that systematizing quotient test were highly subject specific—some of the questions had to do with furniture, others had to do with time tables, others had to do with statistics. The type of person who scores high on systematizing would have to have broad interests. An autistic with exceptionally narrow interests would score very low on this (even though his interests could still have especially high intensity—the intensity and the narrowness both owing themselves to autism).
But it’s quite possible that an autistic person could have obsessions with entirely different domains that don’t appear on the systematizing quotient test—domains that were more salient in a different culture/world.
In another example, I’ll bring up this hypothesis: http://en.wikipedia.org/wiki/Differential_susceptibility_hypothesis http://www.theatlantic.com/magazine/archive/2009/12/the-science-of-success/7761/
So it’s entirely possible that someone with a particular genotype could exhibit one phenotype in one environment, and the exact opposite phenotype in the second environment. How would they then be classified? As according to their genotype? Well, maybe. But in America, the total scope of environmental variation is highly restricted (almost no one suffers from extreme starvation). Environmental variation could be significantly increased through extreme environmental circumstances, or even by cyborg technology. After we use this—how can we then classify people?
Here’s a post I once wrote on classification: http://forums.philosophyforums.com/threads/classification-theory-29482.html
One of my major points: Even the “Tree of Life” isn’t strictly a “tree of life”. Humans owe roughly 8% of their DNA to ancient retroviruses (IIRC). It’s entirely possible that in a world of increased viral activity, the “tree” would totally break down (in fact, there probably is no “tree” among the bacteria).
And of course, then if we implement cyborg technology (or artificial DNA) into the bacteria—it makes classification even more complicated. We could compare differences in letter groups in DNA. But what if the artificial genome had different molecules that made up a helix?
Possibly stupid thought, but rather than just applying these techniques to the whole of thingspace, shouldn’t one first do something like this: get a bunch of incoming data, create density map (by measuring at various resolutions, perhaps), look for regions with density significantly different from others (ie, first locate some clusters) and then apply these techniques in those regions rather than over the entire whole of the space to help learn the properties of those clusters?
My information might be out of date, but last time I checked they had one such matrix per department, supposedly with some timing information etc. So they won’t recommend a Bolt plush toy if you liked the Bolt DVD.
I’ll leave the moral of this story to you.
Awesome, thanks for posting this.
I’ve been working with SVDs a lot in my research recently, and have gotten interested in manifold learning as a result.
What are your data sets like, and how well have your manifold learning algorithms worked on them? I get the feeling that, for instance, Locally Linear Embedding wouldn’t do so well factoring out rotation in computer vision.
I spent some time learning about this, when I was dabbling with the Netflix contest. There was a fair bit of discussion on their forum regarding SVD & related algorithms. The winner used a “blended” approach, which weighted results from numerous different algorithms. The success of the algorithm was based on the RMSE of the predictions compared with a test set of data.
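For anyone who hasn’t seen it, here is a toy sketch of what “blending” scored by RMSE means mechanically; the predictions, weights, and ratings are entirely invented:

```python
import numpy as np

# Made-up predictions from three different algorithms on the same test set.
predictions = {
    "svd": np.array([3.1, 4.0, 2.5, 4.8]),
    "knn": np.array([3.4, 3.8, 2.9, 4.5]),
    "baseline": np.array([3.0, 3.5, 3.0, 4.0]),
}
actual = np.array([3.0, 4.0, 3.0, 5.0])

def rmse(predicted, actual):
    """Root-mean-square error against the held-out test ratings."""
    return np.sqrt(np.mean((predicted - actual) ** 2))

# A simple blend: a weighted average of the individual algorithms' predictions.
weights = {"svd": 0.5, "knn": 0.3, "baseline": 0.2}
blend = sum(w * predictions[name] for name, w in weights.items())

for name, p in predictions.items():
    print(name, rmse(p, actual))
print("blend", rmse(blend, actual))
```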
While the math is a little outside my current capabilities, I really appreciate this thread, because I’ve been working on the very beginning stages of a project that requires computational categorization algorithms, and you’ve given me a lot of good information, and perhaps more importantly, some new things to go and study.
Thanks!
How does one obtain a high-dimensional featurespace to begin with? Can one bootstrap from a one-dimensional space?
I can’t think of any satisfactory way to do this right now.
You shouldn’t want to have a high-dimensional space. High-dimensional spaces are hard to work with, it’s just that they come up often. You basically obtain one when you look at an object or concept or what have you, then think of everything you could measure about it and measure that.
Misha’s answer is almost always the right one, but you technically can project points into a higher-dimensional space using a kernel function. This comes up in Support Vector Machines where you’re trying to separate two classes of data points by drawing a hyperplane between them. If your data isn’t linearly separable, projecting it into a higher-dimensional space can sometimes help.
But most of the time, what you want to do is just measure everything you can think of, and let those measurements be your dimensions. When looking at rubes and bleggs, measure things like redness, blueness, roundedness, furriness, whatever you can think of. Each of those is one dimension. Before you know it, you’ve got a high-dimensional featurespace. Good luck dealing with it.
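As a toy illustration of the projection idea mentioned above (not what an SVM library actually does internally, since the kernel trick avoids computing the projection explicitly), here is a hypothetical feature map that makes non-separable 1D data linearly separable in 2D:

```python
import numpy as np

# One-dimensional data: the two classes are not linearly separable,
# because class 1 sits on both sides of class 0.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Project each point into 2D with the (hypothetical) feature map x -> (x, x^2).
projected = np.column_stack([x, x ** 2])

# In the projected space, the horizontal line x2 = 2.5 separates the classes:
# class 0 points have x^2 <= 1, class 1 points have x^2 >= 4.
predictions = (projected[:, 1] > 2.5).astype(int)
print(np.array_equal(predictions, y))  # True
```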