The difference between “supervised” and “unsupervised” learning just refers to whether model performance is judged during training by how well the model produces the “correct answer” for each input. Supervised learning trains a model using labeled data, while unsupervised learning only learns from unlabeled data. Semi-supervised learning, as you might guess, learns from both labeled and unlabeled data. All of these methods learn to extract features from the input data that are informative as to what part of the data distribution the input comes from. They only differ in what drives the discovery of these features and in what these features are used for in deployment.
Supervised learning is all about building a function that maps inputs to outputs, where the precise internal structure of the function is not known ahead of time. Crucially, the function is learned by presenting the model with known, correct pairs of inputs and outputs: the inputs are passed to the model, it generates its own (initially error-ridden) outputs, and its parameters are updated to minimize the error against the known outputs. The hope is that by learning to make correct predictions (mapping inputs to outputs) on the training data, the model will generalize and produce correct predictions on new, previously unseen data.
For example, a neural network learning to classify images into one of several categories (e.g., CIFAR-10, CIFAR-100, or ImageNet) will be given thousands of images paired with their known category labels. It has to figure out features of the input images on its own (e.g., patterns of pixels, patterns of patterns, etc.) that are useful for predicting the correct mappings, and it discovers these through gradient descent over the space of function parameters. When deployed, the neural network simply acts like a function, generating output labels for new input images. Of course, neural networks are general enough to learn any type of function mapping, not just images to labels, since their nonlinearities make them universal function approximators.
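To make that loop concrete, here’s a minimal sketch in PyTorch (the library, architecture, and hyperparameters are my own choices for illustration, not anything prescribed above): labeled CIFAR-10 image/label pairs are fed to a small classifier, which produces its own predictions and has its parameters nudged by gradient descent to reduce the classification error.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Supervised learning: labeled (image, label) pairs drive the parameter updates.
transform = transforms.ToTensor()
train_set = datasets.CIFAR10(root="data", train=True, download=True, transform=transform)
loader = DataLoader(train_set, batch_size=128, shuffle=True)

# A small convolutional classifier; the internal structure of the learned
# function is not specified ahead of time, only its input/output shapes.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 10),   # 10 CIFAR-10 categories
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for images, labels in loader:
        logits = model(images)            # the model's own (initially error-ridden) outputs
        loss = loss_fn(logits, labels)    # how far they are from the "correct answers"
        optimizer.zero_grad()
        loss.backward()                   # gradients over the space of function parameters
        optimizer.step()
```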
Unsupervised learning, on the other hand, is all about uncovering the structure of some distribution of data rather than learning a function mapping. This could involve clustering, such as using expectation-maximization to find the means and covariances of some multimodal distribution. Importantly, there is no longer a “right answer” for the model to give. Instead, it’s all about reducing the dimensionality of the data by taking advantage of statistical and structural regularities. For example, there are (256^3)^(W*H) possible W x H images with 8-bit RGB pixels, but only a very low-dimensional manifold within this space contains what we would interpret as faces. Unsupervised learning essentially figures out regularities across the set of all training images to create an informative parameterization of this manifold; knowing your coordinates in “facial image space” then gives you all the information you need to reconstruct the original image.
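As a tiny illustration of the clustering case, here’s a sketch using scikit-learn’s GaussianMixture, which is fit by expectation-maximization; the two-mode toy data and all the parameter choices are my own assumptions, purely for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy multimodal data: two Gaussian "modes" with different means and spreads.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(500, 2))
cluster_b = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(500, 2))
data = np.vstack([cluster_a, cluster_b])

# GaussianMixture.fit runs expectation-maximization: there is no "right answer"
# per point, only the structure of the data distribution to be recovered.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(data)

print("Estimated means:\n", gmm.means_)
print("Estimated covariances:\n", gmm.covariances_)

# Each point can now be summarized by its soft assignment over the two
# components rather than by its raw coordinates.
print("Soft assignment for the first point:", gmm.predict_proba(data[:1]))
```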
A neural network example would be the variational autoencoder (VAE), which simply learns to replicate its inputs as outputs. That sounds trivial (“Isn’t that just the identity function?”), except that these models pass their inputs (such as images) through an information bottleneck (the encoder) that extracts useful features as a latent-space representation (mean and variance vectors, representing position and uncertainty on the data manifold), which can then be used to regenerate the data with a learned generative model (the decoder). A GAN (generative adversarial network) approaches generation from the other direction: a decoder-like generator maps latent codes to data, while a discriminator tries to tell its outputs apart from real samples, so the generator learns to produce data whose structure and distribution are indistinguishable from those of real data.
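Here’s what a stripped-down VAE might look like in PyTorch. The architecture sizes, the 784-dimensional (MNIST-style) flattened input, and the loss weighting are my assumptions for illustration; the point is just the shape of the idea: encoder, bottleneck (mean and log-variance), sampled latent point, decoder, reconstruction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal variational autoencoder: encoder -> bottleneck -> decoder."""

    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        # Encoder: squeezes the input through an information bottleneck,
        # producing a position (mu) and uncertainty (log-variance) on the manifold.
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        # Decoder: a learned generative model that maps latent coordinates
        # back to a reconstruction of the original input.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample a latent point while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction error ("replicate the input") plus a KL term that keeps
    # the latent codes close to a standard normal prior.
    recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

# Usage: x would be a batch of flattened images with values in [0, 1].
model = TinyVAE()
x = torch.rand(32, 784)
recon, mu, logvar = model(x)
loss = vae_loss(x, recon, mu, logvar)
loss.backward()
```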
“Supervised learning trains a model using labeled data, while unsupervised learning only learns from unlabeled data.”
This is one of the sentences I feel confused about. Labeling seems, in some sense, like “cheating”, or like trading away generality for efficiency.
But learning to guess the teacher’s password is exactly what supervised learning is all about. It’s such a popular technique precisely because it converges so quickly on generating correct answers, even though that speed often comes at the cost of latching onto spurious correlations. That’s why these models tend to reproduce human biases and prejudices present in the training data. Learning the true causal structure of reality is a much harder problem.
Might depend on the labeling, though. Like, a good math class is well-structured data intended to kick-start generalization for the learner.