It’s obvious that we can’t just teach it the class of ‘unknown’.
Hm, this isn’t obvious to me. Many training sets do make use of an “unknown”/“none of the above” label, right? It’s not an elegant solution, but it is a solution.
It wouldn’t surprise me if including “none of the above” examples in your training set often causes learning algorithms to “do the right thing” and draw boundaries between parts of input space that can/cannot be labeled more precisely. If we’re penalizing model complexity, a model which does this is probably going to be simpler than a model which memorizes every “none of the above” training example. In any case, it would be easy to include some “none of the above” examples in your validation set that are very different from the “none of the above” examples in your training set and see if they’re classified correctly.
Even if we could compute thingspace volume, how would incorporating that into the loss function apply optimization pressure on the volume itself?
I found this a little bit confusing. It sounds like you’ve answered your own question. If you have a formula for thingspace volume, you can incorporate it into your loss function, and then minimizing your loss function will tend to compress thingspace volume as a side effect. Where’s the difficulty?
I happened to finish a mobile comment and then refreshed after you posted.
Hm, this isn’t obvious to me. Many training sets do make use of an “unknown”/“none of the above” label, right? It’s not an elegant solution, but it is a solution.
The problem is that this isn’t robust in practice, as far as I know. It will bound the size somewhat, but only in the dimensions we happened to specify. The space is then cleaved into three: dog, cat, unknown, with all of them extending (basically) indefinitely. There isn’t any pressure for ‘unknown’ to become the default—we just know that ‘dog’ is bounded by certain images. Better than nothing, but neither robust nor elegant (which makes me skeptical of its ability to robustly scale).
I found this a little bit confusing. It sounds like you’ve answered your own question. If you have a formula for thingspace volume, you can incorporate it into your loss function, and then minimizing your loss function will tend to compress thingspace volume as a side effect. Where’s the difficulty?
I think you’re right, and I may have come to this conclusion too quickly. I wrote out the equations, and it would in fact be able to optimize off of this. Further thought—I imagine the best way to represent it in the loss equation would be as some function of the proportion of thingspace occupied.
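To make the "proportion of thingspace occupied" idea concrete, here is a minimal sketch. It assumes (as in the hypersphere discussion later in the thread) that a class is a ball of radius `radius` sitting inside a bounding hypercube of side `box_side`. The names `occupied_fraction`, `penalty_gradient`, `total_loss`, and the `weight` parameter are made up for illustration. The point is only that the penalty has a positive derivative with respect to the radius, so gradient descent on the total loss does push the volume down.

```python
import math


def log_ball_volume(radius: float, n_dims: int) -> float:
    """Log-volume of an n-ball: V = pi^(n/2) / Gamma(n/2 + 1) * r^n."""
    return (
        (n_dims / 2) * math.log(math.pi)
        - math.lgamma(n_dims / 2 + 1)
        + n_dims * math.log(radius)
    )


def occupied_fraction(radius: float, n_dims: int, box_side: float) -> float:
    """Share of the bounding hypercube taken up by the class ball.

    Assumption: the ball fits inside the cube (radius <= box_side / 2).
    Logs are used so that high-dimensional volumes do not overflow.
    """
    log_box_volume = n_dims * math.log(box_side)
    return math.exp(log_ball_volume(radius, n_dims) - log_box_volume)


def penalty_gradient(radius: float, n_dims: int, box_side: float) -> float:
    """d(fraction)/d(radius) = n * fraction / radius, which is always positive."""
    return n_dims * occupied_fraction(radius, n_dims, box_side) / radius


def total_loss(classification_loss: float, radius: float, n_dims: int,
               box_side: float, weight: float = 1.0) -> float:
    """Ordinary classification loss plus a volume penalty (weight is hypothetical)."""
    return classification_loss + weight * occupied_fraction(radius, n_dims, box_side)


# Same classification loss, two radii: the smaller class region scores better.
small = total_loss(0.3, radius=0.5, n_dims=3, box_side=2.0)
large = total_loss(0.3, radius=1.0, n_dims=3, box_side=2.0)
print(small < large)
```

In a real model the region would not be a ball with a closed-form volume, so the fraction would have to be estimated rather than computed exactly.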
The problem is that this isn’t robust in practice, as far as I know. It will bound the size somewhat, but only in the dimensions we happened to specify.
Suppose we’re making use of an n-dimensional feature space for classification, and the desired representation for “dog” is an n-dimensional hypersphere centered at some coordinates in this n-dimensional space. Suppose we have a procedure for sampling uniformly at random from a finite region r surrounding the dog hypersphere. And suppose that we have a procedure for determining whether a sample from this region r is a dog or not (example: ask someone on Mechanical Turk). If so, it may be possible to offer statistical guarantees about the fidelity of our dog representation based on the number of points sampled, the probability of MTurkers misclassifying images, etc.
More informally: If you have enough samples labeled “none of the above”, and they have a sufficiently broad distribution, then perhaps you can be fairly sure there are “none of the above” examples which bound your notion of “dog” on any given dimension. This doesn’t really have a story for adversarially chosen examples though.
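A rough sketch of what such a statistical guarantee could look like, using Hoeffding's inequality as one possible tool (the comment itself doesn't name one). Everything in the toy setup is an assumption: the "true" dog region is a unit disc in 2-D, the learned region is a slightly larger disc, and the region r is a square. `correct_for_label_noise` assumes labelers flip the answer independently with a known probability below 0.5.

```python
import math
import random


def hoeffding_margin(n_samples: int, confidence: float = 0.95) -> float:
    """Half-width eps with P(|estimate - truth| >= eps) <= 1 - confidence.

    From Hoeffding: P(...) <= 2 * exp(-2 * n * eps^2) for i.i.d. values in [0, 1].
    """
    delta = 1.0 - confidence
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_samples))


def estimate_disagreement(classifier, oracle, sample_region, n_samples, rng):
    """Fraction of uniform samples from r where classifier and labeler differ."""
    misses = 0
    for _ in range(n_samples):
        x = sample_region(rng)
        if classifier(x) != oracle(x):
            misses += 1
    return misses / n_samples


def correct_for_label_noise(observed_rate: float, flip_prob: float) -> float:
    """Undo symmetric label noise: observed = true * (1 - 2e) + e, with e < 0.5."""
    return (observed_rate - flip_prob) / (1.0 - 2.0 * flip_prob)


# Hypothetical toy problem in a 2-D feature space.
def true_dog(x):
    return x[0] ** 2 + x[1] ** 2 <= 1.0   # what labelers would call a dog


def learned_dog(x):
    return x[0] ** 2 + x[1] ** 2 <= 1.44  # classifier's slightly-too-big region


def sample_square(rng):
    return (rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0))


rng = random.Random(0)
n = 20000
estimate = estimate_disagreement(learned_dog, true_dog, sample_square, n, rng)
margin = hoeffding_margin(n)
print(f"disagreement ~ {estimate:.3f} +/- {margin:.3f} at 95% confidence")
```

The bound only covers points drawn uniformly from r, which is why it says nothing about adversarially chosen examples.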
It’s true that 256 * 256 RGB images are quite a large feature space, but deep learning algorithms typically transform images into a much smaller feature space before doing classification. So for this random sampling idea to work, you might need a way to reverse engineer images based on randomly chosen coordinates in the smaller feature space. Then there’s the problem of ensuring that the transformation into the smaller feature space consistently behaves as expected. Maybe randomly sampling quite close to the “dog” centroid frequently produces non-dogs when reverse engineered.
I do think your idea of treating “none of the above” as a special class, and regularizing so as to minimize the size of every volume but the “none of the above” volume, is a very interesting one.
I guess a simpler, but probably less effective, change would be to tweak your loss function so as to penalize misclassifying a “none of the above” image as a “dog” more heavily than the reverse. Of course, you could also just have a very high decision threshold for actually treating an image as a dog, but tweaking the loss function might have advantages?
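For what it's worth, the two options can be put side by side in a small sketch. The function names and the 5:1 cost ratio are invented for illustration. If the model's probabilities are well calibrated, a cost ratio corresponds to a decision threshold, so the two ideas agree at decision time. The difference is that the weighted loss also changes what the model learns during training.

```python
import math


def weighted_log_loss(p_dog: float, is_dog: bool, false_dog_cost: float = 5.0) -> float:
    """Binary cross-entropy where calling a non-dog "dog" is penalized more.

    false_dog_cost is a hypothetical multiplier on the "none of the above" term.
    """
    eps = 1e-12
    p = min(max(p_dog, eps), 1.0 - eps)  # clamp to avoid log(0)
    if is_dog:
        return -math.log(p)
    return -false_dog_cost * math.log(1.0 - p)


def cost_based_threshold(false_dog_cost: float = 5.0, missed_dog_cost: float = 1.0) -> float:
    """Probability above which predicting "dog" has the lower expected cost.

    Predict "dog" when (1 - p) * false_dog_cost < p * missed_dog_cost.
    """
    return false_dog_cost / (false_dog_cost + missed_dog_cost)


# Being 90% sure a non-dog is a dog hurts more than being 10% sure about a real dog.
print(weighted_log_loss(0.9, is_dog=False) > weighted_log_loss(0.1, is_dog=True))
print(round(cost_based_threshold(), 3))
```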
Suppose we’re making use of an n-dimensional feature space for classification, and the desired representation for “dog” is an n-dimensional hypersphere centered at some coordinates in this n-dimensional space. Suppose we have a procedure for sampling uniformly at random from a finite region r surrounding the dog hypersphere.
I think that would be a good approach, and more immediately actionable than mine. The hard part is sampling uniformly at random from r, as that implies having already found the desired hypersphere. Also, it seems less resistant to adversarial examples.
It’s true that 256 * 256 RGB images are quite a large feature space, but deep learning algorithms typically transform images into a much smaller feature space before doing classification. So for this random sampling idea to work, you might need a way to reverse engineer images based on randomly chosen coordinates in the smaller feature space. There’s also the problem of ensuring that the transformation into the smaller feature space consistently behaves as expected.
Wow, I hadn’t thought of using latent spaces for this! If we could have a probabilistic guarantee that our latent space is volumetrically representative of the space it encodes, and if we had a way of accurately classifying points in the latent space itself (this seems to follow from how latent spaces are constructed), then we could randomly sample the continuous latent space in order to get a distribution over volumes! The remaining problem is how to sample an unbounded space accurately, but you could probably get around that by bounding the coordinates to some multiple of the farthest value seen thus far.
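A minimal sketch of the bounding trick, under heavy assumptions: latent points are plain lists of floats, `classify` is a stand-in for whatever classifier operates on latent coordinates, and the box is symmetric around the origin. None of these names come from a real library.

```python
import random


def bounding_limits(latent_points, multiple=2.0):
    """Per-dimension limit: `multiple` times the largest absolute coordinate seen."""
    n_dims = len(latent_points[0])
    return [
        multiple * max(abs(point[d]) for point in latent_points)
        for d in range(n_dims)
    ]


def class_volume_fractions(classify, limits, n_samples, rng):
    """Monte Carlo estimate of the share of the bounded box each class occupies."""
    counts = {}
    for _ in range(n_samples):
        z = [rng.uniform(-limit, limit) for limit in limits]
        label = classify(z)
        counts[label] = counts.get(label, 0) + 1
    return {label: count / n_samples for label, count in counts.items()}


# Hypothetical 2-D latent space: "dog" is the unit disc, everything else is unknown.
def toy_classify(z):
    return "dog" if z[0] ** 2 + z[1] ** 2 <= 1.0 else "unknown"


seen = [[1.0, -0.5], [-0.25, 2.0]]       # latent codes observed so far (made up)
limits = bounding_limits(seen, multiple=2.0)
fractions = class_volume_fractions(toy_classify, limits, 20000, random.Random(0))
print(limits, fractions)
```

The estimate depends on the chosen multiple, since a bigger box makes every bounded class look proportionally smaller. Comparing classes to each other is safer than reading any single fraction as absolute.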
I imagine that a latent space would let us do other cool things, like locate the edges of each class with some confidence (if they exist, and not unlike gradient descent). Therefore, we would have multiple ways of approximating volume. However, I don’t think I’m familiar enough with them yet to speak confidently and technically about the subject (my class is just reaching autoencoders now).
I guess a simpler, but probably less effective, change would be to tweak your loss function so as to penalize misclassifying a “none of the above” image as a “dog” more heavily than the reverse. Of course, you could also just have a very high decision threshold for actually treating an image as a dog, but tweaking the loss function might have advantages?
This seems similar to the r-sampling idea, but in a way which converges more quickly. I still think the issue is guaranteeing robustness, and finding that ideal r to begin with.
Meta question: how can I do strikethrough in my posts? Tildes don’t do the trick.
Well in reality, if you are paying people on Mechanical Turk to classify your images, maybe you don’t want to sample randomly anyhow. Instead you could select maximally informative data points to ask them about.
This potentially helps with the problem of discovering the bounding region. Suppose that one of the features in the transformed space corresponds to shagginess. And suppose that the shaggiest image in our training set is an image of a dog. A naive learning algorithm might conclude that an image full of shag must be a dog. To deal with this problem, we set shagginess to 10, generate an image, and send it to MTurk. If they think it’s a dog, we double our shagginess. If they think it’s not a dog, we halve our shagginess. (For this use case, it might be best to ask them to describe the image in a single word… if they’re choosing between dog/cat/other, they might select dog on the basis that it looks kinda like dog hair or something like that.) Eventually we get some idea of where the classification boundary should be through binary search.
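The doubling/halving procedure can be sketched as a bracketing phase followed by bisection. This assumes a single clean threshold along the feature (dog below it, not-dog above it), a positive scale, and labelers who answer consistently. `is_dog` stands in for "generate an image at this shagginess and ask MTurk", and the boundary value 37 is made up.

```python
def find_boundary(is_dog, start=10.0, tolerance=0.5, max_queries=60):
    """Locate the dog / not-dog boundary along one feature.

    Returns (estimated boundary, number of labeling queries used).
    """
    asked = 0

    def ask(value):
        nonlocal asked
        if asked >= max_queries:
            raise RuntimeError("query budget exhausted")
        asked += 1
        return is_dog(value)

    # Phase 1: double or halve until low is labeled "dog" and high is not.
    if ask(start):
        low, high = start, start * 2
        while ask(high):
            low, high = high, high * 2
    else:
        low, high = start / 2, start
        while not ask(low):
            low, high = low / 2, low

    # Phase 2: bisect the bracket until it is narrower than the tolerance.
    while high - low > tolerance:
        mid = (low + high) / 2
        if ask(mid):
            low = mid
        else:
            high = mid
    return (low + high) / 2, asked


# Hypothetical labeler: anything shaggier than 37 stops looking like a dog.
boundary, queries = find_boundary(lambda shagginess: shagginess < 37.0)
print(boundary, queries)
```

Each query here is a paid labeling task, so the query count is the cost. With noisy labelers you would need to repeat each question or use a noise-tolerant search instead.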
I’ll bet you could do some math to determine how to get the strongest statistical guarantees with the minimum amount of money spent on MTurk too.
I imagine that a latent space would let us do other cool things, like locate the edges of each class with some confidence
Yep. If the dog is represented using a convex polytope instead of a sphere, you might even reverse engineer the corners of your current classifier region, and then display them all to the user to show how expansive the classifier’s notion of “dog” is. But the map is not the territory: It’s possible that in some cases, the shape the user wants is actually concave.
However, I don’t think I’m familiar enough with them yet to speak confidently and technically about the subject (my class is just reaching autoencoders now).
I’m a deep learning noob too. I’m just about finished with Andrew Ng’s Coursera specialization, which was great, but the word “autoencoder” was never used. However there was some discussion of making use of transformed (“latent”? Staying on the safe side because I’m not familiar with that term) feature spaces. Apparently this is how face recognition systems recognize your face given only a single reference image: Map the reference image into a carefully constructed feature space, then map a new image of you into the same feature space and compute the Euclidean distance. If the distance is small enough, it’s a match.
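The matching step itself is tiny once the embeddings exist. In this sketch the embedding vectors and the 0.6 threshold are invented placeholders. In a real system the vectors would come from a trained network and the threshold would be tuned on held-out pairs.

```python
import math


def is_same_person(reference_embedding, new_embedding, threshold=0.6):
    """Declare a match when the Euclidean distance between embeddings is small."""
    return math.dist(reference_embedding, new_embedding) < threshold


# Hypothetical 3-D embeddings (real ones typically have many more dimensions).
reference = [0.10, 0.20, 0.30]
same_face = [0.15, 0.20, 0.35]
other_face = [1.00, 1.00, 1.00]

print(is_same_person(reference, same_face))
print(is_same_person(reference, other_face))
```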
Instead you could select maximally informative data points to ask them about.
In this case, information is measured by how much of thingspace would be sheared if it turned out that a data point should be classified as ‘unknown’. It isn’t immediately clear how to find this without a tractable thingspace-volume subroutine, but I think this would be computationally efficient for both of our ideas.
I’ll bet you could do some math to determine how to get the strongest statistical guarantees with the minimum amount of money spent on MTurk too.
The technique you’re probably looking for is called Bayesian Optimization. Aside: at my school, ‘Optimization’, not ‘Conspiracy’, is unfortunately the word which most frequently follows ‘Bayesian’.
If the dog is represented using a convex polytope instead of a sphere, you might even reverse engineer the corners of your current classifier region, and then display them all to the user to show how expansive the classifier’s notion of “dog” is. But the map is not the territory: It’s possible that in some cases, the shape the user wants is actually concave.
Even an imperfect estimate of the volume would be useful: for example, perhaps we only find some of the edges and conclude the volume is some fraction of its true value. I have the distinct sense of talking past the point you were trying to make, though.
Even an imperfect estimate of the volume would be useful: for example, perhaps we only find some of the edges and conclude the volume is some fraction of its true value. I have the distinct sense of talking past the point you were trying to make, though.
Good post.
No, that sounds more or less right.