knowing when to stop itself requires understanding our values. If you can tell me whether a certain scenario contains events which, were I to grasp them, would prompt me to adjust my concepts, then you must know a lot about my values [...] an understanding of our values was obtained from a finite narrow dataset of non-value-relevant questions. How could that be possible?
Presumably, the finite narrow dataset did teach me something about your values? In the paragliding example, the “easy set” consisted of pilots who were either licensed-and-experienced or not-licensed-and-not-experienced. It seems like it ought to be possible to notice this, and refuse to pronounce judgement on cases where licensedness and experience diverge? (I’m not an ML specialist, but I’ve heard the phrase “out-of-distribution detection.”)
Of course, this does require making some assumptions about your goals: if the unlicensed and inexperienced pilots in the easy set happened to be named John, Sarah, and Kelly, the system needs some sort of inductive bias to guess that the feature you wanted was “unlicensed and inexperienced” rather than “named John, Sarah, or Kelly”. But this could be practical if some concepts are more natural than others: when I look at GANs that generate facial images, it sure seems like they’re meeting both a Safety requirement (not generating non-faces) and a Generalization requirement (generating new faces that weren’t in the training dataset). What am I missing?
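To gesture at what I mean, here's a toy sketch (the feature names and the sklearn density model are just my own assumptions about one simple way to do OOD detection): fit a density model on the easy-set feature combinations and abstain whenever a new case, like licensed-but-inexperienced, doesn't resemble anything seen in training.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Toy "easy set": features are (licensed, experienced), and in training they
# always agree; licensed-but-inexperienced combinations never appear.
X_easy = np.array([[1, 1]] * 50 + [[0, 0]] * 50, dtype=float)
X_easy += rng.normal(scale=0.05, size=X_easy.shape)   # a little measurement noise

# Density model over the training inputs, used purely for OOD detection.
density = KernelDensity(bandwidth=0.3).fit(X_easy)
threshold = np.quantile(density.score_samples(X_easy), 0.05)

def judge_or_abstain(x):
    """Answer on in-distribution inputs; refuse to pronounce judgement otherwise."""
    if density.score_samples([x])[0] < threshold:
        return "abstain"                # e.g. licensed-but-inexperienced
    return int(x[0] > 0.5)              # in the easy set, either feature works

print(judge_or_abstain([1.0, 1.0]))     # seen-before combination -> 1
print(judge_or_abstain([1.0, 0.0]))     # licensure and experience diverge -> "abstain"
```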
When I look at GANs that generate facial images, it sure seems like they’re meeting both a Safety requirement (not generating non-faces) and a Generalization requirement (generating new faces that weren’t in the training dataset). What am I missing?
Disclaimer: not 100% sure that this point holds for all models.
This works great when you run them in a generative mode, creating new images. However, at the paragliding event, presumably they needed a discriminative mode: distinguishing pilots who are licensed and experienced from those who are not.
There are various ways that one can run generative models in a discriminative mode. For instance, a generative model implicitly has a probability distribution, so one could look at the size of P(image|model) and see whether that is high; if it is, one might interpret the image as being in-distribution. Alternatively, many generative models including GANs have some latent variables, so one could invert the model and then take a simple classifier over those latent variables.
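For concreteness, the first option might look roughly like this; I'm assuming a likelihood-based model that exposes a log_prob method (a normalizing flow or VAE would, a plain GAN wouldn't without extra machinery):

```python
import torch

def make_in_distribution_check(model, calibration_images, quantile=0.05):
    """Threshold log P(image | model) at a low quantile of known-good data.

    `model` is assumed to be a likelihood-based generative model (e.g. a
    normalizing flow) exposing `log_prob(batch) -> per-example log-likelihood`;
    a plain GAN would first need a density estimate or an inverted latent.
    """
    with torch.no_grad():
        logp = model.log_prob(calibration_images)       # shape: (N,)
    threshold = torch.quantile(logp, quantile)

    def is_in_distribution(images):
        with torch.no_grad():
            return model.log_prob(images) >= threshold  # boolean mask per image
    return is_in_distribution

# Usage (hypothetical objects):
#   check = make_in_distribution_check(face_flow, held_out_faces)
#   check(new_images)  -> tensor of booleans
```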
The issue is then that while the generative mode is highly consistent in safety and generalization, this doesn't imply that the discriminative mode necessarily will be; in fact, AFAIK it usually is not. For instance, if you are thresholding P(image|model), then most of the probability loss comes from having to account for noise, so you can trick the model with all sorts of extremely non-face images as long as they contain little noise. And if you invert the model and take a classifier over the latent variables, you could probably jump to some exponentially small subspace with the properties you want.
Presumably, the finite narrow dataset did teach me something about your values? [...] “out-of-distribution detection.”
Yeah, I do actually think that "out-of-distribution detection" is what we want here. But it gets really subtle. Consider a model that learns, when answering "is the diamond in the vault?", that it's okay for the physical diamond to be in different positions and orientations within the vault. So even though it has not seen the diamond in every possible position and orientation in the training set, it's still not "out of distribution" to see the diamond in a new position and answer the question confidently. And what if the diamond is somehow left/right mirror-imaged while it is in the vault? That is also probably fine for diamonds.

But now suppose that instead of a diamond in a vault, we are learning to do some kind of robotic surgery, and the question we are asking is "is the patient healthy?". In this case too, we would hope that the machine learning system learns that it's okay for the patient to undergo (small) changes of position and orientation, so that much is not "out of distribution". But here we really would not want to move ahead with a plan that mirror-images our patient, because then the patient wouldn't be able to digest any food that currently exists on Earth and would starve. So it seems like the "out of distribution" property we want is really "out of distribution with respect to our values".
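To make "out of distribution with respect to our values" a bit more concrete, here's a toy sketch where the allowed invariances are written down by hand per question; of course the whole difficulty is that we want the reporter to learn something like this table rather than have us enumerate it:

```python
# Toy illustration: whether a transformation counts as "in distribution"
# depends on the question, i.e. on what we care about, not just on the inputs.
# The table below is hand-written; the hard part is learning it automatically.
VALUE_INVARIANCES = {
    "is the diamond in the vault?": {"translation", "rotation", "mirror_image"},
    "is the patient healthy?":      {"translation", "rotation"},   # no mirroring!
}

def plan_is_in_distribution(question, transformations_in_plan):
    """Flag a plan as OOD if it applies any transformation the question's
    answer is not known to be invariant to."""
    allowed = VALUE_INVARIANCES[question]
    return all(t in allowed for t in transformations_in_plan)

print(plan_is_in_distribution("is the diamond in the vault?", {"mirror_image"}))  # True
print(plan_is_in_distribution("is the patient healthy?", {"mirror_image"}))       # False
```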
Now you might say that mirror-imaging ought to be "out of distribution" in both cases, even though it would be harmless in the case of the diamond. That's reasonable, but it's not so easy to see how our reporter would learn that on its own. We could just outlaw any very sophisticated plan, but then we lose competitiveness with systems that are more lax about safety.
it sure seems like they’re meeting both a Safety requirement (not generating non-faces) and a Generalization requirement (generating new faces that weren’t in the training dataset). What am I missing?
Well, we might have a predictor that is a perfect statistical model of the thing it was trained on, but the ontology identification issue is about what kinds of safety-critical questions can be answered based on the internal computations of such a model. So in the case of a GAN, we might try to answer "is this person lying?" based on a photo of their face, and we might hope that, having trained the GAN on the general-purpose face-generation problem, the latent variables within the GAN contain the features we need to do visual lie detection. Now even if the GAN does perfectly safe face generation, we need some additional work to get a safety guarantee on our reporter, and this is difficult because we want to do it based on a finite narrow dataset.
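Concretely, the kind of thing I have in mind is a probe over the predictor's internals: freeze the GAN, map images into its latent space, and fit a small classifier on those latents using the narrow labelled dataset. A rough sketch (the encoder / inversion network is assumed here, not something a GAN gives you for free):

```python
import torch
import torch.nn as nn

def fit_reporter_probe(encoder, latent_dim, images, labels, epochs=200, lr=1e-3):
    """Fit a linear probe over a frozen generative model's latents.

    `encoder` is assumed to map images into the GAN's latent space (e.g. a
    GAN-inversion network); only the probe's weights get trained.
    """
    probe = nn.Linear(latent_dim, 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    with torch.no_grad():
        z = encoder(images)                  # predictor internals, frozen
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(z).squeeze(-1), labels.float())
        loss.backward()
        opt.step()
    return probe

# Fitting this on the finite narrow dataset is easy; the problem is saying
# anything guaranteed about how the probe behaves off that dataset.
```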
One further thought: suppose we trained a predictive model to answer the same question as the reporter itself, and suppose we stipulated only that the reporter ought to be as safe and general as the predictor is. Then we could just take the output of the predictor as the reporter's output and we'd be done. Now what if we trained a predictive model to answer a question that is a kind of "immediate logical neighbor" of the reporter's question, such as "is the diamond in the left half of the vault?" where the reporter's question is "is the diamond in the vault?" Then we should also be able to specify and meet a safety guarantee phrased in terms of the relationship between the correctness of the reporter and that of the predictor. Interested in your thoughts on this.
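As a sketch of the kind of guarantee I mean for the "logical neighbor" case, a checkable consistency constraint relating the two outputs (question strings purely illustrative):

```python
def neighbor_consistent(reporter_in_vault: bool, predictor_in_left_half: bool) -> bool:
    """Logical-neighbor check: 'in the left half of the vault' entails 'in the vault'.

    If the predictor says the diamond is in the left half but the reporter
    denies it is in the vault at all, at least one of the two is wrong.
    """
    return reporter_in_vault or not predictor_in_left_half

# A relational safety property: on every scenario where the predictor is
# correct, the reporter must not violate this entailment.
assert neighbor_consistent(reporter_in_vault=True, predictor_in_left_half=True)
assert not neighbor_consistent(reporter_in_vault=False, predictor_in_left_half=True)
```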