Presumably, the finite narrow dataset did teach me something about your values? [...] “out-of-distribution detection.”
Yeah, I do actually think that “out of distribution detection” is what we want here, but it gets really subtle. Consider a model that learns, when answering “is the diamond in the vault?”, that it’s okay for the physical diamond to be in different positions and orientations within the vault. Even though it has not seen the diamond in every possible position and orientation during training, it’s still not “out of distribution” to see the diamond in a new position and answer the question confidently. And what if the diamond is somehow left/right mirror-imaged while it is in the vault? That is also probably fine for diamonds. But now suppose that instead of a diamond in a vault, we are learning to do some kind of robotic surgery, and the question we are asking is “is the patient healthy?”. In this case too we would hope that the machine learning system learns that it’s okay for the patient to undergo (small) changes of position and orientation, so that much is not “out of distribution”, but here we really would not want to move ahead with a plan that mirror-images our patient, because then the patient wouldn’t be able to digest any food that currently exists on Earth and would starve. So it seems like the “out of distribution” property we want is really “out of distribution with respect to our values”.
Now you might say that mirror-imaging ought to be “out of distribution” in both cases, even though it would be harmless in the case of the diamond. That’s reasonable, but it’s not easy to see how our reporter would learn that on its own. We could simply outlaw any very sophisticated plan, but then we lose competitiveness with systems that are more lax about safety.
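To make this concrete, here is a toy sketch (the features and numbers are entirely made up, not anything from the actual setup): a standard Mahalanobis-style novelty score only notices deviations along the features it is computed over, so whether mirror-imaging registers as “out of distribution” depends on whether a chirality-sensitive feature is included in the score, and that choice is a statement about what we value for the task rather than something the data settles on its own.

```python
# Toy sketch (hypothetical features, not from the original setup): a Mahalanobis-style
# novelty score only flags deviations along the features it is computed over, so whether
# a mirror-imaged state counts as out-of-distribution depends on whether a
# chirality-sensitive feature is part of the representation we score against.
import numpy as np

rng = np.random.default_rng(0)

# Training observations: position/orientation vary freely, chirality is always +1.
n = 1000
pose = rng.normal(0.0, 1.0, size=(n, 3))          # x, y, rotation angle
chirality = np.ones((n, 1))                        # never mirrored in training
train = np.hstack([pose, chirality])

def mahalanobis_score(x, data, eps=1e-3):
    """Squared Mahalanobis distance of x from the empirical training distribution."""
    mu = data.mean(axis=0)
    cov = np.cov(data, rowvar=False) + eps * np.eye(data.shape[1])
    diff = x - mu
    return float(diff @ np.linalg.inv(cov) @ diff)

new_pose     = np.array([2.0, -1.5, 0.7, 1.0])     # unseen pose, same chirality
mirror_image = np.array([0.0,  0.0, 0.0, -1.0])    # typical pose, flipped chirality

# Scored over all features: the mirror image looks wildly novel, the new pose does not.
print(mahalanobis_score(new_pose, train))                  # small
print(mahalanobis_score(mirror_image, train))              # very large

# Scored over pose features only (treating chirality as value-irrelevant, as for the
# diamond): the mirror image no longer registers as out-of-distribution at all.
print(mahalanobis_score(mirror_image[:3], train[:, :3]))   # small
```

For the diamond task we could happily drop the chirality feature from the score; for the surgery task we could not, which is the sense in which the relevant notion is “out of distribution with respect to our values”.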
it sure seems like they’re meeting both a Safety requirement (not generating non-faces) and a Generalization requirement (generating new faces that weren’t in the training dataset). What am I missing?
Well, we might have a predictor that is a perfect statistical model of the thing it was trained on, but the ontology identification issue is about what kinds of safety-critical questions can be answered based on the internal computations of such a model. In the case of a GAN, we might try to answer “is this person lying?” based on a photo of their face, and we might hope that, having trained the GAN on the general-purpose face-generation problem, its latent variables contain the features we need for visual lie detection. But even if the GAN does perfectly safe face generation, we need some additional work to get a safety guarantee on our reporter, and this is difficult because we want to do it based on a finite narrow dataset.
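As a minimal sketch of what that additional work is up against (all names and data below are stand-ins I’m making up, not a real pipeline): the reporter is just a small probe trained on a finite labeled set of the predictor’s latents, so its held-out accuracy says nothing about the inputs we actually worry about.

```python
# Minimal sketch (hypothetical stand-ins throughout): train a small "reporter" probe on
# the predictor's latent variables. The GAN latents are replaced here by random vectors;
# in practice they would come from inverting or encoding into the trained generator.
# The point is only that the reporter is a separate, finitely-trained model sitting on
# top of the predictor's internals, so the predictor's guarantees do not automatically
# transfer to it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for latent codes extracted from the face-generation model.
latents = rng.normal(size=(500, 128))
# Stand-in for the narrow labeled dataset ("is this person lying?").
labels = (latents[:, :4].sum(axis=1) > 0).astype(int)

z_train, z_test, y_train, y_test = train_test_split(
    latents, labels, test_size=0.2, random_state=0
)

# The reporter: a linear probe reading the answer off the predictor's latents.
reporter = LogisticRegression(max_iter=1000).fit(z_train, y_train)

# Held-out accuracy on the narrow dataset says nothing about inputs far from it,
# which is exactly where the safety question bites.
print("held-out accuracy:", reporter.score(z_test, y_test))
```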
One further thought: suppose we trained a predictive model to answer the same question as the reporter itself, and stipulated only that the reporter ought to be as safe and general as that predictor. Then we could just take the predictor’s output as the reporter’s output and we’d be done. Now what if we trained a predictive model to answer a question that is a kind of “immediate logical neighbor” of the reporter’s question, such as “is the diamond in the left half of the vault?” where the reporter’s question is “is the diamond in the vault?” Then we should also be able to specify and meet a safety guarantee phrased in terms of the relationship between the correctness of the reporter and that of the predictor. Interested in your thoughts on this.
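Here is a rough sketch of the kind of check I have in mind (the interface is hypothetical): since “the diamond is in the left half of the vault” entails “the diamond is in the vault”, we can test the reporter against the neighbor predictor for violations of that entailment without any ground-truth labels.

```python
# Sketch of the "immediate logical neighbor" idea (hypothetical interface): "in the
# left half of the vault" entails "in the vault", so any scenario where the neighbor
# predictor says yes and the reporter says no is evidence that at least one of them
# is wrong -- a consistency signal that needs no ground truth.
from typing import Callable, Iterable

def consistency_violations(
    scenarios: Iterable[object],
    reporter_in_vault: Callable[[object], bool],
    predictor_in_left_half: Callable[[object], bool],
) -> list[object]:
    """Return scenarios where the entailment left-half => in-vault is violated."""
    return [
        s for s in scenarios
        if predictor_in_left_half(s) and not reporter_in_vault(s)
    ]

# Toy usage with dictionaries standing in for scenarios.
scenarios = [
    {"id": 1, "left_half": True,  "in_vault": True},
    {"id": 2, "left_half": True,  "in_vault": False},   # inconsistent answers
    {"id": 3, "left_half": False, "in_vault": False},
]
bad = consistency_violations(
    scenarios,
    reporter_in_vault=lambda s: s["in_vault"],
    predictor_in_left_half=lambda s: s["left_half"],
)
print([s["id"] for s in bad])   # [2]
```

Of course this only bounds the reporter’s correctness relative to the neighbor predictor, so the guarantee is exactly as good as whatever guarantee we can get on that predictor.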