Interesting idea, but wouldn’t it run into the problem that the map F would learn both valid and invalid correlations? It should learn to predict both that the activations representing “diamond position” are active AND that the activations representing “diamond shown on camera” are active. So in a situation where those activations don’t match, it’s not clear which will be preferred. You might say that in that case the human model should be able to generate those activations using “camera was hacked” as a hypothesis, but if the hack is done in a way that the human finds incomprehensible this might not work. Put another way, the probability assigned to “diamond location neurons acting weird for some reason” might be higher than the probability assigned to “camera hacked in undetectable way”, which could be the case if the encoding of the diamond position is weird enough.
I don’t think that would happen. But suppose it somehow does: the regularization is too strong and the dataset doesn’t include any examples where the camera was hacked, so our model predicts the activations for the physical diamond and for the diamond image independently. What then? It helps to think about a toy model of that scenario, one simple enough that we can analyze it exactly. The simplest is that the variable on which we condition the generation is a “uniformly” distributed scalar; we apply two linear transformations to it to predict two values (which are meant to stand for the two diamonds) and then add Gaussian noise. Given two observed values, I’m pretty sure (I didn’t actually do the math, but it seems obvious) that the reconstructed initial value is a weighted average of the values that each observation would predict on its own. I expect that something analogous would happen in more realistic scenarios. Is this acceptable behavior? I think so, or at least it is much better than any known alternative. If we used an AI to optimize the world so that the answer to “Is the diamond in the vault?” is “Yes”, it would make sure that both the real diamond and the one in the image stay in place.
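For what it’s worth, here is a rough sketch of that toy model in Python, to check the weighted-average claim. Everything in it is my own assumption rather than part of the original proposal: the names `reconstruct` and `grid_argmax`, the slopes `a1`, `a2`, the noise scales `s1`, `s2`, and the simplification that the “uniform” prior is flat and wide enough that it never cuts off the estimate. Under those assumptions the most likely latent value is the average of the two single-observation estimates `x1 / a1` and `x2 / a2`, weighted by `(a / s) ** 2`.

```python
# Toy model (all names and numbers are illustrative assumptions):
#   z  ~ flat prior over a wide interval
#   x1 = a1 * z + Gaussian noise with std s1   ("real diamond" value)
#   x2 = a2 * z + Gaussian noise with std s2   ("diamond image" value)


def reconstruct(x1, x2, a1, a2, s1, s2):
    """Most likely z given both observations, assuming a flat prior."""
    z1, z2 = x1 / a1, x2 / a2      # what each observation implies alone
    w1 = (a1 / s1) ** 2            # precision of the estimate z1
    w2 = (a2 / s2) ** 2            # precision of the estimate z2
    return (w1 * z1 + w2 * z2) / (w1 + w2)


def grid_argmax(x1, x2, a1, a2, s1, s2, lo=-10.0, hi=10.0, steps=20001):
    """Brute-force check: maximize the Gaussian log-likelihood on a grid."""
    best_z, best_ll = lo, float("-inf")
    for i in range(steps):
        z = lo + (hi - lo) * i / (steps - 1)
        ll = (-((x1 - a1 * z) ** 2) / (2 * s1 ** 2)
              - ((x2 - a2 * z) ** 2) / (2 * s2 ** 2))
        if ll > best_ll:
            best_z, best_ll = z, ll
    return best_z


if __name__ == "__main__":
    # Mismatched observations: x1 alone implies z = 2, x2 alone implies z = 0.
    args = (4.0, 0.0, 2.0, 1.0, 1.0, 1.0)
    print(reconstruct(*args), grid_argmax(*args))
```

With these made-up numbers the closed form and the grid search agree at about 1.6, which sits between the two single-observation estimates and closer to the one with the larger `(a / s) ** 2`. If the prior interval were narrow enough to exclude that value, the estimate would be clipped to the interval’s edge, so the weighted-average picture only holds away from the boundaries.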