As I read this, your proposal seems to hinge on a speed prior (or simplicity prior) over F such that F generalizes well from the simple training set to the complex world. You could be more precise by explaining how the speed prior (enforced by C) selects direct translation over simulation. Reading your discussion, the disagreement seems to stem from differing views on the effect of the speed prior, i.e., are translators faster or less complex than imitators?
The disagreement stems from them not understanding my proposal; I hope it is now clear what I meant. I explained in my submission why the speed prior works, but in a nutshell: both the “direct generator” and the “generator of worlds that a human thinks are consistent with a QA pair” need to generate the activations that represent the world. The direct generator does it directly (e.g. “diamond” → representation of a diamond), while the other one performs additional computation to determine which worlds a human would think are consistent with a QA pair (e.g. “diamond” → it might be a diamond, but it might also be just a screen in front of the camera, and what is really in the vault is…). On a training set where the labels are always correct, the direct generator is preferred.
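To make the asymmetry concrete, here is a toy sketch (my own construction, not the author's actual setup; the world strings and the `CONSISTENT_WORLDS` table are purely illustrative): the direct generator is a one-step lookup from answer to world representation, while the human-simulator branch must additionally enumerate and choose among the worlds a human would judge consistent with the answer.

```python
# Toy illustration of "direct generator" vs "human-consistency generator".
# All names and world descriptions here are hypothetical.

def direct_generator(answer):
    # One-step mapping: answer -> canonical world state.
    return {"diamond": "world with diamond in vault"}[answer]

# Worlds a human would judge consistent with each answer (extra structure
# the human-simulator branch has to represent and search over).
CONSISTENT_WORLDS = {
    "diamond": [
        "world with diamond in vault",
        "world with a screen in front of the camera, diamond elsewhere",
    ],
}

def human_simulator(answer):
    # Additional computation: enumerate the consistent worlds, then a
    # further step would be needed to pick among them.
    return CONSISTENT_WORLDS[answer]

print(direct_generator("diamond"))
print(len(human_simulator("diamond")))  # more than one candidate world
```

The point is only that the second branch carries an irreducible extra step (the consistency search), which is what the speed prior is claimed to penalize.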
Hmm, why would it require additional computation? The counterexample does not need to be an exact human imitator, only a non-translator that performs well in training. In the worst case there exist multiple parts of the predictor’s activations that correlate with “diamond”, so there are multiple ‘good results’ reachable just by shifting parameters in the model.
A generative model isn’t like a regression model. In a regression, if we have two strongly correlated variables and want to predict a third, we can shift the parameters onto either of them and get very close to what we would get by using both. In a generative model, on the other hand, we need to predict both; no variable is privileged, so we can’t just shift the parameters. See my reply to interstice on what I think would happen in the worst-case scenario where we have failed with the implementation details and that happens. The result wouldn’t be that bad.
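This regression/generative asymmetry can be checked numerically. A minimal sketch with NumPy (my own toy, under the assumption of two near-duplicate features and unit-variance Gaussian likelihoods): in regression, shifting all the weight onto one of two correlated features barely changes the fit, whereas a generative model is scored on how well it explains *both* features, so the correlation structure cannot be dropped for free.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # strongly correlated near-copy of x1
y = x1 + x2                            # target depends on both

# Regression: many weight settings fit almost equally well.
mse_both = np.mean((y - (x1 + x2)) ** 2)   # weights (1, 1)
mse_shift = np.mean((y - 2 * x1) ** 2)     # weights (2, 0): all weight on x1
print(mse_both, mse_shift)  # both tiny: the weights can be shifted freely

# Generative: the model must also explain x2. Compare Gaussian log-likelihoods
# (shared constants dropped) of x2 under a model that uses the correlation
# ("x2 is x1 plus small noise") vs one that ignores it ("x2 is unit noise").
ll_tight = np.sum(-0.5 * ((x2 - x1) / 0.01) ** 2 - np.log(0.01))
ll_loose = np.sum(-0.5 * x2 ** 2)
print(ll_tight > ll_loose)  # True: generatively, the structure is not optional
```

So a regressor can trade one correlated feature for the other at almost no cost, but a generative model that throws away the dependence between them pays a large likelihood penalty.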
If you can think of another algorithm as simple as a direct generator that performs well in training, say so. I think that, almost by definition, the direct generator is the simplest one.
And if we make a good-enough but still human-level dataset (although this isn’t a requirement for my approach to work), the only robust and simple correlation that remains is the one we are interested in.
Ah, I missed that it was a generative model. If you don’t mind I’d like to extend this discussion a bit. I think it’s valuable (and fun).
I do still think it can go wrong. The joint distribution can shift after training due to confounding factors and effect modification. The latter is more dangerous, because for the purposes of reporting a confounder matters less (I think), but effect modification can move you outside any distribution you’ve seen in training. And it can be something really stupid you forgot in your training set, like the action of turning off the lights causing some sensors to work while others do not.
You might say, “ah, but the information about the diamond is the same”. But I don’t think that applies here. It might be that the predictor’s state as a whole encodes the whereabouts of the diamond, and the shift might make it unreadable.
I think it’s very likely that the real world has effect modification that is not in the training data, simply because the space of possibilities is infinite. When the shift occurs, your P(z|Q,A) becomes small, causing us to reject everything outside the learned distribution. That is safe, but it also seems to defeat the purpose of our super-smart predictor.
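The rejection behavior described here can be sketched in a few lines. This is a toy stand-in, assuming we model the learned P(z|Q,A) as a one-dimensional Gaussian fit to training latents and reject anything whose log-density falls below a low quantile of the training scores; the variable names and threshold choice are mine, not from the proposal.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the learned P(z | Q, A): a Gaussian fit to training latents.
train_z = rng.normal(loc=0.0, scale=1.0, size=5000)
mu, sigma = train_z.mean(), train_z.std()

def log_p(z):
    # Gaussian log-density under the fitted model.
    return -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

# Reject anything the learned distribution considers too unlikely:
# here, below the bottom 0.1% of training log-densities.
threshold = np.quantile(log_p(train_z), 0.001)

in_dist = 0.5    # a training-like latent: accepted
shifted = 8.0    # a latent after an unseen effect-modification shift: rejected
print(log_p(in_dist) > threshold, log_p(shifted) > threshold)  # True False
```

This is exactly the trade-off in the paragraph above: the shifted latent is safely refused, but the refusal also means the system stays silent precisely in the novel situations where a superhuman predictor would be most useful.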
As an aside, I think that this property of regression models, together with small networks and poor regularization, might be why adversarial examples exist (see http://gradientscience.org/adv.pdf). Some features might not be robust. If we have an image of a cat and the model depends on some non-robust feature to tell it apart from dogs, we might be able to use the many degrees of freedom available to make a cat look like a dog. On the other hand, with a method like this one, we would need to find an image of a cat that is more likely to have been generated from the input “dog” than from the input “cat”; that’s probably not going to happen.
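A small numerical caricature of this contrast (entirely my own toy, not from the linked paper): each point has a robust feature with large class separation and a non-robust feature with tiny separation. A discriminative rule that leans on the non-robust feature flips under a tiny nudge; a generative comparison of class-conditional likelihoods does not, because the robust feature dominates the densities.

```python
import numpy as np

mu_cat = np.array([+1.0, +0.01])   # [robust, non-robust] means for "cat"
mu_dog = np.array([-1.0, -0.01])   # ... and for "dog"

def log_p(x, mu):
    # Unit-variance Gaussian log-density, up to a shared constant.
    return -0.5 * np.sum((x - mu) ** 2)

x = np.array([1.0, 0.01])              # a clear cat
x_adv = x + np.array([0.0, -0.05])     # tiny nudge flipping the non-robust feature

# A classifier relying only on the non-robust feature flips its answer:
disc = lambda p: "cat" if p[1] > 0 else "dog"
print(disc(x), disc(x_adv))            # cat dog

# The generative comparison still prefers "cat": the nudge barely moves the
# likelihoods, and the robust feature dominates.
gen = lambda p: "cat" if log_p(p, mu_cat) > log_p(p, mu_dog) else "dog"
print(gen(x), gen(x_adv))              # cat cat
```

To flip the generative label, the perturbation would have to move the image most of the way to the other class mean, i.e. actually make the cat look like a dog under the model’s densities, rather than exploiting one fragile direction.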
Could be! Though in my head I see it as a self-centering Monte Carlo sampling of a distribution mimicking some training distribution, GANs not being the only member of that group. The drawback is that you can never leave that distribution; if your training is narrow, your model is narrow.