No, I wasn’t advocating adding a speed penalty; I was just pointing to a reason to think that a speed prior would give a more accurate answer to the question of “which is favored” than the bounded simplicity prior you’re assuming:
Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think it’s true of transformers)
But now I realise that I don’t understand why you think this is true of transformers. Could you explain? It seems to me that there are many very simple hypotheses which take a long time to calculate, and which transformers therefore can’t be representing.
The word “bounded” in “bounded simplicity prior” referred to bounded computational resources. A “bounded simplicity prior” is a prior which involves a “hard” bound on computational resources (i.e. some hypotheses are excluded), a “soft” bound (i.e. some hypotheses are down-weighted), or both, together with an inductive bias towards simplicity (specifically, it should probably behave as ~ 2^{-description complexity}). For a concrete example, see the prior I described here (without any claim to originality).
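To make the hard/soft distinction concrete, here is a minimal Python sketch. It is not the prior linked above, just a toy illustration: the `Hypothesis` class and the `hard_limit` and `soft_penalty` parameters are hypothetical names, and the soft bound is assumed to be a log-runtime penalty added to the description length (similar in spirit to Levin-style length-plus-log-time weighting).

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Hypothesis:
    """Toy hypothesis; all fields are illustrative assumptions."""
    name: str
    description_bits: int  # description complexity, in bits
    runtime_steps: int     # compute needed to evaluate it (must be >= 1)


def bounded_simplicity_prior(
    hypotheses: Sequence[Hypothesis],
    hard_limit: Optional[int] = None,
    soft_penalty: float = 0.0,
) -> Dict[str, float]:
    """Return normalized prior weights over the given hypotheses.

    Unnormalized weight is 2^-(description_bits + soft_penalty * log2(runtime_steps)),
    or 0 if runtime_steps exceeds hard_limit (the "hard" bound).
    """
    weights: Dict[str, float] = {}
    for h in hypotheses:
        if hard_limit is not None and h.runtime_steps > hard_limit:
            weights[h.name] = 0.0  # hard bound: hypothesis is excluded
            continue
        # soft bound: slower hypotheses are down-weighted, not excluded
        exponent = h.description_bits + soft_penalty * math.log2(h.runtime_steps)
        weights[h.name] = 2.0 ** (-exponent)
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("every hypothesis was excluded by the hard bound")
    return {name: w / total for name, w in weights.items()}


# A very simple but slow hypothesis versus a more complex but fast one.
hyps = [
    Hypothesis("simple_slow", description_bits=10, runtime_steps=2**40),
    Hypothesis("complex_fast", description_bits=20, runtime_steps=2**4),
]

unbounded = bounded_simplicity_prior(hyps)                      # favors simple_slow
hard = bounded_simplicity_prior(hyps, hard_limit=2**20)         # excludes simple_slow
soft = bounded_simplicity_prior(hyps, soft_penalty=1.0)         # favors complex_fast
```

With no bound, the simple-but-slow hypothesis dominates. Under either kind of bound, the weight shifts to the hypothesis that fits in the compute budget, which is the sense of “bounded” I had in mind.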
Ah, I see. That makes sense now!