Here’s a take of mine on how you should think about causal scrubbing (CaSc) that I haven’t so far gotten around to publishing anywhere:
I think you should think of CaSc as being a way to compute a prediction made by the hypothesis. That is, when you claim that the model is computing a particular interpretation graph, and you provide the correspondence between the interpretation graph and the model, CaSc tells you a particularly aggressive prediction made by your hypothesis: your hypothesis predicts that making all the swaps suggested by CaSc won’t affect the average output of your computational graph.
Thinking about it this way is helpful to me for two reasons:
False hypotheses can make true predictions; this is basically why CaSc can fail to reject false hypotheses.
It also emphasizes why I’m unsympathetic to claims that “it sets the bar too high for something being a legit circuit”. IMO, if you claim that your model has some internal structure well described by a hypothesis that fits into the CaSc structure (which is true of almost all interp hypotheses in practice), I don’t really see how the failure of a CaSc test is compatible with that hypothesis being true (modulo my remaining questions about how bad it is for a hypothesis to get a middling CaSc score).
CaSc attempts to compute the single most aggressive prediction made by your hypothesis; this is why we do all allowed swaps. (I’m a bit confused about whether we should think of CaSc as succeeding at being the most aggressive experiment for the hypothesis, though. I think there are some subtleties here that my coworkers have worked out that I don’t totally understand.)
I think I regret that we phrased our writeup as “CaSc gives you a test of interp hypotheses” rather than saying “CaSc shows you a strong prediction made by your interp hypothesis, which you can then compare to the truth, and if they don’t match that’s a problem for your hypothesis”.
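To make the “prediction, then compare to the truth” framing concrete, here is a minimal sketch. It is not the full CaSc algorithm, just single-node resampling on a made-up dataset. The two toy models, the dataset, and all function names are assumptions for illustration. The hypothesis H is “b is irrelevant”, so H predicts that patching in b from any other datapoint leaves the average loss unchanged.

```python
from itertools import product
from statistics import mean

# Toy dataset (an assumption): b always equals a, and the label is 2 * a.
dataset = [(a, a, 2 * a) for a in (-1, 1)]  # tuples of (a, b, label)

def model_uses_b(a, b):
    return a + b  # really depends on b, so H is false for this model

def model_ignores_b(a, b):
    return 2 * a  # b is irrelevant, so H is true for this model

def squared_error(output, label):
    return (output - label) ** 2

def original_loss(model):
    return mean(squared_error(model(a, b), y) for a, b, y in dataset)

def scrubbed_loss(model):
    # H allows swapping in b from any donor datapoint; average over all
    # (original, donor) pairs, i.e. donors drawn uniformly from the dataset.
    return mean(
        squared_error(model(a, b_donor), y)
        for (a, _, y), (_, b_donor, _) in product(dataset, dataset)
    )

for model in (model_uses_b, model_ignores_b):
    print(model.__name__, original_loss(model), scrubbed_loss(model))
```

For the model that ignores b, the scrubbed loss matches the original loss, as H predicts. For the model that uses b, the two differ, which is the “problem for your hypothesis” case.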
Something I’ve realized over the last few days:
Why did we look at just the “most aggressive” experiment allowed by a hypothesis H, instead of choosing some other experiment allowed by H?
The argument for CaSc is: “if H were true, then running the full set of swaps shouldn’t affect the computation’s output, and so if the full set of swaps does affect the computation’s output, that means H is false.” But we could just as easily say “if H were true, then the output should be unaffected by any set of swaps that H says should be fine.”
Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which makes it look worst”.
I have just now realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially: given a hypothesis H, you allow an adversary to choose some other hypothesis H’, and then you run the CaSc experiment on join(H, H’). So, when explaining CaSc, I think we should plausibly describe it in terms of the hypothesis producing a bunch of allowed experiments, and then you can test your hypothesis by looking at either the maxent one or the worst one.
Thanks, that’s a useful alternative framing of CaSc!
FWIW, I think this adversarial version of CaSc would avoid the main examples in our post where CaSc fails to reject a false hypothesis. The common feature of our examples is “cancellation”, which comes from looking at an average CaSc loss. If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one), you don’t get these kinds of cancellation problems.
Plausibly you’d run into different failure modes though. In particular, I guess the maximum measure is less smooth and gives you less information on “how wrong” your hypothesis is.
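A toy sketch of the cancellation point, under heavy simplifying assumptions: the model, the dataset, and the idea of treating each individual swap as its own “experiment” are mine, not taken from the post. The (false) hypothesis H says “b is irrelevant”, so any donor b may be swapped in.

```python
from itertools import product
from statistics import mean

def model(a, b):
    return a + b  # hypothetical toy model; it does depend on b

# Uniform toy dataset of (a, b) pairs.
inputs = list(product([0, 1], [-1, 1]))

# Effect of each allowed swap on the output: patch the donor's b into the
# original input and compare against the unpatched output.
swap_effects = [
    model(a, b_donor) - model(a, b)
    for (a, b), (_, b_donor) in product(inputs, inputs)
]

average_effect = mean(swap_effects)  # what an averaged measure sees
worst_effect = max(abs(effect) for effect in swap_effects)  # what a max sees

print(average_effect, worst_effect)
```

Here the positive and negative swap effects cancel, so the average effect is zero and H passes, while the worst single swap changes the output by 2. The max is also visibly coarser: it reports one number from one swap and says nothing about how many swaps are bad.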
If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one), you don’t get these kinds of cancellation problems.
I think this “max loss” procedure is different from what Buck wrote and the same as what I wrote.
Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which makes it look worst”.
I have just now realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially: given a hypothesis H, you allow an adversary to choose some other hypothesis H’, and then you run the CaSc experiment on join(H, H’).
One thing that is not equivalent to joins, which you might also want to do, is to choose the single worst swap that the hypothesis allows. That is, if a set of node values X = {x1, x2, …} are all equivalent, you can choose to map all of them to, e.g., x1. That can be more aggressive than any scheme that partitions X and then samples randomly within each part, and it does not correspond to a join.
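Here is a small sketch of that last claim in a deliberately simple setting. The four node values, their downstream outputs, the uniform data distribution, and the use of average output as the metric are all assumptions for illustration. It enumerates every partition of X with uniform resampling inside each part, and compares that against mapping everything to one fixed value.

```python
from fractions import Fraction

# Hypothetical downstream outputs for four node values that H calls equivalent.
output_of = {"x1": 3, "x2": -1, "x3": 0, "x4": 2}
nodes = list(output_of)
baseline = Fraction(sum(output_of.values()), len(nodes))  # unscrubbed average

def partitions(items):
    """Yield every way to split items into non-empty parts."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        yield [[first]] + smaller  # first gets its own part
        for i, part in enumerate(smaller):  # or first joins an existing part
            yield smaller[:i] + [[first] + part] + smaller[i + 1:]

def deviation_random_within(partition):
    # Each value is replaced by a uniform draw from its own part.
    scrubbed_total = Fraction(0)
    for part in partition:
        part_mean = Fraction(sum(output_of[n] for n in part), len(part))
        scrubbed_total += part_mean * len(part)
    return abs(scrubbed_total / len(nodes) - baseline)

def deviation_map_all_to(target):
    # Every value is replaced by the same fixed representative.
    return abs(output_of[target] - baseline)

worst_partition = max(deviation_random_within(p) for p in partitions(nodes))
worst_fixed = max(deviation_map_all_to(n) for n in nodes)
print(worst_partition, worst_fixed)
```

In this toy, random resampling within any partition preserves the average exactly, so no partition-based experiment detects anything, while mapping everything to x1 or to x2 shifts the average by 2. With a non-uniform dataset or a different metric the partitions need not all come out at zero, so this only shows that the fixed-target swap can be strictly more aggressive, not that it always is.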