Thanks to Chris Olah for a helpful conversation here.
Some more thoughts on this:
One thing that seems pretty important here is to base your evaluation on worst-case rather than average-case guarantees, and not to tie it to any particular narrow distribution. If your mechanism for judging understanding is based on an average-case guarantee over a narrow distribution, then you're sort of still in the same boat you started in with behavioral evaluations, since it's not clear why understanding that passes such an evaluation would actually help you deal with worst-case failures in the real world. This is highly related to my discussion of best-case vs. worst-case transparency here.
Another thing worth pointing out regarding using causal scrubbing for something like this is that causal scrubbing requires some base distribution to evaluate over, which means it could fall into a similar trap to the one in the first bullet point here. Presumably, if you wanted to build a causal-scrubbing-based safety evaluation, you'd just use the entire training distribution as the distribution you were evaluating over, which seems like it would help a lot with this problem, but it's still not completely clear that it would solve it, especially if you were just evaluating your average-case causal scrubbing loss over that distribution.
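To make the average-case vs. worst-case distinction concrete, here is a minimal toy sketch. The function `scrubbing_loss` below is purely a stand-in of my own invention, not the actual causal scrubbing metric: it models a hypothesis that explains the model's behavior well on typical inputs but badly on a rare tail input. The point is just that averaging the loss over the base distribution can look reassuring even when the worst case over that same distribution is catastrophic.

```python
def scrubbing_loss(example: float) -> float:
    # Hypothetical stand-in for a per-example causal scrubbing loss:
    # the hypothesis explains the model well on typical inputs (tiny loss)
    # but misses a rare failure mode on tail inputs above 0.99.
    return 10.0 if example > 0.99 else 0.01

def average_case_eval(examples):
    # Average-case evaluation: mean scrubbing loss over the base distribution.
    return sum(scrubbing_loss(x) for x in examples) / len(examples)

def worst_case_eval(examples):
    # Worst-case evaluation: max scrubbing loss over the same examples.
    return max(scrubbing_loss(x) for x in examples)

# A deterministic "base distribution" of 1000 inputs in [0, 1).
examples = [i / 1000 for i in range(1000)]

avg = average_case_eval(examples)    # small: the rare failures wash out
worst = worst_case_eval(examples)    # large: the failure mode is still there
```

Here `avg` comes out under 0.1 while `worst` is 10.0: the average-case number hides exactly the kind of rare, high-severity failure that a safety evaluation most needs to catch.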