I think like 99% reliability is about the right threshold for large models based on my napkin math.
Serious question: we have 100% of the information, so why can’t we get 100%?
Suggestion: why not test whether mechanistic interp can detect lies on out-of-distribution data 99% of the time? (It should also generalise to larger models.)
It’s a useful and well-studied benchmark. And while we haven’t decided on a test suite, [there is some useful code](https://github.com/EleutherAI/elk).
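To make the suggested test concrete, here is a minimal sketch of what "detect lies on out-of-distribution data" could look like: train a linear probe on a model's hidden states over labelled true/false statements, then measure its accuracy on a held-out topic. This is not the elk library's API; the arrays are random stand-ins for real activations, and the 99% bar is the one proposed above.

```python
# A minimal sketch (not the elk API): train a linear probe on hidden states
# to classify true vs. false statements, then check how well it transfers
# to an out-of-distribution topic. All data here are illustrative stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical activations: rows are statements, columns are hidden dims.
# In practice these would be residual-stream activations on labelled
# true/false statements; here they are random placeholders.
X_train = rng.normal(size=(1000, 512))   # in-distribution topic
y_train = rng.integers(0, 2, size=1000)  # 1 = true statement, 0 = false
X_ood   = rng.normal(size=(200, 512))    # held-out, different topic
y_ood   = rng.integers(0, 2, size=200)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
ood_accuracy = probe.score(X_ood, y_ood)

# The proposed bar: does the probe still catch lies >= 99% of the time
# on data it was never trained on?
print(f"OOD accuracy: {ood_accuracy:.3f}  (target: >= 0.99)")
```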
This is referring to 99% in the context of “amount of loss that you explain in a human-interpretable way for some component in the model” (a notion of faithfulness). For downstream tasks, either much higher or much lower reliability could be the right target (depending on the exact task).
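For clarity on that notion of faithfulness, here is one common way to operationalise “fraction of loss explained”: compare the model’s loss with the component ablated against its loss when the component is replaced by a human-interpretable approximation. The exact metric and the numbers below are an assumption for illustration, not something fixed in the thread.

```python
# A minimal sketch of "fraction of loss explained" (loss recovered).
# L_full: loss of the unmodified model; L_ablated: loss with the component
# ablated; L_interp: loss with the component replaced by a human-interpretable
# approximation. Values below are placeholders.
def loss_recovered(L_full: float, L_ablated: float, L_interp: float) -> float:
    """Fraction of the ablation-induced loss gap recovered by the
    interpretable approximation; 1.0 means fully faithful."""
    return (L_ablated - L_interp) / (L_ablated - L_full)

# Example: the interpretable replacement recovers 99% of the gap.
print(loss_recovered(L_full=2.00, L_ablated=3.00, L_interp=2.01))  # 0.99
```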