I know this is a necro bump, but could you describe the ambitious interp work you have in mind?
Perhaps something like: a probe can detect helpfulness with >90% accuracy, and it works on other models without retraining once we calibrate it on a couple of unrelated concepts?