wassname comments on On how various plans miss the hard bits of the alignment challenge

wassname 15 Feb 2024 23:12 UTC
1 point
I know this is a necro bump, but could you describe the ambitious interp work you have in mind?

Perhaps something like: a probe detects helpfulness with >90% accuracy, and it transfers to other models without retraining once we calibrate it on a couple of unrelated concepts.
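
To make the kind of result I mean concrete, here is a toy sketch on synthetic data, not a claim about real models. Everything in it is an assumption for illustration: the "activations" are random vectors, "helpfulness" is a single hidden direction, the second model is just the first in a rotated basis, and calibration is an orthogonal Procrustes fit on paired activations for unrelated inputs. All names (`train_probe`, `anchors_a`, `b_to_a`, etc.) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical activation width

# Toy "model A": the label depends on one hidden direction in activation space.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

def sample(n):
    """Draw synthetic activations and binary 'helpfulness' labels."""
    x = rng.normal(size=(n, d))
    y = (x @ direction > 0).astype(float)
    return x, y

def train_probe(x, y, steps=500, lr=0.5):
    """Logistic-regression probe fit by plain gradient descent."""
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        w -= lr * x.T @ (p - y) / len(y)
    return w

def accuracy(w, x, y):
    return float(np.mean((x @ w > 0) == (y > 0.5)))

# Toy "model B": the same features in a rotated basis (a strong assumption).
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Train the probe on model A only.
x_train, y_train = sample(2000)
w = train_probe(x_train, y_train)

x_test, y_test = sample(1000)
x_test_b = x_test @ rotation  # the same inputs as seen by model B

# Calibration set: paired activations on unrelated inputs, no helpfulness labels.
anchors_a = rng.normal(size=(64, d))
anchors_b = anchors_a @ rotation

# Orthogonal Procrustes: orthogonal R minimizing ||anchors_b @ R - anchors_a||.
u, _, vt = np.linalg.svd(anchors_b.T @ anchors_a)
b_to_a = u @ vt

acc_a = accuracy(w, x_test, y_test)                  # probe on model A
acc_b_raw = accuracy(w, x_test_b, y_test)            # probe on B, no calibration
acc_b_cal = accuracy(w, x_test_b @ b_to_a, y_test)   # probe on B after calibration
print(acc_a, acc_b_raw, acc_b_cal)
```

In this toy, the calibrated probe matches its accuracy on model A because the two models differ only by a rotation, and the anchors pin that rotation down only because there are at least as many of them as dimensions. Real models would not be this clean, which is what would make the result interesting.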