Can you say more about why you would want this to exist? Is it just that “do auto-interpretability well” is a close proxy for “model could be used to help with safety research”? Or are you also thinking about deception / sandbagging, or other considerations.
Can you say more about why you would want this to exist? Is it just that “do auto-interpretability well” is a close proxy for “model could be used to help with safety research”? Or are you also thinking about deception / sandbagging, or other considerations.