A proxy that may be slightly less imperfect is auto-interp, a technique introduced by Bills et al. We take the text that most strongly activates a proposed feature and have an LLM like GPT-4 or Gemini Ultra try to find an explanation for the common pattern in these texts. We then give the LLM this natural language explanation along with some new text, have it predict the activations on that text (often quantized to integers between 0 and 10), and score the explanation by how well those predicted activations match the true ones.
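Here is a minimal sketch of the scoring step, assuming we already have the LLM's simulated activations for the new text. The quantization convention (integers 0–10, scaled by the max activation) and the use of Pearson correlation as the score are one common variant; the helper name `simulate_with_llm` is hypothetical, standing in for whatever call asks the explainer model to predict per-token activations from the explanation.

```python
import numpy as np

def quantize_activations(acts: np.ndarray, num_bins: int = 10) -> np.ndarray:
    """Rescale raw feature activations to integers in [0, num_bins]."""
    max_act = acts.max()
    if max_act <= 0:
        return np.zeros_like(acts, dtype=int)
    return np.round(num_bins * acts / max_act).astype(int)

def simulation_score(true_acts: np.ndarray, predicted_acts: np.ndarray) -> float:
    """Score an explanation by how well the simulated activations track the real ones.

    This uses Pearson correlation across tokens; other scoring variants
    (e.g. explained variance on top-activating sequences) follow the same pattern.
    """
    if np.std(true_acts) == 0 or np.std(predicted_acts) == 0:
        return 0.0
    return float(np.corrcoef(true_acts, predicted_acts)[0, 1])

# Hypothetical usage:
# true = quantize_activations(feature_activations_on_new_text)
# pred = simulate_with_llm(explanation, new_text_tokens)  # LLM-predicted integers in [0, 10]
# print(simulation_score(true, pred))
```

A high score means the natural language explanation is good enough for another model to reproduce the feature's behaviour on unseen text, which is the sense in which it acts as a (still imperfect) proxy for interpretability.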
This seems conceptually very related to cycle-consistency and backtranslation losses, on which there are large existing literatures that may be worth a look, including e.g. theoretical results like On Translation and Reconstruction Guarantees of the Cycle-Consistent Generative Adversarial Networks or Towards Identifiable Unsupervised Domain Translation: A Diversified Distribution Matching Approach.