Agreed, though I do find framing them as a warped predictor helpful in some cases. In principle, the deviation from the original unbiased prediction, taken over all inputs, should contain all of the agentic behaviors, and there might be some way to extract goals from that bias vector. (I don’t have anything super concrete here, and I’m not super optimistic that this framing gives you anything extra compared to other interpretability mechanisms, but it’s something I’ve thought about poking at.)