I also wonder how much interpretability LM agents might help here, e.g. as they could make much cheaper scaling the ‘search’ to many different undesirable kinds of behaviors.
I also wonder how much interpretability LM agents might help here, e.g. as they could make much cheaper scaling the ‘search’ to many different undesirable kinds of behaviors.