Bogdan Ionut Cirstea comments on Interpreting the Learning of Deceit

Bogdan Ionut Cirstea 11 May 2024 0:42 UTC
1 point
I also wonder how much interpretability LM agents might help here, e.g. by making it much cheaper to scale the ‘search’ to many different kinds of undesirable behavior.