This means they’re somewhat problematic for OOD use cases like treacherous turn detection or detecting misgeneralization.
I kinda want to push back on this, since OOD behavior is not obviously OOD in the activations. Misgeneralization, especially, might be better thought of as an OOD environment paired with on-distribution activations.
I think we should come back to this question once SAEs have tackled something like variable binding. Right now it’s hard to say how SAEs are going to help us understand more abstract thinking, and so it’s hard to say how problematic they’ll be for detecting things like a treacherous turn. I think this will depend on how representations factor: in the ideal world, they generalize along with the model’s ability to generalize (apologies for how high level / vague that idea is).
Some experiments I’d be excited to look at:
If the SAE is trained on a subset of the training distribution, can we tell when it’s being used to decompose activations from data points outside that subset?
How does that compare to an SAE trained on the model’s whole training distribution, but where the model itself is being pushed off distribution? (A toy sketch of the first setup is below.)
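Here’s a minimal sketch of what the first experiment could look like. Everything in it is a stand-in: `TinySAE` is a toy ReLU sparse autoencoder, and the random Gaussians play the role of residual-stream activations from two slices of the training distribution. The point is just that the reconstruction error (or feature activation statistics) of a subset-trained SAE gives a per-datapoint score you can compare across the held-in and held-out slices.

```python
import torch
import torch.nn as nn


class TinySAE(nn.Module):
    """Toy sparse autoencoder: ReLU encoder, linear decoder, L1 penalty on features."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(feats), feats


def train_sae(sae, acts, l1_coeff=1e-3, steps=200, lr=1e-3):
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        recon, feats = sae(acts)
        loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae


def recon_error(sae, acts):
    """Per-datapoint reconstruction error, usable as a crude anomaly score."""
    with torch.no_grad():
        recon, _ = sae(acts)
        return ((recon - acts) ** 2).mean(dim=-1)


torch.manual_seed(0)
d_model = 64
# Stand-ins for activations from two slices of the training distribution;
# in a real experiment these would be cached from the model.
in_subset = torch.randn(2048, d_model)          # slice the SAE is trained on
off_subset = torch.randn(2048, d_model) + 2.0   # held-out slice

sae = train_sae(TinySAE(d_model, d_hidden=256), in_subset)

print(f"mean recon error, in-subset:  {recon_error(sae, in_subset).mean().item():.4f}")
print(f"mean recon error, off-subset: {recon_error(sae, off_subset).mean().item():.4f}")
# A large gap suggests the score separates "activations the SAE was fit on"
# from "activations it wasn't" -- the first thing we'd want to check.
```

The same scaffolding should work for the second experiment, swapping the held-out slice for activations collected while the model is being pushed off distribution.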
I think I’m trying to get at—can we distinguish:
Anomalous activations.
Anomalous data points.
Anomalous mechanisms.
Lots of great work to look forward to!