Stephen Fowler comments on Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Stephen Fowler 8 Dec 2024 2:12 UTC
3 points
Would you expect the applications to interpretability to carry over to inputs that are radically out of distribution?

My naive intuition is that, by taking derivatives, you are only describing local behaviour.
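To make that intuition concrete, here is a toy sketch. It is not taken from the post: `tanh` is only a stand-in nonlinearity I picked, and the names `f`, `df`, and `linearized` are hypothetical. It compares a first-order (derivative-based) approximation around a point `x0` with the true function, once close to `x0` and once far from it.

```python
import math

def f(x):
    # Stand-in nonlinearity (an assumption for illustration, not the post's model).
    return math.tanh(x)

def df(x):
    # Exact derivative of tanh.
    return 1.0 - math.tanh(x) ** 2

def linearized(x, x0):
    # First-order Taylor approximation of f around x0.
    return f(x0) + df(x0) * (x - x0)

x0 = 0.0
near_error = abs(f(0.1) - linearized(0.1, x0))  # input close to x0
far_error = abs(f(5.0) - linearized(5.0, x0))   # input far from x0

print(f"error near x0: {near_error:.6f}")
print(f"error far from x0: {far_error:.6f}")
```

In this toy case the linear description is accurate close to `x0` and badly wrong far from it. Whether the same thing happens for the method in the post is exactly what I am asking.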

(I am “shooting from the hip” epistemically)