Hey! Late to the party but this is *really* cool.
A quick question: is there a reason to use CLIP embeddings as the SAE input instead of the images themselves? I understand that the goal is to understand CLIP's inner workings, but I'm curious whether you have any intuitions on whether feeding in images directly would work as well.