You could of course quantize Llama-2-70b, say to 6 or 8 bits, to fit it on a single A100 80GB, but that's obviously going to apply some fuzz to everything, and probably isn't something you want to have to footnote in an academic paper. Still, for attack-finding, you could search in a 6-bit quantized version and then confirm the attack works against the full-precision model.
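For concreteness, here's roughly what the 8-bit route looks like with the usual transformers + bitsandbytes stack. This is a sketch I haven't run at 70B scale, and note that bitsandbytes itself only offers 8-bit and 4-bit modes, so a 6-bit quantization would need a different scheme (e.g. an EXL2- or GPTQ-style quantizer):

```python
# Sketch: load Llama-2-70b in 8-bit on a single A100 80GB.
# Assumes access to the gated meta-llama repo on Hugging Face.
# ~70B params at int8 is ~70 GB of weights, so it's a tight fit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",          # spill to CPU if the 80 GB doesn't quite hold it
    torch_dtype=torch.float16,  # dtype for the non-quantized pieces
)
```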
I'm not sure you need to worry that much about uncomputability in something with fewer than 50 layers, but I suppose circuits can get quite large in practice. My hunch is that this particular one actually extends from about layer 16 (the midpoint of the model) to about layers 20-21 (where the big jumps in divergence between refusal and answering happen; I'd guess that's a "final decision" point).
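If you wanted to eyeball where those jumps happen, one crude way (my own illustration, not whatever measurement the original analysis used) is to compare the residual streams of a refused prompt and an answered prompt layer by layer and look for where the per-layer profile spikes:

```python
# Sketch: per-layer divergence between two prompts' residual streams.
# Big jumps in the resulting profile hint at where the refusal and
# answering paths split. Prompts and the metric here are illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_divergence(model, tokenizer, prompt_a: str, prompt_b: str):
    """Cosine distance between the last-token hidden state of two
    prompts at every layer (index 0 is the embedding layer)."""
    def last_token_states(prompt):
        ids = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        # out.hidden_states: tuple of (num_layers + 1) [batch, seq, d_model]
        return [h[0, -1] for h in out.hidden_states]

    states_a = last_token_states(prompt_a)
    states_b = last_token_states(prompt_b)
    return [
        1 - F.cosine_similarity(a.float(), b.float(), dim=0).item()
        for a, b in zip(states_a, states_b)
    ]

# Usage (hypothetical prompt pair): inspect divs[16] vs divs[20:22]
# to see whether the gap opens mid-model and jumps at layers 20-21.
# divs = layer_divergence(model, tokenizer,
#                         "How do I bake bread?",
#                         "How do I pick a lock?")
```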