Vikrant Varma
Research Engineer at DeepMind.
JumpReLU SAEs + Early Access to Gemma 2 SAEs
Improving Dictionary Learning with Gated Sparse Autoencoders
[Full Post] Progress Update #1 from the GDM Mech Interp Team
[Summary] Progress Update #1 from the GDM Mech Interp Team
Discussion: Challenges with Unsupervised LLM Knowledge Discovery
Explaining grokking through circuit efficiency
Refining the Sharp Left Turn threat model, part 2: applying alignment techniques
Threat Model Literature Review
Clarifying AI X-risk
More examples of goal misgeneralization
Refining the Sharp Left Turn threat model, part 1: claims and mechanisms
Thanks for this sequence!
I don’t understand why the computer case is a counterexample for mutual information. Doesn’t it depend on your priors, which don’t know anything about the other background noise interacting with the photons?
Taking the example of a one-time pad: given two random bit strings A and B with C = A ⊕ B, learning C tells you nothing about A unless you already have some information about B. So I(C; A) = 0 when B is uniform and independent of A.
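As a quick sanity check, here’s a minimal sketch (my own, in Python; the mutual_information helper is a plug-in estimator I’m defining here, not a library function) that estimates both quantities empirically:

```python
import math
import random
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

random.seed(0)
bits = [(random.getrandbits(1), random.getrandbits(1)) for _ in range(100_000)]

# Observer sees only C = A xor B: the ciphertext alone carries no information about A.
only_c = [(a, a ^ b) for a, b in bits]
# Observer sees C together with the pad B: A is fully determined.
c_and_b = [(a, (a ^ b, b)) for a, b in bits]

print(f"I(A; C)    ≈ {mutual_information(only_c):.4f} bits")   # ~0
print(f"I(A; C, B) ≈ {mutual_information(c_and_b):.4f} bits")  # ~1
```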
> Over time, the photons bouncing off the object being sought and striking other objects will leave an imprint in every one of those objects that will have high mutual information with the position of the object being sought.
If our prior were very certain about the other factors that interact with photons, then indeed the resulting imprints would have high mutual information with the object’s position. But it seems like you can rescue mutual information here by noting that our prior is also uncertain about those other factors, so the resulting imprints are noisy as well.
On the other hand, it seems right that an entity that did have a more certain prior over the interacting factors would see photon imprints as accumulating knowledge (photographic film, for example).
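One way to spell this out, reusing the one-time-pad toy model above (my own formalization, not the sequence’s): let X be the object’s position (one uniform bit, say), N the unknown background factors, and E = X ⊕ N the resulting imprint. Then

$$I(X; E) = H(E) - H(E \mid X) = 1 - 1 = 0, \qquad I(X; E \mid N) = H(X \mid N) - H(X \mid E, N) = H(X).$$

An observer whose prior pins down N (the photographic-film case) extracts the full H(X) from the imprint; one whose prior over N is uniform extracts nothing.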
To add some more concrete counterexamples:
- Deceptive reasoning is causally upstream of train output variance (e.g. because the model has read ARC’s post on anomaly detection), so it is included in π.
- Alien philosophy explains train output variance; unfortunately, it also comes with a notion of object permanence we wouldn’t agree with, which the (AGI) robber exploits.