Epistemic status: sleep-deprived musings
If I understand this right, this is starting to sound very testable.
Feed a neural network inputs consisting of variables [x1, …, xn]: configurations of a 2D Ising model, cat pictures, or anything else we humans think we know the latent variables for.
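For the Ising case specifically, a minimal sketch of generating such a dataset might look like this (lattice size, temperature, and sweep count are illustrative choices of mine, not anything prescribed above):

```python
# Hypothetical data generation for the 2D Ising example: Metropolis sampling of
# spin configurations, flattened into vectors [x1, ..., xn] with n = L * L.
import numpy as np

def sample_ising(L=16, T=2.5, n_sweeps=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    spins = rng.choice([-1, 1], size=(L, L))
    for _ in range(n_sweeps):
        for _ in range(L * L):
            i, j = rng.integers(0, L, size=2)
            # Energy change from flipping spin (i, j), with periodic boundaries.
            nb = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                  + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            dE = 2 * spins[i, j] * nb
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                spins[i, j] *= -1
    return spins

dataset = np.stack([sample_ising().ravel() for _ in range(100)])
```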
Train neural networks to output a set of variables Λ over the inputs. The loss function scores how well conditioning on Λ makes the inputs independent of each other across the training data set.
E.g., take the KL divergence between P(X1,…,Xn|Λ) and P(X1|Λ)P(X2|Λ)…P(Xn|Λ). Then penalise Λ for having higher information content through a regularisation term, e.g. the KL divergence between P(X1,…,Xn|Λ) and P(X1,…,Xn).[1]
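To make that concrete: for a small discrete system (binary xi and a discretised Λ), both KL terms can be computed directly from empirical counts. A rough sketch of the score, with an illustrative weight β on the regularisation term (all names here are mine, and a real training loss would need a differentiable or sampled estimator rather than this exhaustive enumeration):

```python
# Hypothetical scoring function for the proposed loss, restricted to a handful of
# binary variables so the joint distribution over states can be enumerated exactly.
import numpy as np
from itertools import product

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def ci_score(X, lam, beta=0.1):
    """X: (N, n) binary {0,1} array; lam: (N,) discrete latent labels (e.g. binned encoder outputs).
    Returns E_Λ[ KL(P(X|Λ) || Π_i P(Xi|Λ)) ] + beta * E_Λ[ KL(P(X|Λ) || P(X)) ]."""
    N, n = X.shape
    states = np.array(list(product([0, 1], repeat=n)))   # all 2^n joint states

    def joint_dist(rows):
        counts = np.array([(rows == s).all(axis=1).sum() for s in states])
        return counts / len(rows)

    p_x = joint_dist(X)                                   # unconditional P(X)
    score = 0.0
    for z in np.unique(lam):
        rows = X[lam == z]
        w = len(rows) / N                                 # P(Λ = z)
        p_joint = joint_dist(rows)                        # P(X | Λ = z)
        marg = rows.mean(axis=0)                          # P(Xi = 1 | Λ = z)
        p_prod = np.prod(np.where(states == 1, marg, 1 - marg), axis=1)
        score += w * (kl(p_joint, p_prod) + beta * kl(p_joint, p_x))
    return score
```

(As a side note, the regularisation term averaged over Λ is just the mutual information between X and Λ, which is one way of reading "penalise information content".)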
Then you can check whether the solutions found match the ones other networks, or humans and human science, would give for that system, either by comparing P(X1,…,Xn|Λ) or by looking at Λ directly.
You can also train a second network to reconstruct [x1, …, xn] from the latents and see what comes out.
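A minimal PyTorch sketch of that check, assuming an already-trained encoder that outputs Λ (all layer sizes are placeholders):

```python
# Hypothetical reconstruction check: freeze the latent encoder and train a separate
# decoder to rebuild [x1, ..., xn] from Λ.
import torch
import torch.nn as nn

n, d_latent = 256, 4                      # placeholder sizes
encoder = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, d_latent))  # assumed trained
decoder = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(), nn.Linear(64, n))

opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def decoder_step(x_batch):
    with torch.no_grad():                 # latents come from the frozen encoder
        latents = encoder(x_batch)
    recon = decoder(latents)
    loss = loss_fn(recon, x_batch)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```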
You might also be able to take a network stitched together from the latent-generation network and a latent read-out network, and see how well it does on various tasks over the dataset: image labelling, calculating the topological charge of field configurations, etc. Then compare that to a generic network trained to solve these tasks.
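Reusing the encoder and placeholder sizes from the sketch above, and with a regression task standing in for whatever labels the dataset actually has, the comparison might look like:

```python
# Hypothetical stitching experiment: a small read-out head on top of the frozen
# latent encoder, versus a generic network trained end to end on the same task.
import torch
import torch.nn as nn

head = nn.Linear(d_latent, 1)             # trained on top of the frozen latents
baseline = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, 1))

def stitched(x):
    return head(encoder(x).detach())      # gradients stop at the encoder

def compare(x, y, loss_fn=nn.MSELoss()):
    with torch.no_grad():
        return loss_fn(stitched(x), y).item(), loss_fn(baseline(x), y).item()
```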
If the hypothesis holds strongly, sufficiently general networks go through the same process when they propose solutions, just with less transparency. So you’d expect the stitched and conventional networks to score similarly.
My personal prediction would be that you’d usually need to require solving a few different tasks on the data for that to occur; otherwise the network doesn’t need to understand all the abstractions in the system to get the answer, and can get away with learning fewer latents.
I think we kind of informally do a lot of this already when we train an image classifier and then Anthropic opens up the network to see e.g. the dog-head-detection function/variable in it. But this seems to me like a much cleaner, more robust and well-defined recipe for finding latents, one that might be rote-implementable for any system or problem.
Unless this is already a thing in Interpretability code bases and I don’t know about it?
I haven’t checked the runtime on this yet; you’d need to cycle through the whole dataset once per loss-function call to get the distributions. But it’s probably at least doable for smaller problems, and for bigger datasets stochastic sampling ought to be enough.
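The stochastic version could be as simple as scoring a random subsample per call, reusing the ci_score sketch from above (batch size is arbitrary):

```python
# Hypothetical minibatch estimate: score a random subsample instead of the full dataset.
import numpy as np

def stochastic_ci_score(X, lam, batch_size=512, beta=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(X), size=min(batch_size, len(X)), replace=False)
    return ci_score(X[idx], lam[idx], beta=beta)
```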