I mean, personally I’d say it’s the only hope we have of making any of the reflection algorithms have any shot at working. you can’t do formal verification unless your network is at least interpretable enough that, when formal verification fails, you can work out what it was about the dataset that made the network learn something which forces your non-neural prover to run out of compute time when you ask what the lowest margin to a misbehavior is. if the network’s innards aren’t readable enough to get an intuitive sense of why a subpath failed the verification, or what computation the network performed that timed out the verifier, it’s hard to know how to move the data around to clarify.
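to make the kind of query I mean a bit more concrete: here’s a minimal sketch (entirely hypothetical setup, not any particular verifier’s API) where “misbehavior” is a scalar output crossing a threshold, and a crude interval-bound-propagation pass tries to lower-bound the margin to that threshold over an input box. a real verifier (SMT/MILP/bound-propagation tooling) is far more precise; this only shows the shape of the question you’d be asking.

```python
import numpy as np

# hypothetical toy: "misbehavior" = scalar output of a small ReLU net exceeding a
# threshold; we lower-bound the margin (threshold - output) over an input box
# with naive interval bound propagation. purely illustrative, not a real prover.

def interval_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through x -> W @ x + b."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def margin_lower_bound(weights, biases, x_lo, x_hi, threshold):
    """Lower bound on (threshold - output) over the box; <= 0 means 'cannot certify'."""
    lo, hi = x_lo, x_hi
    for i, (W, b) in enumerate(zip(weights, biases)):
        lo, hi = interval_affine(lo, hi, W, b)
        if i < len(weights) - 1:           # ReLU on hidden layers only
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
    return threshold - hi[0]               # worst-case output vs. threshold

# toy usage: random 2-layer net, unit input box, misbehavior = output > 5
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 4)) * 0.3, rng.normal(size=(1, 8)) * 0.3]
bs = [np.zeros(8), np.zeros(1)]
print(margin_lower_bound(Ws, bs, -np.ones(4), np.ones(4), threshold=5.0))
```

the point of the interpretability demand is that when a query like this blows up or returns a vacuous bound, you want to see *which* learned computation made the bounds loose, so you know what data to change.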
of course, this doesn’t help that much if it turns out that a strong planner can amplify a very slight value misalignment quickly, as expected by miri folks; afaict, miri is worried that the process of learning (their word is “self-improving”) can speed up a huge amount once the network can make use of the full self-rewriting possibility of its substrate and properly understands the information geometry of program updates (ie, afaict, they expect significant generalizing improvements to the architecture or learning rule or suchlike once it’s strong enough to become a strong quine as an incidental step of doing its core task.)
and so interpretability would be expected to be made useless by the ai breaking into your tensorflow to edit its own compute graph, or into your pytorch to edit its own matmul invocation order, or something. presumably that doesn’t happen at the expected level until you have an ai strong enough to significantly exceed the generalization performance of current architecture search incidentally, without being aimed at that, because the ais they’re imagining wouldn’t even have been trained on that specifically the way eg alphatensor was narrowly aimed at matmul itself.
wow, this really got me on a train of thought; I’m going to post more rambling to my shortform.