Progress Report 6: get the tool working

Prev: Progress Report 5

[Edit: Delayed due to catching a bad case of covid while travelling. Back at it now 2022-06-27]

After a bit of a hiatus while vacationing, I’m now back at it. I gave a presentation on my first 3 months of work to the EleutherAI interpretability reading group. My main takeaway, both from the self-reflection it inspired and from the attendees’ feedback (if you’re reading this, thank you!), was that my next step should be to get a minimal viable version of my interpretability tool working and try it out on the harmful/harmless models I’ve made.

Get something working, note its shortcomings, then iteratively improve it. I think I’m not too far off. My aim is to reach this minimum viable product over the next 5 days.


The abstract goal I’m aiming for with this tool is to be able to identify which ‘sub-models’ matching certain ‘natural abstractions’ are present in a generic neural network.

I find it helpful to go back and reread documents which gave me a sense of ‘aha’ or ‘ohhhh, now I see’ when initially reading them, the feeling of my concept-space or worldview expanding a bit. When rereading, I get a feeling of ‘oh yes, that’s where that intuition I’ve been leaning on came from. This is where I started down the path of that idea.’ In that vein, I was rereading Gwern’s piece on GPT-3 and the scaling hypothesis today.

Here’s a quote relevant to my current work that I’d like to share:

Big models work because they encode a dizzyingly vast number of sub-models in an extremely high-dimensional abstract space, representing countless small sub-models (Orseau et al 2020) interpolating over data, one of which is likely to solve the problem well, and so ensures the problem is soluble by the overall model. They function as an ensemble: even though there [are] countless overfit sub-models inside the single big model, they all average out, leading to a preference for simple solutions. This Occam’s razor biases the model towards simple solutions which are flexible enough to gradually expand in complexity to match the data.

So, the larger the model, the better, if there is enough data & compute to push it past the easy convenient sub-models and into the sub-models which express desirable traits like generalizing, factorizing perception into meaningful latent dimensions, meta-learning tasks based on descriptions, learning causal reasoning & logic, and so on. If the ingredients are there, it’s going to happen.

- Gwern https://www.gwern.net/Scaling-hypothesis

Ideally, when asking for a critical decision from a model, we’d be able to step through each piece of logic in that decision and identify the sub-models being used. At each point we could ask, ‘is the deception sub-model being used? the power-seeking sub-model? or just the scientific knowledge sub-model?’ We could then rerun the critical prediction with select sub-models ‘ablated’ and see how that changes the output.
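
To make the ‘rerun with select sub-models ablated’ idea concrete, here is a toy sketch. It is not my actual tool. It assumes, purely for illustration, that each sub-model corresponds to a known set of hidden units in a tiny two-layer network; the weights, the `SUB_MODELS` labelling, and the function names are all hypothetical. Finding which units belong to which sub-model is the hard part, and this sketch skips it entirely.

```python
import math

def forward(x, w1, w2, ablate=frozenset()):
    """Tiny two-layer tanh network. Hidden units whose index is in
    `ablate` are zeroed before the output layer."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    hidden = [0.0 if i in ablate else h for i, h in enumerate(hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

# Hypothetical toy weights: 3 inputs -> 4 hidden units -> 2 outputs.
W1 = [[0.5, -0.2, 0.1],
      [0.3, 0.8, -0.5],
      [-0.7, 0.1, 0.4],
      [0.2, 0.2, 0.2]]
W2 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 0.5, -0.5, 1.0]]

# Hypothetical labelling of hidden units as 'sub-models' (an assumption;
# in a real network this mapping is what the tool would have to discover).
SUB_MODELS = {"deception": {1}, "power_seeking": {2}, "science": {0, 3}}

def ablation_effects(x):
    """Rerun the prediction once per sub-model with that sub-model ablated,
    and report how each output moves relative to the unablated baseline."""
    baseline = forward(x, W1, W2)
    effects = {}
    for name, units in SUB_MODELS.items():
        ablated = forward(x, W1, W2, ablate=units)
        effects[name] = [a - b for a, b in zip(ablated, baseline)]
    return baseline, effects

baseline, effects = ablation_effects([1.0, 0.5, -1.0])
for name, delta in effects.items():
    print(name, [round(d, 3) for d in delta])
```

A large change in the output when a sub-model is ablated suggests the prediction was relying on it; a near-zero change suggests it was not in use for that input.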

What I’m currently able to do is a long way from that, but it’s good to keep in mind what I’m aiming for.