Well done and thank you! I don’t feel qualified to judge exactly but this seems like a significant step forward. Curious to hear your thoughts on the question of “by what year will [insert milestone X] be achieved assuming research progress continues on-trend.” Some milestones perhaps are in this tech tree https://www.lesswrong.com/posts/nbq2bWLcYmSGup9aF/a-transparency-and-interpretability-tech-tree but the one I’m most interested in is the “we have tools which can tell whether a model is scheming or otherwise egregiously misaligned, though if we trained against those tools they’d stop working” milestone.
Thanks for your kind words!
My views on interpretability are complicated by the fact that I think it's quite probable there will be a paradigm shift between current AI and whatever actually becomes AGI, say 10 years from now. So I'll first sketch what I think within-paradigm interp looks like, and then what it might imply for AGI 10 years later. (All these numbers are very low confidence and basically made up.)
I think the autoencoder research agenda is currently making significant progress on item #1. The main research bottlenecks here are (a) SAEs might not be able to efficiently capture every kind of information we care about (e.g. circular features) and (b) residual stream autoencoders are not exactly the right thing for finding circuits. Probably this stuff will take a year or two to really hammer out. Hopefully our paper helps here by giving a recipe to push autoencoders really quickly, so we bump into the limitations faster and with less second-guessing about autoencoder quality.
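For readers less familiar with the setup, here is a minimal sketch of what a sparse autoencoder over residual-stream activations looks like; the dimensions, initialization, and L1 coefficient below are made-up illustrative choices, not values from the paper:

```python
import numpy as np

# Minimal sketch of a sparse autoencoder (SAE) forward pass, as used to
# decompose model activations into sparse features. All shapes and
# hyperparameters here are illustrative, not from any particular paper.

rng = np.random.default_rng(0)

d_model = 64    # width of the residual stream being decomposed
d_feats = 512   # overcomplete dictionary of candidate features

W_enc = rng.normal(0.0, 0.02, (d_feats, d_model))
b_enc = np.zeros(d_feats)
W_dec = rng.normal(0.0, 0.02, (d_model, d_feats))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """Encode activation x into sparse features, reconstruct, and score."""
    f = np.maximum(0.0, W_enc @ x + b_enc)        # ReLU gives nonnegative, sparse-ish codes
    x_hat = W_dec @ f + b_dec                     # linear reconstruction from features
    recon_loss = np.sum((x - x_hat) ** 2)         # squared reconstruction error
    sparsity_loss = l1_coeff * np.sum(np.abs(f))  # L1 penalty pushes most features to 0
    return f, x_hat, recon_loss + sparsity_loss

x = rng.normal(size=d_model)        # a fake residual-stream activation
f, x_hat, loss = sae_forward(x)
print(f.shape, x_hat.shape)         # (512,) (64,)
```

The "circular features" worry in (a) is about exactly this architecture: a dictionary of independent linear directions with a ReLU may be an awkward basis for information that lives on a circle (e.g. positional or periodic structure), even if reconstruction loss looks fine.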
Hopefully #4 can be done in large part in parallel with #1; there's a whole bunch of engineering needed to e.g. take autoencoders and scale them up to capture all the behavior of the model (which was also a big part of the contribution of this paper). I'm pretty optimistic that if we have a recipe for #1 that we trust, the engineering (and efficiency improvements) for scaling up is doable. Maybe this adds another year of serial time. The big research uncertainty here, from my point of view, is how hard it is to actually identify the structures we're looking for, because we'll probably have a tremendously large sparse network where each node does some really boring tiny thing.
However, I mostly expect that GPT-4 (and probably GPT-5) is just not actually doing anything super spicy/stabby. So I think most of the value of doing this interpretability will be to pull back the veil, so to speak, on how these models are doing all the impressive stuff. Some theories of impact:
- Maybe we'll become less confused about the nature of intelligence in a way that just gives us better takes about alignment (e.g. many mechanistic theories of what the heck GPT-4 is doing will have been conclusively ruled out).
- Maybe once the paradigm shift happens, we will be better prepared to identify exactly which interpretability assumptions it broke (or even just to notice whether some change is causing a mechanistic paradigm shift).
Unclear what timeline these later things happen on; it probably depends a lot on when the paradigm shift(s) happen.
To add some more concreteness: suppose we open up the model and find that it's basically just a giant k-nearest-neighbors lookup (it obviously can't literally be this, but it's the easiest analogy to describe). This would explain why current alignment techniques work, and would dissolve some of the mystery of generalization. Then suppose we create AGI and find that it does something very different internally, something more deeply entangled that we can't really make sense of because it's too complicated. This would imo provide strong evidence that we should expect our alignment techniques to break.
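To make the analogy concrete, here is a toy sketch of what "prediction is just a lookup over memorized examples" means; the data, dimensions, and choice of k are entirely made up:

```python
import numpy as np

# Toy illustration of the "giant k-nearest-neighbors" analogy: the model's
# "prediction" is just a majority vote over memorized (input, label) pairs.
# All data here is randomly generated for illustration.

rng = np.random.default_rng(0)
train_x = rng.normal(size=(100, 8))        # memorized training inputs
train_y = rng.integers(0, 2, size=100)     # their (binary) labels

def knn_predict(query, k=5):
    """Label a query by majority vote among its k closest stored examples."""
    dists = np.linalg.norm(train_x - query, axis=1)  # distance to every memory
    nearest = np.argsort(dists)[:k]                  # indices of k closest
    return int(np.bincount(train_y[nearest]).argmax())

pred = knn_predict(rng.normal(size=8))
```

The interpretability-relevant property is that every prediction decomposes transparently into "which stored examples were retrieved and how they voted"; the worry in the AGI case is an internal mechanism with no such clean decomposition.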
In other words, a load-bearing assumption is that current models are fundamentally simple/modular in some sense that makes interpretability feasible, and that observing this assumption break in the future would be important evidence, hopefully arriving before those future systems actually kill everyone.