Some of Nate’s quick thoughts (paraphrased), after chatting with him:
Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-mechanistically-explain a typical system humans have engineered.)
Nobody’s been able to call the specific capabilities of systems in advance. Nobody’s been able to call the specific exploits in advance. Nobody’s been able to build better cognitive algorithms by hand after understanding how the AI does things we can’t yet code by hand. There is clearly some further level of understanding that is possible, that we lack, that we once sought, and that only the interpretability folks continue to seek.
E.g., think of that time Neel Nanda figured out how a small transformer does modular arithmetic (AXRP episode). If nobody had ever thought of that algorithm for an adder, we would have thereby learned a new algorithm for an adder. There are things that these AI systems are doing that aren’t just lots of stuff we know; there are levels of organization of understanding that give you the ability to predict how things work outside of the bands where we’ve observed them.
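To make the example concrete: the reverse-engineered network was found to do modular addition via a rotation trick, representing each number as an angle on the unit circle so that composing rotations adds the numbers mod p. Here's a minimal sketch of that idea (the function name and the cosine-scoring readout are illustrative simplifications, not the network's actual weights-level implementation):

```python
import math

def modadd_via_rotations(a, b, p, k=1):
    # Embed each input as a rotation angle on the unit circle
    # (frequency k; the real network uses several frequencies).
    theta_a = 2 * math.pi * k * a / p
    theta_b = 2 * math.pi * k * b / p
    # Composing the rotations adds the angles, which encodes (a + b) mod p.
    theta = theta_a + theta_b
    # Score each candidate c by how closely its angle matches; the argmax
    # recovers (a + b) mod p, loosely analogous to the logit readout.
    scores = [math.cos(theta - 2 * math.pi * k * c / p) for c in range(p)]
    return max(range(p), key=lambda c: scores[c])
```

The point of the example: nothing in this algorithm was handed to the network; it was discovered only by interpreting the trained weights, which is what "learning a new adder" from a model would look like.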
It seems trendy to declare that those deeper levels of understanding never existed in the first place and that seeking them is all ivory-tower stuff, but Nate thinks this point of view is missing a pretty important and central thread.
The missing thread isn’t trivial to put into words, but it includes things like:
This sounds like the same sort of thing some people would say if they were staring at computer binary for the first time and didn’t know about the code behind the scenes: “We have plenty of understanding beyond just how the CPU handles instructions; we understand how memory caching works and we have recognized patterns like the stack and the heap; talking as if there’s some deeper level of organization is talking like a theorist when in fact this is an engineering problem.” Those types of understanding aren’t false, but they aren’t the sort of understanding of someone who has comprehended the codebase they’re looking at.
There are, predictably, things to learn here; the messiness and complexity of the real world doesn’t mean we already know the relevant principles. You don’t need to understand everything about how a bird works in order to build an airplane; there are compressible principles behind how birds fly; if you understand what’s going on you can build flying devices that have significantly more carrying capacity than a bird, and this holds true even if the practical engineering of an airplane requires a bunch of trial and error and messy engineering work.
A mind’s causal structure is allowed to be complicated; we can see the weights, but we don’t thereby have mastery of the high-level patterns. In the case of humans, neuroscience hasn’t yet given us mastery of the high-level patterns the human brain is implementing, despite full physical access to neurons.
Mystery is in the map, not in the territory; reductionism works. Not all sciences that can exist, already exist today.
Possibly the above pointers are only useful if you already grok the point we’re trying to make, and aren’t so useful for communicating a new idea; but perhaps not.