Alignment is not all you need. But that doesn’t mean you don’t need alignment.
One of the fairytales I remember reading from my childhood is "The Three Sillies". The story is about a farmer encountering three episodes of human silliness, but it's set inside one more frame story of silliness: his wife is despondent because there is an axe hanging in their cottage, and she thinks that if they have a son, he will walk underneath the axe and it will fall on his head.
The frame story was much more memorable to me than any of the “body” stories, and I randomly remember this story much more often than any other fairytale I read at the age I read fairytales. I think the reason for this is that the “hanging axe” worry is a vibe very familiar from my family and friend circle, and more generally a particular kind of intellectual neuroticism that I encounter all the time, that is terrified of incomplete control or understanding.
I really like the rationalist/EA ecosphere because of its emphasis on the solvability of problems like this: noticing situations where you can just approach the problem, taking down the axe. However, a baseline of intellectual neuroticism persists (after all you wouldn’t expect otherwise from a group of people who pull smoke alarms on pandemics and existential threats that others don’t notice). Sometimes it’s harmless or even beneficial. But a kind of neuroticism in the community that bothers me, and seems counterproductive, is a certain “do it perfectly or you’re screwed” perfectionism that pervades a lot of discussions. (This is also familiar to me from my time as a mathematician: I’ve had discussions with very intelligent and pragmatic friends who rejected even the most basic experimentally confirmed facts of physics because “they aren’t rigorously proven”.)
A particular train of discussion that annoyed me in this vein was the series of responses to Raemon’s “preplanning and baba is you” post. The initial post I think makes a nice point—it suggests as an experiment trying to solve levels of a move-based logic game by pre-planning every step in advance, and points out that this is hard. Various people tried this experiment and found that it’s hard. This was presented as an issue in solving alignment, in worlds where “we get one shot”. But what annoyed me was the takeaway.
I think a lot of what's great about the intellectual vibe in the (extended) LW and EA communities is the sense that "you have more ways to solve problems than you think". However, there is a particular, virtue-signaling class of problems where trying to find shortcuts or alternatives is frowned upon and the only accepted approach is "trying harder" (another generalized intellectual current in the LW-osphere that I strongly dislike).
Back to the "Baba is you" experiment. The best takeaway, I think, is that we should avoid situations where we need to solve complex problems in one shot, and we should work towards making sure this situation never arises (and we should just give up on trying to make progress in worlds where we get absolutely no new insights before the do-or-die step of making AGI). Solving a complex problem in one shot, at least without superhuman assistance, is basically impossible. Attempts at it tend to be not only silly but counterproductive: the "graveyard" of failed idealistic movements is chock-full of wannabe Hari Seldons who believed they had found the "perfect solution", and were willing to sacrifice everything to realize their grand vision.
This doesn’t mean we need to give up, or only work on unambitious, practical applications. But it does mean that we have to admit that things can be useful to work on in expectation before we have a “complete story for how they save the world”.
Note that what is being advocated here is not an “anything goes” mentality. I certainly think that AI safety research can be too abstract, too removed from any realistic application in any world. But there is a large spectrum of possibilities between “fully plan how you will solve a complex logic game before trying anything” and “make random jerky moves because they ‘feel right’”.
I'm writing this in response to Adam Jones' article on AI safety content. I like a lot of the suggestions. But I think the section on alignment plans suffers from the "axe" fallacy that I claim is somewhat endemic here. Here's the relevant quote:
For the last few weeks, I’ve been working on trying to find plans for AI safety. They should cover the whole problem, including the major hurdles after intent alignment. Unfortunately, this has not gone well—my rough conclusion is that there aren’t any very clear and well publicised plans (or even very plausible stories) for making this go well. (More context on some of this work can be found in BlueDot Impact’s AI safety strategist job posting).
(emphasis mine).
I strongly disagree with this being a good thing to do!
We're not going to have a good, end-to-end plan about how to save the world from AGI. Even now, with ever more impressive and scary AIs becoming commonplace, we have very little idea about what AGI will look like, what kinds of misalignment it will have, or where the hard bits of checking it for intent and value alignment will be. Trying to make extensive end-to-end plans can be useful, but it can also lead to a strong streetlight effect: we'll be overcommitting to current understanding and current frames of thought (in an alignment community that is growing and integrating new ideas at an exponential rate measured in months, not years).
Don’t get me wrong. I think it’s valuable to try to plan things where our current understanding is likely to at least partially persist: how AI will interface with government, general questions of scaling and rough models of future development. But we should also understand that our map has lots of blanks, especially when we get down to thinking about what we will understand in the future. What kinds of worrying behaviors will turn out to be relevant and which ones will be silly in retrospect? What kinds of guarantees and theoretical foundations will our understanding of AI encompass? We really don’t know, and trying to chart a course through only the parts of the map that are currently filled out is an extremely limited way of looking at things.
So instead of trying to solve the alignment problem end to end, what I think we should be doing is:
getting a variety of good, rough frames on how the future of AI might go
thinking about how these will integrate with human systems like government, industry, etc.
understanding more things, to build better models in the future.
I think the last point is crucial, and should be what modern alignment and interpretability is focused on. We really do understand a lot more about AI than we did a few years ago (I'm planning a post on this). And we'll understand more still. But we don't know what this understanding will be. We don't know how it will integrate with existing and emergent actors and incentives. So instead of trying to one-shot the game and write an ab initio plan for how work on quantifying creativity in generative vision models will lead to the world being saved, I think there is a lot of room to just do good research. Fill in the blank patches on that map before charting a definitive course on it. Sure, maybe don't waste time on the patches in the far corners which are too abstract or speculative or involve too much backchaining. But also don't try to predict all the axes that will be on the wall in the future before looking more carefully at a specific, potentially interesting, axe.
FYI, I think by the time I wrote Optimistic Assumptions, Longterm Planning, and "Cope", I had updated on the things you criticize about it here (but I had started writing it a while ago from a different frame, and there is something disjointed about it)
But, like, I did mean both halves of this seriously:
I think you should be scared about this, if you’re the sort of theoretic researcher, who’s trying to cut at the hardest parts of the alignment problem (whose feedback loops are weak or nonexistent)
I think you should be scared about this, if you’re the sort of Prosaic ML researcher who does have a bunch of tempting feedback loops for current generation ML, but a) it’s really not clear whether or how those apply to aligning superintelligent agents, b) many of those feedback loops also basically translate into enhancing AI capabilities and moving us toward a more dangerous world.
...
Re:
For the last few weeks, I’ve been working on trying to find plans for AI safety. They should cover the whole problem, including the major hurdles after intent alignment.
I strongly disagree with this being a good thing to do! We’re not going to have a good, end-to-end plan about how to save the world from AGI.
I think in some sense I agree with you – the actual real plans won’t be end-to-end. And I think I agree with you about some kind of neuroticism that unhelpfully bleeds through a lot of rationalist work. (Maybe in particular: actual real solutions to things tend to be a lot messier than the beautiful code/math/coordination-frameworks an autistic idealist dreams up)
But, there’s still something like “plans are worthless, but planning is essential.” I think you should aim for the standard of “you have a clear story for how your plan fits into something that solves the hard parts of the problem.” (or, we need way more people doing that sort of thing, since most people aren’t really doing it at all)
Some ways that I think about End to End Planning (and, metastrategy more generally)
Because there are multiple failure modes, I treat myself as having multiple constraints I have to satisfy:
My plans should backchain from solving the key problems I think we ultimately need to solve
My plans should forward chain through tractable near goals with at least okay-ish feedback loops. (If the okay-ish feedback loops don't exist yet, try to be inventing them. Although don't follow that off a cliff either – I was intrigued by Wentworth's recent note that overly focusing on feedback loops led him to predictably waste some time)
Ship something to external people, fairly regularly
Be Wholesome (that is to say, when I look at the whole of what I’m doing it feels healthy, not like I’ve accidentally min-maxed my way into some brittle extreme corner of optimization space)
And for end to end planning, have a plan for...
up through the end of my current OODA loop
(maybe up through a second OODA loop if I have a strong guess for how the first OODA loop goes)
as concrete a plan as I can, assuming no major updates from the first OODA loop, up through the end of the agenda.
as concrete a visualization of the followup steps after my plan ends, for how it goes on to positively impact the world.
End to End plans don't mean you don't need to find better feedback loops or pivot. You should plan that into the plan (and also expect to be surprised about it anyway). But, I think if you don't concretely visualize how it fits together you're likely to go down some predictably wasteful paths.
I think the reason for this is that the “hanging axe” worry is a vibe very familiar from my family and friend circle, and more generally a particular kind of intellectual neuroticism that I encounter all the time, that is terrified of incomplete control or understanding.
Something like this is a big reason why I'm not a fan of MIRI, because I think this sort of neuroticism is somewhat encouraged by that group.
Also, remember that the current LW community is selected for scrupulosity and neuroticism, which IMO is not that good for solving a lot of problems. Richard Ngo and John Maxwell illustrate it here:
https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown#g4BJEqLdvzgsjngX2
https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown#3GqvMFTNdCqgfRNKZ
We have in fact reverse engineered alien datastructures.
I'm trying to keep up a regular schedule of writing for my Nanowrimo project. I'm working on a post that's not done yet, so for today I'll write a quick (and not very high-effort) take related to discussions I've had with more doomy friends (I'm looking at you, Kaarel Hänni). A good (though incomplete) crux on why something like "faithful interpretability" may be hard (faithful here meaning "good enough to genuinely decompose enough of the internal thought process to notice and avoid deception") is contained in Tsvi's post on "alien datastructures". I think it's a great piece, though I probably agree with only about 50% of it. A very reductive summary of the core parable in the piece: we imagine we've found an ancient alien computer from a civilization with a totally separate parallel development of math and logic and computing. The protagonist of the koan lectures on why it might be hard to "do science" to reverse engineer what the computer does, and why it might be better to "do deep thinking" to try to understand what problems the computer is trying to solve, and then try to come up with the right abstractions. In the case of AI, the recommendation is that this should in particular be done by introspecting and trying to understand our own thinking at a high level. There's a great quote that I'll include because it sounds nice:
Go off and think well——morally, effectively, funly, cooperatively, creatively, agentically, truth-trackingly, understandingly——and observe this thinking——and investigate/modify/design this thinking——and derive principles of mind that explain the core workhorses of the impressive things we do including self-reprogramming, and that explain what determines our values and how we continue caring across ontology shifts, and that continue to apply across mental change and across the human-AGI gap; where those principles of mind are made of ideas that are revealed by the counterfactual structure of possible ways of thinking revealed by our interventions on our thinking, like how car parts make more sense after you take them out and replace them with other analogous parts.
I don’t want to debate the claims of the piece, since after all it’s a koan and is meant to be a parable and a vibe. But I think that as a parallel parable, it’s useful to know that humanity has actually reverse engineered a massively alien computer, and we did it extremely well, and via a combination of “science” (i.e., iteratively designing and testing and refining models) and high-level thinking, though with more of an emphasis on science.
The alien computer in this analogy is physics. It's hard to overstate how different physics is from what people in the past thought it would be. All our past ideas of reality have been challenged: waves aren't really waves, particles aren't really particles, time isn't really time, space isn't really (just) space. I think that a lot of people who have learned about math or physics, but don't have experience doing research in it, have a romantic idea that it's driven by "deep insight". That physics sits around waiting for the next Newton or Einstein to get a new idea, and then adapts to that idea, with the experimentalists there to confirm the great theoretician's theories, and the "non-great" rank and file there to compute out their consequences. And like sure, this kind of flow from theorists out is one thing that exists, but it's mostly wrong. Mostly the way we understand the wild, alien structure that is physics is through just, well, doing science, which involves people tweaking old paradigms, gathering data, finding things that don't make sense, and then folding the new information back into the big structure. The key point here (which physicists understand viscerally) is that there's not a big, concept-centric order dependence to discovery. You can understand quantum mechanics first, you can understand relativity first, you can start with waves or particles; in the end, if you are serious about refining and testing your intuitions, and you do calculations to get stuff that increasingly makes more sense, you are on the right track.
A fun example here comes from the discovery of Heisenberg's "matrix" model of quantum mechanics. One could imagine a romantic Eureka-moment picture of people suddenly grasping the indeterminacy of reality after thinking deeply about experiments that don't make sense. But the reality was that from Max Planck's first idea of quanta of energy in 1900 through 1925, people used measurements and arbitrary combinatorics that seemed to mostly agree with experiment to cobble together a weird theory of allowed and disallowed orbits (which the next generation called "old quantum theory") that everyone admitted didn't make sense and fell apart on deep consideration, but sorta worked. After a significant amount of progress was made in this way, Heisenberg and Sommerfeld took a two-parameter table of numbers that went into this theory, made a matrix out of it, and then (in my somewhat cartoonish understanding) noticed that this matrix could be used to measure some other phenomena better. Then Heisenberg realized that just by looking at the matrix, making it complex and viewing it as an evolution operator, you could get a unifying and more satisfying explanation of many of the disparate phenomena in this old theory; it would then take more time for people to develop the more modern concepts of measurement, decoherence, and many worlds (the last of which still meets resistance among the old guard today).
The point is that physics doesn't (only) grow through great insight. This might seem surprising to someone who grew up in a certain purist philosophy of logic that, I think incorrectly, has been cultivated on LessWrong: roughly, that ideas are either right, or useless. In fact, the experience of a physicist is that sloppy stopgap ideas that explain experiment or theory in a slightly more systematic way are often useful, though for reasons that will often elude the sloppy idea-haver (physics has a perennial problem of the developers of sloppy ideas getting overinvested in their "exact correctness", and being resistant to future more sensible systematization—but this is a different story). In physics ideas are deep, but they're weirdly un-path-dependent. There are many examples of people arriving at the same place by different sloppy and unsystematic routes (the Feynman path integral vs. other interpretations of quantum field theory being a great example). This uncanny ability to make progress by taking stopgap measures and then slowly refining them surprises physicists themselves, and I think that some very cool physics "ideology" comes from trying to make sense of this. Two meta-explanations that I think were in part a response to physicists trying to figure out "what the hell is going on", but led to physics that is used today (especially in the first case) are:
effective field theory and the renormalization group flow, coming from ideas of Wilson about how theories at one energy scale can be effectively replaced by simpler theories at other scales (and relating to ideas of Landau and others about complicated concrete theories self-correcting to converge to certain elegant but more abstract "universal" theories at larger scales)
the “bootstrap” idea that complicated theories with unknown moving parts have measurements that satisfy certain standard formulas and inequalities, and one can often get quite far (indeed sometimes, all the way) by ignoring the “physics” and only looking at the formulas as a formal system.
Both of these physics ideas identify a certain shape of “emergent structure”, where getting at something with even a very small amount of “orientational push” from reality, or from known sloppy structure, will tend to lead to the same “correct” theory at scales we care about.
This doesn’t mean that big ideas and deep thinking are useless. But I think that this does point (at least in physics) to taking more seriously ideas that “aren’t fully self-consistent yet” from a more demanding mathematical standpoint. We can play random fun measurement and calculation and systematization games, and tinker (“morally, effectively, funly, cooperatively, creatively, agentically, truth-trackingly, understandingly”) with systematizations of the results of these games, and we’re likely to get there in the end.
At least if the alien computer is physics.
For later. In my posts I’m hoping to talk in more detail about renormalization and the self-correcting nature of universality, and also about why I think that the complexity of (certain areas of) physics is a good match for the complexity of neural nets, and why I think the modern field of interpretability and “theoretical ML” more generally is much further along this game of tinkering and systematizing than many people think (in particular, in some ways we’re beyond the “Old Quantum Theory”-style mess). But this is it for now.
As Sean Carroll likes to say, though, the reason we've made so much progress in physics is that it's way easier than the other sciences :)
I'd argue that the easiness of physics probably comes from the fact that we can get effectively unlimited data, combined with the ability to query our reality as an oracle to test certain ideas and, importantly, to get easy verification of a theory, which helps in two ways:
The prior matters very little, because you can update to the right theory from all but the most dogmatic priors.
Easy verification short-circuits a lot of philosophical debates, and makes it easy to update towards correct theories.
However, I think the main insight, that sloppy but directionally correct ideas are useful to build upon and that partial progress matters, is a very important one with applicability well beyond physics.
This makes sense, but I'd argue that ML and interpretability have even more of both of these properties. Something that makes it harder is that some of the high-level goals of understanding transformers are inherently pretty complex, and it's also less susceptible to math/elegance-based analysis, so it's even messier :)
I think the relative ease of progress in physics has more to do with its relative compositionality, in contrast to other disciplines like biology, economics, or the theory of differential equations, in the sense Jules Hedges meant it. To quote that essay:
For examples of non-compositional systems, we look to nature. Generally speaking, the reductionist methodology of science has difficulty with biology, where an understanding of one scale often does not translate to an understanding on a larger scale. … For example, the behaviour of neurons is well-understood, but groups of neurons are not. Similarly in genetics, individual genes can interact in complex ways that block understanding of genomes at a larger scale.
Such behaviour is not confined to biology, though. It is also present in economics: two well-understood markets can interact in complex and unexpected ways. Consider a simple but already important example from game theory. The behaviour of an individual player is fully understood: they choose in a way that maximises their utility. Put two such players together, however, and there are already problems with equilibrium selection, where the actual physical behaviour of the system is very hard to predict.
More generally, I claim that the opposite of compositionality is emergent effects. The common definition of emergence is a system being ‘more than the sum of its parts’, and so it is easy to see that such a system cannot be understood only in terms of its parts, i.e. it is not compositional. Moreover I claim that non-compositionality is a barrier to scientific understanding, because it breaks the reductionist methodology of always dividing a system into smaller components and translating explanations into lower levels.
More specifically, I claim that compositionality is strictly necessary for working at scale. In a non-compositional setting, a technique for a solving a problem may be of no use whatsoever for solving the problem one order of magnitude larger. To demonstrate that this worst case scenario can actually happen, consider the theory of differential equations: a technique that is known to be effective for some class of equations will usually be of no use for equations removed from that class by even a small modification. In some sense, differential equations is the ultimate non-compositional theory.
Minor nit: The alien computer is a specific set of physical laws, which shouldn’t be confused with the general case of physics/mathematics, so we only managed to reverse engineer it for 1 universe.
Cute :). Do you mean that we’ve only engineered the alien computer running a single program (the standard model with our universe’s particular coupling constants), or something else?
Yes, I was talking about this:
Do you mean that we've only engineered the alien computer running a single program (the standard model with our universe's particular coupling constants)
but more importantly I was focused on the fact that all plausible future efforts will only reverse engineer the alien computer that runs a single program, which is essentially the analogue of the laws of physics for our physical universe.
On the surprising effectiveness of linear regression as a toy model of generalization.
Another shortform today (since Sunday is the day of rest). This time it’s really a hot take: I’m not confident about the model described here being correct.
Neural networks aren't linear—that's the whole point. They notice interesting, compositional, deep information about reality. So when people use linear regression as a qualitative comparison point for behaviors like generalization and learning, I tend to get suspicious. Nevertheless, the track record of linear regression as a model for "qualitative" asymptotic behaviors is hard to deny. Linear regression models (neatly analyzable using random matrix theory) give surprisingly accurate models of double descent, scaling phenomena, etc. (at least when compared to relatively shallow networks trained on tasks like MNIST or modular addition).
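To make the "plain linear regression already shows this" claim concrete, here is a minimal numpy sketch (my own illustration, not from any particular paper): the test error of the minimum-norm least-squares fit spikes near the interpolation threshold d ≈ n and then drops again as the model becomes heavily overparameterized, i.e. the double-descent shape.

```python
# Minimal double-descent demo with plain linear regression (illustrative sketch only).
# Sweep the feature dimension d past the number of training points n_train and watch
# the test error of the minimum-norm least-squares fit peak near d = n_train.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, noise = 100, 2000, 0.5

for d in [10, 50, 90, 100, 110, 200, 400, 1000]:
    w_true = rng.normal(size=d) / np.sqrt(d)            # fixed-scale "true" signal
    X = rng.normal(size=(n_train, d))
    X_test = rng.normal(size=(n_test, d))
    y = X @ w_true + noise * rng.normal(size=n_train)
    y_test = X_test @ w_true + noise * rng.normal(size=n_test)

    w_hat = np.linalg.pinv(X) @ y                        # minimum-norm least-squares solution
    test_mse = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"d = {d:5d}   d/n = {d / n_train:5.2f}   test MSE = {test_mse:8.3f}")
```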
I recently developed a cartoon internal model for why this may be the case. I’m not sure if it’s correct, but I’ll share it here.
The model assumes a few facts about algorithms implemented by NN’s (all of which I believe in much more strongly than the model of comparing them to linear regression):
The generalized Linear Representation Hypothesis. An NN's internal computation can be locally factorized into extracting a large collection of low-level features living in distinct low-dimensional linear subspaces, and then applying (generally nonlinear) postprocessing to these features independently or in small batches. Note that this is much weaker than stronger versions (such as the one inherent in SAEs) that posit 1-dimensional features. In my experience a version of this hypothesis is almost universally believed by engineers, and it also agrees with all the toy algorithms discovered so far.
Smoothness-ish of the data manifold. Inside the low-dimensional “feature subspace”, the data is kinda smooth—i.e., it’s smooth (i.e., locally approximately linear) in most directions, and in directions where it’s not smooth, it still might behave sorta smoothly in aggregate.
Linearity-ish of the classification signal. Even in cases like MNIST or transformer learning where the training data is discrete (and the algorithm is meant to approximate it by a continuous function), there is a sense in which it's locally well-approximated by a linear function. E.g. perhaps some coarse-graining of the discrete data is continuously linear, or at least the data boundary can be locally well approximated by a linear hyperplane (so that a local linear function can attain 100% accuracy). More generally, we can assume a similar local linearity property on the layer-to-layer forward functions, when restricted to either a single feature space or a small group of interacting feature spaces.
(At least partial) locality of the effect of weight modification. When I read it, this paper left a lasting impression on me. I'm actually not super excited about its main claims (I'll discuss polytopes later), but a very cool piece of the analysis here is locally modelling ReLU learning as building a convex function as a max of linear functions (and explaining why non-ReLU learning should exhibit a softer version of the same behavior). This is a somewhat "shallow" point of view on learning, but probably captures a nontrivial part of what's going on, and this predicts that every new weight update only has local effect—i.e., is felt in a significant way only by a small number of datapoints (the idea being that if you're defining a convex function as the max of a bunch of linear functions, shifting one of the linear functions will only change the values in places where this particular linear function was dominant). The way I think about this phenomenon is that it's a good model for "local learning", i.e., learning closer to memorization on the memorization-generalization spectrum that only updates the behavior on a small cluster of similar datapoints (e.g. the LLM circuit that completes "Barack" with "Obama"). There are possibly also more diffuse phenomena (like "understanding logic", or other forms of grokking "overarching structure"), but most likely both forms of learning occur (and it's more like a spectrum than a dichotomy). (A toy numerical sketch of this locality claim follows just after this list.)
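Point 4 is easy to poke at numerically. Here is a toy sketch (my own, not taken from the cited paper) of the locality claim: if the local picture is "take a max of affine pieces", then nudging the parameters of a single piece only moves the function on the inputs where that piece is (or becomes) the maximizer, i.e. roughly on the small cluster of points it was already responsible for.

```python
# Toy illustration of point 4: updating one affine piece of a max-of-affine (convex,
# piecewise-linear) function only changes the output where that piece is dominant.
import numpy as np

rng = np.random.default_rng(1)
n_pieces, n_points = 8, 5000

slopes = rng.normal(size=n_pieces)
intercepts = rng.normal(size=n_pieces)
x = np.linspace(-3.0, 3.0, n_points)

def max_affine(a, b, x):
    # f(x) = max_i (a_i * x + b_i)
    return np.max(a[:, None] * x[None, :] + b[:, None], axis=0)

before = max_affine(slopes, intercepts, x)

k = int(np.argmax(intercepts))        # a piece that is certainly dominant somewhere (at x = 0)
slopes2, intercepts2 = slopes.copy(), intercepts.copy()
slopes2[k] += 0.2                     # pretend a single gradient step touched only piece k
intercepts2[k] += 0.2
after = max_affine(slopes2, intercepts2, x)

changed = np.abs(after - before) > 1e-12
was_dominant = np.argmax(slopes[:, None] * x[None, :] + intercepts[:, None], axis=0) == k

print(f"inputs whose output moved:            {changed.mean():6.2%}")
print(f"inputs where piece {k} was the argmax: {was_dominant.mean():6.2%}")
# The two sets agree up to a thin boundary band: the update is only felt "locally".
```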
If we buy that these four phenomena occur, or even “occur sometimes” in a way relevant for learning, then it naturally follows that a part of the “shape” of learning and generalization is well described qualitatively by linear regression. Indeed, the model then becomes that (by point 4 above), many weight updates exclusively “focus on a single local batch of input points” in some low-dimensional feature manifold. For this particular weight update, locality of the update and smoothness of the data manifold (#2) together imply that we can model it as learning a function on a linear low-dimensional space (since smooth manifolds are locally well-approximated by a linear space). Finally, local linearity of the classification function (#3) implies that we’re learning a locally linear function on this local batch of datapoints. Thus we see that, under this collection of assumptions, the local learning subproblems essentially boil down to linear regression.
Note that the "low-dimensional feature space" assumption, #1, is necessary for any of this to even make sense. Without making this assumption, the whole picture is a non-starter and the other assumptions, #2-#4, don't make sense, since a sub-exponentially large collection of points on a high-dimensional data manifold with any degree of randomness (something that is true about the data samples in any nontrivial learning problem) will be very far away from each other, and the notion of "locality" becomes meaningless. (Note also that a weaker hypothesis than #1 would suffice—in particular, it's enough that there are low-dimensional "feature mappings" where some clustering occurs at some layer, and these don't a priori have to be linear.)
What is this model predicting? Generally I think abstract models like this aren’t very interesting until they make a falsifiable prediction or at least lead to some qualitative update on the behavior of NN’s. I haven’t thought about this very much, and would be excited if others have better ideas or can think of reasons why this model is incorrect. But one thing this model likely predicts is that a better model for a NN than a single linear regression model is a collection of qualitatively different linear regression models at different levels of granularity. In other words, depending on how sloppily you chop your data manifold up into feature subspaces, and how strongly you use the “locality” magnifying glass on each subspace, you’ll get a collection of different linear regression behaviors; you then predict that at every level of granularity, you will observe some combination of linear and nonlinear learning behaviors.
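As a toy version of this prediction (my own sketch, on made-up data rather than a real network), one can chop a nonlinear regression problem into k regions and fit an ordinary linear model inside each one; sweeping the granularity k gives exactly a "collection of linear regression models at different levels of granularity", with the fraction of the signal explained growing as the pieces shrink.

```python
# Toy "local linear models at different granularities" experiment (illustrative only).
# A smooth but nonlinear target on a low-dimensional feature space is fit by k-means
# regions, each with its own ordinary linear regression; finer granularity fits better.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(-2.0, 2.0, size=(5000, 2))             # stand-in low-dimensional feature space
y = np.sin(3.0 * X[:, 0]) + np.tanh(2.0 * X[:, 1])     # smooth but nonlinear target

for k in [1, 4, 16, 64]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    pred = np.empty_like(y)
    for c in range(k):
        mask = labels == c
        pred[mask] = LinearRegression().fit(X[mask], y[mask]).predict(X[mask])
    r2 = 1.0 - np.mean((y - pred) ** 2) / np.var(y)
    print(f"{k:3d} local linear fits -> in-sample R^2 = {r2:.3f}")
```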
This point of view makes me excited about work by Ari Brill (that, as far as I know, is unpublished—I heard a talk on it at the ILIAD conference—see the Saturday schedule, first talk in Bayes). If I understood the talk correctly, he models a data manifold as a certain stochastic fractal in a low-dimensional space and makes scaling predictions about generalization behavior depending on properties of the fractal, by thinking of the fractal as a hierarchy of smooth but noisy features. Finding similarly-flavored scaling behavior in "linear regression subphenomena" in a real-life machine learning problem would positively update me on my model above being correct.
But one thing this model likely predicts is that a better model for a NN than a single linear regression model is a collection of qualitatively different linear regression models at different levels of granularity. In other words, depending on how sloppily you chop your data manifold up into feature subspaces, and how strongly you use the “locality” magnifying glass on each subspace, you’ll get a collection of different linear regression behaviors; you then predict that at every level of granularity, you will observe some combination of linear and nonlinear learning behaviors.
Epic.
A couple things that come to mind.
Linear features = sufficient statistics of exponential families?
The simplest case is that of Gaussians and their covariance matrix (which comes down to linear regression).
Exponential families are a fairly good class, but not closed under hierarchical structure. A basic example: a mixture of Gaussians is not exponential, i.e. not described in terms of just linear regression.
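For readers who want the identification spelled out, here is the standard definition being gestured at (my paraphrase, not the commenter's): an exponential family exposes the data to the log-likelihood only through the sufficient statistic T(x), which enters linearly in the natural parameter; for a Gaussian, T(x) consists of the first and second moments, so conditioning one coordinate on the rest recovers ordinary linear regression, while a Gaussian mixture has no finite-dimensional sufficient statistic and falls outside the class.

```latex
% Exponential family: the data x enters the log-likelihood only through the
% sufficient statistic T(x), linearly in the natural parameter \theta.
\[
  p(x \mid \theta) \;=\; h(x)\,\exp\!\big(\theta^{\top} T(x) - A(\theta)\big),
  \qquad
  \text{Gaussian: } T(x) = \big(x,\; x x^{\top}\big).
\]
% A two-component Gaussian mixture is not itself an exponential family
% (the "not closed under hierarchical structure" point above):
\[
  p(x) \;=\; \pi\,\mathcal{N}(x;\mu_1,\Sigma_1) \;+\; (1-\pi)\,\mathcal{N}(x;\mu_2,\Sigma_2).
\]
```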
The centrality of ReLU neural networks.
Understanding ReLU neural networks is probably 80-90% of understanding NN architectures. At sufficient scale, pure MLPs have the same or better scaling laws than transformers.
There are several lines of evidence that gradient descent has an inherent bias towards splines / piecewise-linear functions / tropical polynomials; see e.g. here and references therein.
Serious analysis of ReLU neural networks can be done through tropical methods. A key paper is here. You say: "very cool piece of the analysis here is locally modelling ReLU learning as building a convex function as a max of linear functions (and explaining why non-ReLU learning should exhibit a softer version of the same behavior). This is a somewhat "shallow" point of view on learning, but probably captures a nontrivial part of what's going on, and this predicts that every new weight update only has local effect—i.e., is felt in a significant way only by a small number of datapoints (the idea being that if you're defining a convex function as the max of a bunch of linear functions, shifting one of the linear functions will only change the values in places where this particular linear function was dominant). The way I think about this phenomenon is that it's a good model for "local learning", i.e., learning closer to memorization on the memorization-generalization spectrum that only updates the behavior on a small cluster of similar datapoints (e.g. the LLM circuit that completes "Barack" with "Obama")." I suspect the notions one should be looking at are the activation polytope and activation fan in section 5 of the paper. The hypothesis would be something about efficiently learnable features having a 'locality' constraint on these activation polytopes, i.e. they are 'small', 'active on only a few data points'.
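To unpack the tropical framing a bit (standard background stated from memory, not a summary of the linked paper): a ReLU network computes a continuous piecewise-linear function, and any such function can be written as a difference of two convex max-affine functions, i.e. as a tropical rational function. The activation-polytope locality hypothesis above then says, roughly, that an efficiently learnable feature should involve only a few of these affine pieces, each active on only a few data points.

```latex
% A ReLU network is piecewise linear, and every continuous piecewise-linear
% f : R^n -> R admits a "difference of max-affine" (tropical rational) form:
\[
  f(x) \;=\; \max_{i \in I}\big(a_i^{\top} x + b_i\big)
        \;-\; \max_{j \in J}\big(c_j^{\top} x + d_j\big).
\]
% Each input x selects one active affine piece from each max; locality says a
% single feature or weight update should touch only a few of these pieces.
```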
Alignment is not all you need. But that doesn’t mean you don’t need alignment.
One of the fairytales I remember reading from my childhood is the “Three sillies”. The story is about a farmer encountering three episodes of human silliness, but it’s set in one more frame story of silliness: his wife is despondent because there is an axe hanging in their cottage, and she thinks that if they have a son, he will walk underneath the axe and it will fall on his head.
The frame story was much more memorable to me than any of the “body” stories, and I randomly remember this story much more often than any other fairytale I read at the age I read fairytales. I think the reason for this is that the “hanging axe” worry is a vibe very familiar from my family and friend circle, and more generally a particular kind of intellectual neuroticism that I encounter all the time, that is terrified of incomplete control or understanding.
I really like the rationalist/EA ecosphere because of its emphasis on the solvability of problems like this: noticing situations where you can just approach the problem, taking down the axe. However, a baseline of intellectual neuroticism persists (after all you wouldn’t expect otherwise from a group of people who pull smoke alarms on pandemics and existential threats that others don’t notice). Sometimes it’s harmless or even beneficial. But a kind of neuroticism in the community that bothers me, and seems counterproductive, is a certain “do it perfectly or you’re screwed” perfectionism that pervades a lot of discussions. (This is also familiar to me from my time as a mathematician: I’ve had discussions with very intelligent and pragmatic friends who rejected even the most basic experimentally confirmed facts of physics because “they aren’t rigorously proven”.)
A particular train of discussion that annoyed me in this vein was the series of responses to Raemon’s “preplanning and baba is you” post. The initial post I think makes a nice point—it suggests as an experiment trying to solve levels of a move-based logic game by pre-planning every step in advance, and points out that this is hard. Various people tried this experiment and found that it’s hard. This was presented as an issue in solving alignment, in worlds where “we get one shot”. But what annoyed me was the takeaway.
I think a lot of the great things about the intellectual vibe in the (extended) LW and EA communities is that “you have more ways to solve problems than you think”. However, there is a particular kind of virtue-signally class of problems where trying to find shortcuts or alternatives is frowned upon and the only accepted form of approach is “trying harder” (another generalized intellectual current in the LW-osphere that I strongly dislike).
Back to the “Baba is you” experiment. The best takeaway, I think, is that we should avoid situations where we need to solve complex problems in one shot, and we should work towards making sure this situation doesn’t exist (and we should just give up on trying to make progress in worlds where we get absolutely no new insights before the do-or-die step of making AGI). Doing so, at least without superhuman assistance, is basically impossible. Attempts at this tend to be not only silly but counterproductive: the “graveyard” of failed idealistic movements are chock-full of wannabe Hari Seldons who believe that they have found the “perfect solution”, and are willing to sacrifice everything to realize their grand vision.
This doesn’t mean we need to give up, or only work on unambitious, practical applications. But it does mean that we have to admit that things can be useful to work on in expectation before we have a “complete story for how they save the world”.
Note that what is being advocated here is not an “anything goes” mentality. I certainly think that AI safety research can be too abstract, too removed from any realistic application in any world. But there is a large spectrum of possibilities between “fully plan how you will solve a complex logic game before trying anything” and “make random jerky moves because they ‘feel right’”.
I’m writing this in response to Adam Jones’ article on AI safety content.. I like a lot of the suggestions. But I think the section on alignment plans suffers from the “axe” fallacy that I claim is somewhat endemic here. Here’s the relevant quote:
I strongly disagree with this being a good thing to do!
We’re not going to have a good, end-to-end plan about how to save the world from AGI. Even now, with ever more impressive and scary AIs becoming a comonplace, we have very little idea about what AGI will look like, what kinds of misalignment it will have, where the hard bits of checking it for intent and value alignment will be. Trying to make extensive end-to-end plans can be useful, but can also lead to a strong streetlight effect: we’ll be overcommitting to current understanding, current frames of thought (in an alignment community that is growing and integrating new ideas with an exponential rate that can be factored in months, not years).
Don’t get me wrong. I think it’s valuable to try to plan things where our current understanding is likely to at least partially persist: how AI will interface with government, general questions of scaling and rough models of future development. But we should also understand that our map has lots of blanks, especially when we get down to thinking about what we will understand in the future. What kinds of worrying behaviors will turn out to be relevant and which ones will be silly in retrospect? What kinds of guarantees and theoretical foundations will our understanding of AI encompass? We really don’t know, and trying to chart a course through only the parts of the map that are currently filled out is an extremely limited way of looking at things.
So instead of trying to solve the alignment problem end to end what I think we should be doing is:
getting a variety of good, rough frames on how the future of AI might go
thinking about how these will integrate with human systems like government, industry, etc.
understanding more things, to build better models in the future.
I think the last point is crucial, and should be what modern alignment and interpretability is focused on. We really do understand a lot more about AI than we did a few years ago (I’m planning a post on this). And we’ll understand more still. But we don’t know what this understanding will be. We don’t know how it will integrate with existing and emergent actors and incentives. So instead of trying to one-shot the game and write an ab initio plan for how work on quantifying creativity in generative vision models will lead to the world being saved, I think there is a lot of room to just do good research. Fill in the blank patches on that map before routing a definitive course on it. Sure, maybe don’t waste time on the patches in the far corners which are too abstract or speculative or involve too much backchaining. But also don’t try to predict all the axes that will be on the wall in the future before looking more carefully at a specific, potentially interesting, axe.
FYI I think by the time I wrote Optimistic Assumptions, Longterm Planning, and “Cope”, I think I had updated on the things you criticize about it here (but, I had started writing it awhile ago from a different frame and there is something disjointed about it)
But, like, I did mean both halfs of this seriously:
...
Re:
I think in some sense I agree with you – the actual real plans won’t be end-to-end. And I think I agree with you about some kind of neuroticism that unhelpfully bleeds through a lot of rationalist work. (Maybe in particular: actual real solutions to things tend to be a lot messier than the beautiful code/math/coordination-frameworks an autistic idealist dreams up)
But, there’s still something like “plans are worthless, but planning is essential.” I think you should aim for the standard of “you have a clear story for how your plan fits into something that solves the hard parts of the problem.” (or, we need way more people doing that sort of thing, since most people aren’t really doing it at all)
Some ways that I think about End to End Planning (and, metastrategy more generally)
Because there are multiple failure modes, I treat myself as having multiple constraints I have to satisfy:
My plans should backchain from solving the key problems I think we ultimately need to solve
My plans should forward chain through tractable near goals with at least okay-ish feedback loops. (If the okay-ish feedback loops don’t exist yet, try to be inventing them. Although don’t follow that off a cliff either – I was intigued by Wentworth’s recent note that overly focusing on feedback loops led him to predictably waste some time)
Ship something to external people, fairly regularly
Be Wholesome (that is to say, when I look at the whole of what I’m doing it feels healthy, not like I’ve accidentally min-maxed my way into some brittle extreme corner of optimization space)
And for end to end planning, have a plan for...
up through the end of my current OODA loop
(maybe up through a second OODA loop if I have a strong guess for how the first OODA loop goes)
as concrete a plan as I can, assuming no major updates from the first OODA loop, up through the end of the agenda.
as concrete a visualization of the followup steps after my plan ends, for how it goes on to positively impact the world.
End to End plans don’t mean you don’t need to find better feedbackloops or pivot. You should plan that into the plan (And also expect to be surprised about it anyway). But, I think if you don’t concretely visualize how it fits together you’re like to go down some predictably wasteful paths.
Something like this is a big reason why I’m not a fan of MIRI, because I think this sort of neuroticism is at somewhat encouraged by that group.
Also, remember that the current LW community is selected for scrupulosity and neuroticism, which IMO is not that good for solving a lot of problems:
Richard Ngo and John Maxwell illustrate it here:
https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown#g4BJEqLdvzgsjngX2
https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown#3GqvMFTNdCqgfRNKZ
We have in fact reverse engineered alien datastructures.
I’m trying to keep up a regular schedule of writing for my Nanowrimo project. I’m working on a post that’s not done yet, so for today I’ll write a quick (and not very high-effort) take related to discussions I’ve had with more doomy friends (I’m looking at you Kaarel Hänni). A good (though incomplete) crux on why something like “faithful interpretability” may be hard (faithful here being “good enough to genuinely decompose enough of the internal thought process to notice and avoid deception”) is contained in Tsvi’s post on “alien datastructures”. I think it’s a great piece, though I largely probably agree with it only about 50%. A very reductive explanation of the core parable in the piece is that we imagine we’ve found an ancient alien computer from a civilization with a totally separate parallel development of math and logic and computing. The protagonist of the koan lectures on why it might be hard to “do science” to reverse engineer what the computer does, and it might be better to “do deep thinking” to try to understand what problems the computer is trying to accomplish, and then try to come up with the right abstractions. In the case of AI, the recommendation is that this should in particular be done by introspecting and trying to understand our own thinking at a high level. There’s a great quote that I’ll include because it sounds nice:
I don’t want to debate the claims of the piece, since after all it’s a koan and is meant to be a parable and a vibe. But I think that as a parallel parable, it’s useful to know that humanity has actually reverse engineered a massively alien computer, and we did it extremely well, and via a combination of “science” (i.e., iteratively designing and testing and refining models) and high-level thinking, though with more of an emphasis on science.
The alien computer in this analogy is physics. It’s hard to emphasize how different physics is from what people in the past thought it would be. All our past ideas of reality have been challenged: waves aren’t really waves, particles aren’t really particles, time isn’t really time, space isn’t really (just) space. I think that a lot of people who have learned about math or physics, but don’t have experience doing research in it, have a romantic idea that it’s driven by “deep insight”. That physics sits around waiting for the next Newton or Einstein to get a new idea, and then adapts to that idea with the experimentalists there to confirm the great theoreticist’s theories, and the “non-great” rank and file there to compute out their consequences. And like sure, this kind of flow from theorists out is one thing that exists, but it’s mostly wrong. Mostly the way we understand the wild, alien structure that is physics is through just, well, doing science, which involves people tweaking old paradigms, gathering data, finding things that don’t make sense, and then folding the new information back into the big structure. The key point here (that physicists understand viscerally) is that there’s not a big, concept-centric order dependence to discovery. You can understand quantum mechanics first, you can understand relativity first, you can start with waves or particles; in the end, if you are serious about refining and testing your intuitions, and you do calculations to get stuff that increasingly makes more sense, you are on the right track.
A fun example here comes from the discovery of Heisenberg’s “matrix” model of quantum mechanics. One could imagine a romantic Eureka-moment picture of people suddenly grasping the indeterminacy of reality after thinking deeply about experiments that don’t make sense. But the reality was that after Max Planck’s first idea of quanta of energy levels in 1900 through 1925, people used measurements and arbitrary combinatorics that seemed to mostly agree with experiment to cobble together a weird theory (that the next generation called “Old quantum theory”) of allowed and disallowed orbits that everyone admitted didn’t make sense and fell apart on deep consideration, but sorta worked. After a significant amount of progress was made in this way, Heisenberg and Sommerfeld took a two-parameter table of numbers that went into this theory, made a matrix out of it, and then (in my somewhat cartoonish understanding) noticed that this matrix could be used to measure some other phenomena better. Then Heisenberg realized that just by looking at the matrix, making it complex and viewing it as an evolution operator, you could get a unifying and more satisfying explanation of many of the disparate phenomena in this old theory; it would then take more time for people to develop the more modern concepts of measurement, decoherence, and many worlds (the last of which still meets resistance among the old guard today).
The point is that physics doesn’t (only) grow through great insight. This might seem surprising to someone who grew up in a certain purist philosophy of logic that, I think incorrectly, has been cultivated on lesswrong: roughly, that ideas are either right, or useless. In fact, the experience of a physicist is that sloppy stopgap ideas that explain experiment or theory in a slightly more systematic way are often useful, though for reasons that will often elude the sloppy idea-haver (physics has a perennial problem of the developers of sloppy ideas getting overinvested in their “exact correctness”, and being resistant to future more sensible systematization—but this is a different story). In physics ideas are deep, but they’re weirdly un-path dependent. There are many examples of people arriving at the same place by different sloppy and unsystematic routes (the Feynman path integral vs. other interpretations of quantum field theory being a great example). This uncanny ability to make progress by taking stopgap measures and then slowly refining them surprises physicists themselves, and I think that some very cool physics “ideology” comes from trying to make sense of this. Two meta-explanations that I think were in part a response to physicists trying to figure out “what the hell is going on”, but led to physics that is used today (especially in the first case) are:
effective field theory and the renormalization group flow, coming from ideas of Wilson about how theories at different energy levels can be effectively replaced by simpler theories at other layers (and relating to ideas of Landau and others of complicated concrete theories self-correcting to converge to certain elegant but more abstract “universal” theories at larger scale)
the “bootstrap” idea that complicated theories with unknown moving parts have measurements that satisfy certain standard formulas and inequalities, and one can often get quite far (indeed sometimes, all the way) by ignoring the “physics” and only looking at the formulas as a formal system.
Both of these physics ideas identify a certain shape of “emergent structure”, where getting at something with even a very small amount of “orientational push” from reality, or from known sloppy structure, will tend to lead to the same “correct” theory at scales we care about.
This doesn’t mean that big ideas and deep thinking are useless. But I think that this does point (at least in physics) to taking more seriously ideas that “aren’t fully self-consistent yet” from a more demanding mathematical standpoint. We can play random fun measurement and calculation and systematization games, and tinker (“morally, effectively, funly, cooperatively, creatively, agentically, truth-trackingly, understandingly”) with systematizations of the results of these games, and we’re likely to get there in the end.
At least if the alien computer is physics.
For later. In my posts I’m hoping to talk in more detail about renormalization and the self-correcting nature of universality, and also about why I think that the complexity of (certain areas of) physics is a good match for the complexity of neural nets, and why I think the modern field of interpretability and “theoretical ML” more generally is much further along this game of tinkering and systematizing than many people think (in particular, in some ways we’re beyond the “Old Quantum Theory”-style mess). But this is it for now.
As Sean Carroll likes to say, though, the reason we’ve made so much progress in physics is that it’s way easier than the other sciences :)
I’d argue that the easiness of physics probably comes from the fact that we can get effectively unlimited data, combined with the ability to query our reality as an oracle to test certain ideas and importantly get easy verification of a theory, which helps in 2 ways:
The prior matters very little, because you can update to the right theory from almost all but the most dogmatic priors.
Verifying being easy shortcuts a lot of philosophical debates, and makes it easy to update towards correct theories.
However, I think the main insight that sloppy but directionally correct ideas being useful to build upon, combined with partial progress being important is a very important idea that has applicability beyond physics.
This makes sense, but I’d argue that ML and interpretability has even more of both of these properties. Something that makes it harder is that some of the high-level goals of understanding transformers are inherently pretty complex, and also it’s less susceptible to math/ elegance-based analysis, so is even more messy :)
I think what explains the relative ease of progress in physics has more so to do with its relative compositionality in contrast to other disciplines like biology or economics or the theory of differential equations, in the sense Jules Hedges meant it. To quote that essay:
Minor nit: The alien computer is a specific set of physical laws, which shouldn’t be confused with the general case of physics/mathematics, so we only managed to reverse engineer it for 1 universe.
Cute :). Do you mean that we’ve only engineered the alien computer running a single program (the standard model with our universe’s particular coupling constants), or something else?
Yes, I was talking about this:
but more importantly I was focused on the fact that all plausible future efforts will only reverse engineered the alien computer that runs a single program, which is essentially the analogue of the laws of physics for our physical universe.
On the surprising effectiveness of linear regression as a toy model of generalization.
Another shortform today (since Sunday is the day of rest). This time it’s really a hot take: I’m not confident about the model described here being correct.
Neural networks aren’t linear—that’s the whole point. They notice interesting, compositional, deep information about reality. So when people use linear regression as a qualitative comparison point for behaviors like generalization and learning, I tend to get suspicious. Nevertheless, the track record of linear regression as a model for “qualitative” asymptotic behaviors is hard to deny. Linear regression models (neatly analyzable using random matrix theory) give surprisingly accurate models of double descent, scaling phenomena, etc. (at least when comparing to relatively shallow networks like mnist or modular addition).
I recently developed a cartoon internal model for why this may be the case. I’m not sure if it’s correct, but I’ll share it here.
The model assumes a few facts about the algorithms implemented by NNs (all of which I believe much more strongly than the comparison to linear regression itself):
The generalized Linear Representation Hypothesis. An NN’s internal workings can be locally factorized into a large collection of low-level features living in distinct low-dimensional linear subspaces, with (generally nonlinear) postprocessing then applied to these features independently or in small batches. Note that this is much weaker than stronger versions (such as the one inherent in SAEs) that posit 1-dimensional features. In my experience a version of this hypothesis is almost universally believed by engineers, and it also agrees with all the toy algorithms discovered so far.
Smoothness-ish of the data manifold. Inside the low-dimensional “feature subspace”, the data is kind of smooth: it is smooth (i.e., locally approximately linear) in most directions, and in the directions where it isn’t, it still might behave roughly smoothly in aggregate.
Linearity-ish of the classification signal. Even in cases like MNIST or transformer learning where the training data is discrete (and the algorithm is meant to approximate it by a continuous function), there is a sense in which it is locally well approximated by a linear function. E.g., perhaps some coarse-graining of the discrete data is continuous and linear, or at least the data boundary can be locally well approximated by a linear hyperplane (so that a local linear function can attain 100% accuracy). More generally, we can assume a similar local-linearity property of the layer-to-layer forward functions, when restricted to a single feature space or a small group of interacting feature spaces.
(At least partial) locality of the effect of weight modification. When I read it, this paper left a lasting impression on me. I’m actually not super excited about its main claims (I’ll discuss polytopes later), but a very cool piece of the analysis here is locally modelling ReLU learning as building a convex function as a max of linear functions (and explaining why non-ReLU learning should exhibit a softer version of the same behavior). This is a somewhat “shallow” point of view on learning, but probably captures a nontrivial part of what’s going on, and this predicts that every new weight update only has local effect—i.e., is felt in a significant way only by a small number of datapoints (the idea being that if you’re defining a convex function as the max of a bunch of linear functions, shifting one of the linear functions will only change the values in places where this particular linear function was dominant). The way I think about this phenomenon is that it’s a good model for “local learning”, i.e., learning closer to memorization on the memorization-generalization spectrum that only updates the behavior on a small cluster of similar datapoints (e.g. the LLM circuit that completes “Barack” with “Obama”). There are also possibly more diffuse phenomena (like “understanding logic”, or other forms of grokking “overarching structure”), but most likely both forms of learning occur (and it’s more like a spectrum than a dichotomy). (A tiny numerical sketch of the “shifting one linear piece only has local effect” point follows right after this list.)
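Here is that sketch of point 4 (my own toy, not taken from the paper): build a convex function as the max of a handful of random linear pieces, nudge one piece upward, and check that the function only changes where that piece is (or becomes) the dominant one.

```python
# Locality of a "weight update" for a convex function built as a max of
# linear pieces: shifting one piece changes the max only where that piece
# is (or becomes) dominant.
import numpy as np

rng = np.random.default_rng(1)
K = 8
a, b = rng.normal(size=K), rng.normal(size=K)
x = np.linspace(-3, 3, 601)

vals = a[:, None] * x[None, :] + b[:, None]   # each linear piece on the grid
f_before = vals.max(axis=0)
dominant = vals.argmax(axis=0)                # which piece is on top where

j = 3
b[j] += 0.3                                   # "update" a single piece
f_after = (a[:, None] * x[None, :] + b[:, None]).max(axis=0)

changed = ~np.isclose(f_before, f_after)
print("fraction of grid where f changed:        ", round(changed.mean(), 3))
print("fraction of grid where piece j was on top:", round((dominant == j).mean(), 3))
# The two fractions are close: the update is felt only where piece j was
# already dominant, plus a thin band it newly captures.
```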
If we buy that these four phenomena occur, or even “occur sometimes” in a way relevant for learning, then it naturally follows that a part of the “shape” of learning and generalization is well described qualitatively by linear regression. Indeed, the model then becomes that (by point 4 above), many weight updates exclusively “focus on a single local batch of input points” in some low-dimensional feature manifold. For this particular weight update, locality of the update and smoothness of the data manifold (#2) together imply that we can model it as learning a function on a linear low-dimensional space (since smooth manifolds are locally well-approximated by a linear space). Finally, local linearity of the classification function (#3) implies that we’re learning a locally linear function on this local batch of datapoints. Thus we see that, under this collection of assumptions, the local learning subproblems essentially boil down to linear regression.
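To make the reduction above concrete, here is a cartoon check with an entirely made-up setup: a nonlinear “classification signal” defined on a 2-dimensional feature subspace embedded in a higher-dimensional ambient space is poorly fit by one global linear regression, but well fit by linear regression restricted to a local batch of datapoints.

```python
# Cartoon "local learning = linear regression" check on a toy feature subspace.
import numpy as np

rng = np.random.default_rng(2)
d_feat, d_ambient, n = 2, 16, 5000
proj = rng.normal(size=(d_feat, d_ambient)) / np.sqrt(d_ambient)

z = rng.uniform(-2, 2, size=(n, d_feat))      # low-dimensional feature coordinates
x = z @ proj                                  # datapoints embedded in ambient space
y = np.sin(2 * z[:, 0]) + z[:, 1] ** 2        # nonlinear "classification signal"

def linear_fit_mse(xs, ys):
    A = np.c_[xs, np.ones(len(xs))]           # ordinary least squares with intercept
    w, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return np.mean((A @ w - ys) ** 2)

local = np.linalg.norm(z - z[0], axis=1) < 0.5   # a "local batch" in feature space
print("global linear fit MSE:", round(linear_fit_mse(x, y), 3))
print("local  linear fit MSE:", round(linear_fit_mse(x[local], y[local]), 3))
```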
Note that the “low-dimensional feature space” assumption, #1, is necessary for any of this to even make sense. Without making this assumption, the whole picture is a non-starter and the other assumptions, #2-#4 don’t make sense, since a sub-exponentially large collection of points on a high-dimensional data manifold with any degree of randomness (something that is true about the data samples in any nontrivial learning problem) will be very far away from each other and the notion of “locality” becomes meaningless. (Note also that a weaker hypothesis than #1 would suffice—in particular, it’s enough that there are low-dimensional “feature mappings” where some clustering occurs at some layer, and these don’t a priori have to be linear.)
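A quick illustration of why assumption #1 is load-bearing, with arbitrary made-up sizes: for i.i.d. random points in high dimension, all pairwise distances concentrate around the same value, so “local neighborhoods” stop containing anything.

```python
# Distance concentration: without low-dimensional structure, no pair of random
# points is meaningfully closer than any other pair.
import numpy as np

rng = np.random.default_rng(3)
n = 500
for d in [2, 10, 100, 1000]:
    X = rng.normal(size=(n, d))
    sq = (X ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))
    off = dist[np.triu_indices(n, k=1)]       # all pairwise distances
    print(f"d = {d:4d}   closest pair / mean distance = {off.min() / off.mean():.2f}")
# The ratio climbs towards 1 as d grows: "locality" loses its meaning.
```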
What does this model predict? Generally I think abstract models like this aren’t very interesting until they make a falsifiable prediction or at least lead to some qualitative update on the behavior of NNs. I haven’t thought about this very much, and would be excited if others have better ideas or can think of reasons why this model is incorrect. But one thing this model likely predicts is that a better model for a NN than a single linear regression model is a collection of qualitatively different linear regression models at different levels of granularity. In other words, depending on how sloppily you chop your data manifold up into feature subspaces, and how strongly you use the “locality” magnifying glass on each subspace, you’ll get a collection of different linear regression behaviors; you then predict that at every level of granularity, you will observe some combination of linear and nonlinear learning behaviors.
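One crude way to operationalize this “different granularities” picture (this is my framing, and it assumes scikit-learn is available): split the inputs into k clusters and fit a separate linear regression on each; sweeping k gives a family of piecewise-linear models at different levels of granularity.

```python
# Piecewise-linear fits at different granularities on a toy smooth target.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 4000
x = rng.uniform(-3, 3, size=(n, 2))
y = np.sin(2 * x[:, 0]) * np.cos(x[:, 1])     # smooth but nonlinear target

for k in [1, 4, 16, 64]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
    sse = 0.0
    for c in range(k):
        m = labels == c
        pred = LinearRegression().fit(x[m], y[m]).predict(x[m])
        sse += ((pred - y[m]) ** 2).sum()
    print(f"k = {k:3d}   piecewise-linear fit MSE = {sse / n:.4f}")
# Finer granularity: each local linear model fits better, and every level of
# granularity defines its own family of linear-regression subproblems.
```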
This point of view makes me excited about work by Ari Brill (which, as far as I know, is unpublished—I heard a talk on it at the ILIAD conference—see the Saturday schedule, first talk in Bayes). If I understood the talk correctly, he models a data manifold as a certain stochastic fractal in a low-dimensional space and makes scaling predictions about generalization behavior depending on properties of the fractal, by thinking of the fractal as a hierarchy of smooth but noisy features. Finding similarly flavored scaling behavior for “linear regression subphenomena” in a real-life machine learning problem would positively update me towards the model above being correct.
Ari’s work is on arXiv here
Loving this!
Epic.
A couple things that come to mind.
Linear features = sufficient statistics of exponential families?
The simplest case is that of Gaussians and the covariance matrix (which comes down to linear regression).
formalized by GPD theorem
see generalization by John
Exponential families are a fairly good class, but they are not closed under hierarchical structure. A basic example: a mixture of Gaussians is not an exponential family, i.e. it is not described in terms of just linear regression (a small numerical illustration follows below).
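Here is that illustration (my own toy, assuming scikit-learn is available): two datasets with identical mean and variance, one unimodal and one bimodal, get exactly the same single-Gaussian fit, because that fit only sees the sufficient statistics, while a 2-component Gaussian mixture clearly tells them apart.

```python
# A single Gaussian only sees sufficient statistics; a Gaussian mixture does not.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
n = 4000
unimodal = rng.normal(0.0, np.sqrt(10.0), size=(n, 1))          # one broad Gaussian
bimodal = np.concatenate([rng.normal(-3, 1, size=(n // 2, 1)),  # same mean and
                          rng.normal(+3, 1, size=(n // 2, 1))]) # variance (about 10)

for name, data in [("unimodal", unimodal), ("bimodal", bimodal)]:
    mu, var = data.mean(), data.var()   # the Gaussian fit = the sufficient statistics
    ll1 = GaussianMixture(n_components=1, random_state=0).fit(data).score(data)
    ll2 = GaussianMixture(n_components=2, random_state=0).fit(data).score(data)
    print(f"{name:8s}: Gaussian fit mu={mu:+.2f}, var={var:.2f}; "
          f"mixture log-likelihood gain per point = {ll2 - ll1:.3f}")
# Both datasets get essentially the same Gaussian fit, but the mixture's
# log-likelihood gain is near zero only for the unimodal one.
```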
The centrality of ReLU neural networks.
Understanding ReLU neural networks is probably 80-90% of understanding NN architectures. At sufficient scale, pure MLPs have the same or better scaling laws than transformers.
There are several lines of evidence that gradient descent has an inherent bias towards splines/piecewise-linear functions/tropical polynomials; see e.g. here and references therein (a small sketch of the piecewise-linear structure is at the end of this comment).
Serious analysis of ReLU neural networks can be done through tropical methods. A key paper is here. You say:
“very cool piece of the analysis here is locally modelling ReLU learning as building a convex function as a max of linear functions (and explaining why non-ReLU learning should exhibit a softer version of the same behavior). This is a somewhat “shallow” point of view on learning, but probably captures a nontrivial part of what’s going on, and this predicts that every new weight update only has local effect—i.e., is felt in a significant way only by a small number of datapoints (the idea being that if you’re defining a convex function as the max of a bunch of linear functions, shifting one of the linear functions will only change the values in places where this particular linear function was dominant). The way I think about this phenomenon is that it’s a good model for “local learning”, i.e., learning closer to memorization on the memorization-generalization spectrum that only updates the behavior on a small cluster of similar datapoints (e.g. the LLM circuit that completes “Barack” with “Obama”). “
I suspect the notions one should be looking at are the activation polytope and activation fan in Section 5 of the paper. The hypothesis would be something like: efficiently learnable features have a ‘locality’ constraint on these activation polytopes, i.e. they are ‘small’, ‘active on only a few data points’.
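To make the polytope language concrete, here is a rough sketch (my own toy, with an untrained random network standing in for a trained one): group datapoints by their ReLU activation pattern, which is exactly the partition of input space into activation polytopes, count how many points each polytope contains, and verify that the network is exactly affine on any single polytope (the piecewise-linear/tropical structure mentioned above).

```python
# Activation polytopes of a tiny ReLU network: points sharing an activation
# pattern lie in the same polytope, and the network is exactly affine there.
import numpy as np

rng = np.random.default_rng(6)
n, d, width = 2000, 2, 8
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(width, d)), rng.normal(size=width)
W2, b2 = rng.normal(size=(1, width)), rng.normal(size=1)

pre = X @ W1.T + b1
f = (np.maximum(pre, 0) @ W2.T + b2).ravel()

patterns = pre > 0                                   # activation pattern per datapoint
uniq, inverse, counts = np.unique(patterns, axis=0,
                                  return_inverse=True, return_counts=True)
inverse = inverse.ravel()
print("occupied polytopes:", len(uniq),
      "| largest:", counts.max(), "| median size:", int(np.median(counts)))

# On a fixed activation pattern the network is an affine map, so an ordinary
# linear regression on the points of any one polytope reproduces f exactly.
mask = inverse == np.argmax(counts)                  # the most populated polytope
A = np.c_[X[mask], np.ones(mask.sum())]
w, *_ = np.linalg.lstsq(A, f[mask], rcond=None)
print("max |residual| of a linear fit inside one polytope:",
      float(np.abs(A @ w - f[mask]).max()))
```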