Alignment is not all you need. But that doesn’t mean you don’t need alignment.
One of the fairytales I remember reading from my childhood is "The Three Sillies". The story is about a farmer encountering three episodes of human silliness, but it's set inside one more frame story of silliness: his wife is despondent because there is an axe hanging in their cottage, and she thinks that if they have a son, he will walk underneath the axe and it will fall on his head.
The frame story was much more memorable to me than any of the “body” stories, and I randomly remember this story much more often than any other fairytale I read at the age I read fairytales. I think the reason for this is that the “hanging axe” worry is a vibe very familiar from my family and friend circle, and more generally a particular kind of intellectual neuroticism that I encounter all the time, that is terrified of incomplete control or understanding.
I really like the rationalist/EA ecosphere because of its emphasis on the solvability of problems like this: noticing situations where you can just approach the problem, taking down the axe. However, a baseline of intellectual neuroticism persists (after all you wouldn’t expect otherwise from a group of people who pull smoke alarms on pandemics and existential threats that others don’t notice). Sometimes it’s harmless or even beneficial. But a kind of neuroticism in the community that bothers me, and seems counterproductive, is a certain “do it perfectly or you’re screwed” perfectionism that pervades a lot of discussions. (This is also familiar to me from my time as a mathematician: I’ve had discussions with very intelligent and pragmatic friends who rejected even the most basic experimentally confirmed facts of physics because “they aren’t rigorously proven”.)
A particular train of discussion that annoyed me in this vein was the series of responses to Raemon’s “preplanning and baba is you” post. The initial post I think makes a nice point—it suggests as an experiment trying to solve levels of a move-based logic game by pre-planning every step in advance, and points out that this is hard. Various people tried this experiment and found that it’s hard. This was presented as an issue in solving alignment, in worlds where “we get one shot”. But what annoyed me was the takeaway.
I think a lot of what's great about the intellectual vibe in the (extended) LW and EA communities is the sense that "you have more ways to solve problems than you think". However, there is a particular, virtue-signaling class of problems where trying to find shortcuts or alternatives is frowned upon and the only accepted approach is "trying harder" (another generalized intellectual current in the LW-osphere that I strongly dislike).
Back to the "Baba is you" experiment. The best takeaway, I think, is that we should avoid situations where we need to solve complex problems in one shot, and we should work towards making sure this situation never arises (and we should just give up on trying to make progress in worlds where we get absolutely no new insights before the do-or-die step of making AGI). Solving a complex problem in one shot, at least without superhuman assistance, is basically impossible. Attempts at it tend to be not only silly but counterproductive: the "graveyard" of failed idealistic movements is chock-full of wannabe Hari Seldons who believed they had found the "perfect solution", and were willing to sacrifice everything to realize their grand vision.
This doesn’t mean we need to give up, or only work on unambitious, practical applications. But it does mean that we have to admit that things can be useful to work on in expectation before we have a “complete story for how they save the world”.
Note that what is being advocated here is not an “anything goes” mentality. I certainly think that AI safety research can be too abstract, too removed from any realistic application in any world. But there is a large spectrum of possibilities between “fully plan how you will solve a complex logic game before trying anything” and “make random jerky moves because they ‘feel right’”.
I'm writing this in response to Adam Jones' article on AI safety content. I like a lot of the suggestions. But I think the section on alignment plans suffers from the "axe" fallacy that I claim is somewhat endemic here. Here's the relevant quote:
For the last few weeks, I’ve been working on trying to find plans for AI safety. They should cover the whole problem, including the major hurdles after intent alignment. Unfortunately, this has not gone well—my rough conclusion is that there aren’t any very clear and well publicised plans (or even very plausible stories) for making this go well. (More context on some of this work can be found in BlueDot Impact’s AI safety strategist job posting).
(emphasis mine).
I strongly disagree with this being a good thing to do!
We're not going to have a good, end-to-end plan about how to save the world from AGI. Even now, with ever more impressive and scary AIs becoming commonplace, we have very little idea about what AGI will look like, what kinds of misalignment it will have, or where the hard bits of checking it for intent and value alignment will be. Trying to make extensive end-to-end plans can be useful, but it can also lead to a strong streetlight effect: we'll be overcommitting to current understanding and current frames of thought (in an alignment community that is growing and integrating new ideas at an exponential rate measured in months, not years).
Don’t get me wrong. I think it’s valuable to try to plan things where our current understanding is likely to at least partially persist: how AI will interface with government, general questions of scaling and rough models of future development. But we should also understand that our map has lots of blanks, especially when we get down to thinking about what we will understand in the future. What kinds of worrying behaviors will turn out to be relevant and which ones will be silly in retrospect? What kinds of guarantees and theoretical foundations will our understanding of AI encompass? We really don’t know, and trying to chart a course through only the parts of the map that are currently filled out is an extremely limited way of looking at things.
So instead of trying to solve the alignment problem end to end, what I think we should be doing is:
getting a variety of good, rough frames on how the future of AI might go
thinking about how these will integrate with human systems like government, industry, etc.
understanding more things, to build better models in the future.
I think the last point is crucial, and should be what modern alignment and interpretability is focused on. We really do understand a lot more about AI than we did a few years ago (I'm planning a post on this). And we'll understand more still. But we don't know what this understanding will be. We don't know how it will integrate with existing and emergent actors and incentives. So instead of trying to one-shot the game and write an ab initio plan for how work on quantifying creativity in generative vision models will lead to the world being saved, I think there is a lot of room to just do good research. Fill in the blank patches on that map before charting a definitive course on it. Sure, maybe don't waste time on the patches in the far corners which are too abstract or speculative or involve too much backchaining. But also don't try to predict all the axes that will be on the wall in the future before looking more carefully at a specific, potentially interesting, axe.
FYI, I think by the time I wrote Optimistic Assumptions, Longterm Planning, and "Cope", I had updated on the things you criticize about it here (but I had started writing it a while ago from a different frame, and there is something disjointed about it)
But, like, I did mean both halves of this seriously:
I think you should be scared about this, if you’re the sort of theoretic researcher, who’s trying to cut at the hardest parts of the alignment problem (whose feedback loops are weak or nonexistent)
I think you should be scared about this, if you’re the sort of Prosaic ML researcher who does have a bunch of tempting feedback loops for current generation ML, but a) it’s really not clear whether or how those apply to aligning superintelligent agents, b) many of those feedback loops also basically translate into enhancing AI capabilities and moving us toward a more dangerous world.
...
Re:
For the last few weeks, I’ve been working on trying to find plans for AI safety. They should cover the whole problem, including the major hurdles after intent alignment.
I strongly disagree with this being a good thing to do! We’re not going to have a good, end-to-end plan about how to save the world from AGI.
I think in some sense I agree with you – the actual real plans won’t be end-to-end. And I think I agree with you about some kind of neuroticism that unhelpfully bleeds through a lot of rationalist work. (Maybe in particular: actual real solutions to things tend to be a lot messier than the beautiful code/math/coordination-frameworks an autistic idealist dreams up)
But, there’s still something like “plans are worthless, but planning is essential.” I think you should aim for the standard of “you have a clear story for how your plan fits into something that solves the hard parts of the problem.” (or, we need way more people doing that sort of thing, since most people aren’t really doing it at all)
Some ways that I think about End to End Planning (and, metastrategy more generally)
Because there are multiple failure modes, I treat myself as having multiple constraints I have to satisfy:
My plans should backchain from solving the key problems I think we ultimately need to solve
My plans should forward chain through tractable near goals with at least okay-ish feedback loops. (If the okay-ish feedback loops don't exist yet, try to be inventing them. Although don't follow that off a cliff either – I was intrigued by Wentworth's recent note that overly focusing on feedback loops led him to predictably waste some time)
Ship something to external people, fairly regularly
Be Wholesome (that is to say, when I look at the whole of what I’m doing it feels healthy, not like I’ve accidentally min-maxed my way into some brittle extreme corner of optimization space)
And for end to end planning, have a plan for...
up through the end of my current OODA loop
(maybe up through a second OODA loop if I have a strong guess for how the first OODA loop goes)
as concrete a plan as I can, assuming no major updates from the first OODA loop, up through the end of the agenda.
as concrete a visualization of the followup steps after my plan ends, for how it goes on to positively impact the world.
End to End plans don't mean you don't need to find better feedback loops or pivot. You should plan that into the plan (and also expect to be surprised about it anyway). But, I think if you don't concretely visualize how it fits together you're likely to go down some predictably wasteful paths.
I think the reason for this is that the “hanging axe” worry is a vibe very familiar from my family and friend circle, and more generally a particular kind of intellectual neuroticism that I encounter all the time, that is terrified of incomplete control or understanding.
Something like this is a big reason why I'm not a fan of MIRI, because I think this sort of neuroticism is somewhat encouraged by that group.
Also, remember that the current LW community is selected for scrupulosity and neuroticism, which IMO is not that good for solving a lot of problems. Richard Ngo and John Maxwell illustrate it here:
https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown#g4BJEqLdvzgsjngX2
https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown#3GqvMFTNdCqgfRNKZ
We have in fact reverse engineered alien datastructures.
I'm trying to keep up a regular schedule of writing for my Nanowrimo project. I'm working on a post that's not done yet, so for today I'll write a quick (and not very high-effort) take related to discussions I've had with more doomy friends (I'm looking at you, Kaarel Hänni). A good (though incomplete) crux on why something like "faithful interpretability" may be hard (faithful here meaning "good enough to genuinely decompose enough of the internal thought process to notice and avoid deception") is contained in Tsvi's post on "alien datastructures". I think it's a great piece, though I probably agree with only about 50% of it. A very reductive summary of the core parable in the piece: we imagine we've found an ancient alien computer from a civilization with a totally separate parallel development of math and logic and computing. The protagonist of the koan lectures on why it might be hard to "do science" to reverse engineer what the computer does, and why it might be better to "do deep thinking" to try to understand what problems the computer is trying to solve, and then try to come up with the right abstractions. In the case of AI, the recommendation is that this should in particular be done by introspecting and trying to understand our own thinking at a high level. There's a great quote that I'll include because it sounds nice:
Go off and think well——morally, effectively, funly, cooperatively, creatively, agentically, truth-trackingly, understandingly——and observe this thinking——and investigate/modify/design this thinking——and derive principles of mind that explain the core workhorses of the impressive things we do including self-reprogramming, and that explain what determines our values and how we continue caring across ontology shifts, and that continue to apply across mental change and across the human-AGI gap; where those principles of mind are made of ideas that are revealed by the counterfactual structure of possible ways of thinking revealed by our interventions on our thinking, like how car parts make more sense after you take them out and replace them with other analogous parts.
I don’t want to debate the claims of the piece, since after all it’s a koan and is meant to be a parable and a vibe. But I think that as a parallel parable, it’s useful to know that humanity has actually reverse engineered a massively alien computer, and we did it extremely well, and via a combination of “science” (i.e., iteratively designing and testing and refining models) and high-level thinking, though with more of an emphasis on science.
The alien computer in this analogy is physics. It's hard to overstate how different physics is from what people in the past thought it would be. All our past ideas of reality have been challenged: waves aren't really waves, particles aren't really particles, time isn't really time, space isn't really (just) space. I think that a lot of people who have learned about math or physics, but don't have experience doing research in it, have a romantic idea that it's driven by "deep insight". That physics sits around waiting for the next Newton or Einstein to get a new idea, and then adapts to that idea, with the experimentalists there to confirm the great theoretician's theories, and the "non-great" rank and file there to compute out their consequences. And like sure, this kind of flow from theorists out is one thing that exists, but it's mostly wrong. Mostly the way we understand the wild, alien structure that is physics is through just, well, doing science, which involves people tweaking old paradigms, gathering data, finding things that don't make sense, and then folding the new information back into the big structure. The key point here (which physicists understand viscerally) is that there's not a big, concept-centric order dependence to discovery. You can understand quantum mechanics first, you can understand relativity first, you can start with waves or particles; in the end, if you are serious about refining and testing your intuitions, and you do calculations to get stuff that increasingly makes more sense, you are on the right track.
A fun example here comes from the discovery of Heisenberg's "matrix" model of quantum mechanics. One could imagine a romantic Eureka-moment picture of people suddenly grasping the indeterminacy of reality after thinking deeply about experiments that don't make sense. But the reality was that from Max Planck's first idea of quanta of energy in 1900 through 1925, people used measurements and arbitrary combinatorics that seemed to mostly agree with experiment to cobble together a weird theory of allowed and disallowed orbits (which the next generation called "old quantum theory") that everyone admitted didn't make sense and fell apart on deep consideration, but sorta worked. After a significant amount of progress was made in this way, Heisenberg and Sommerfeld took a two-parameter table of numbers that went into this theory, made a matrix out of it, and then (in my somewhat cartoonish understanding) noticed that this matrix could be used to measure some other phenomena better. Then Heisenberg realized that just by looking at the matrix, making it complex and viewing it as an evolution operator, you could get a unifying and more satisfying explanation of many of the disparate phenomena in this old theory; it would then take more time for people to develop the more modern concepts of measurement, decoherence, and many worlds (the last of which still meets resistance among the old guard today).
The point is that physics doesn't (only) grow through great insight. This might seem surprising to someone who grew up in a certain purist philosophy of logic that, I think incorrectly, has been cultivated on LessWrong: roughly, that ideas are either right, or useless. In fact, the experience of a physicist is that sloppy stopgap ideas that explain experiment or theory in a slightly more systematic way are often useful, though for reasons that will often elude the sloppy idea-haver (physics has a perennial problem of the developers of sloppy ideas getting overinvested in their "exact correctness", and being resistant to future more sensible systematization—but this is a different story). In physics ideas are deep, but they're weirdly un-path-dependent. There are many examples of people arriving at the same place by different sloppy and unsystematic routes (the Feynman path integral vs. other interpretations of quantum field theory being a great example). This uncanny ability to make progress by taking stopgap measures and then slowly refining them surprises physicists themselves, and I think that some very cool physics "ideology" comes from trying to make sense of this. Two meta-explanations that I think were in part a response to physicists trying to figure out "what the hell is going on", but led to physics that is used today (especially in the first case) are:
effective field theory and the renormalization group flow, coming from ideas of Wilson about how theories at one energy scale can be effectively replaced by simpler theories at other scales (and relating to ideas of Landau and others about complicated concrete theories self-correcting to converge to certain elegant but more abstract "universal" theories at larger scales)
the “bootstrap” idea that complicated theories with unknown moving parts have measurements that satisfy certain standard formulas and inequalities, and one can often get quite far (indeed sometimes, all the way) by ignoring the “physics” and only looking at the formulas as a formal system.
Both of these physics ideas identify a certain shape of “emergent structure”, where getting at something with even a very small amount of “orientational push” from reality, or from known sloppy structure, will tend to lead to the same “correct” theory at scales we care about.
This doesn’t mean that big ideas and deep thinking are useless. But I think that this does point (at least in physics) to taking more seriously ideas that “aren’t fully self-consistent yet” from a more demanding mathematical standpoint. We can play random fun measurement and calculation and systematization games, and tinker (“morally, effectively, funly, cooperatively, creatively, agentically, truth-trackingly, understandingly”) with systematizations of the results of these games, and we’re likely to get there in the end.
At least if the alien computer is physics.
For later. In my posts I’m hoping to talk in more detail about renormalization and the self-correcting nature of universality, and also about why I think that the complexity of (certain areas of) physics is a good match for the complexity of neural nets, and why I think the modern field of interpretability and “theoretical ML” more generally is much further along this game of tinkering and systematizing than many people think (in particular, in some ways we’re beyond the “Old Quantum Theory”-style mess). But this is it for now.
As Sean Carroll likes to say, though, the reason we've made so much progress in physics is that it's way easier than the other sciences :)
I'd argue that the easiness of physics probably comes from the fact that we can get effectively unlimited data, combined with the ability to query our reality as an oracle to test certain ideas and, importantly, to get easy verification of a theory, which helps in two ways:
The prior matters very little, because you can update to the right theory from all but the most dogmatic priors.
Easy verification short-circuits a lot of philosophical debates, and makes it easy to update towards correct theories.
However, I think the main insight, that sloppy but directionally correct ideas are useful to build upon and that partial progress matters, is a very important one with applicability well beyond physics.
This makes sense, but I'd argue that ML and interpretability have even more of both of these properties. Something that makes it harder is that some of the high-level goals of understanding transformers are inherently pretty complex, and it's also less susceptible to math/elegance-based analysis, so it's even messier :)
I think the relative ease of progress in physics has more to do with its relative compositionality, in contrast to other disciplines like biology, economics, or the theory of differential equations, in the sense Jules Hedges meant it. To quote that essay:
For examples of non-compositional systems, we look to nature. Generally speaking, the reductionist methodology of science has difficulty with biology, where an understanding of one scale often does not translate to an understanding on a larger scale. … For example, the behaviour of neurons is well-understood, but groups of neurons are not. Similarly in genetics, individual genes can interact in complex ways that block understanding of genomes at a larger scale.
Such behaviour is not confined to biology, though. It is also present in economics: two well-understood markets can interact in complex and unexpected ways. Consider a simple but already important example from game theory. The behaviour of an individual player is fully understood: they choose in a way that maximises their utility. Put two such players together, however, and there are already problems with equilibrium selection, where the actual physical behaviour of the system is very hard to predict.
More generally, I claim that the opposite of compositionality is emergent effects. The common definition of emergence is a system being ‘more than the sum of its parts’, and so it is easy to see that such a system cannot be understood only in terms of its parts, i.e. it is not compositional. Moreover I claim that non-compositionality is a barrier to scientific understanding, because it breaks the reductionist methodology of always dividing a system into smaller components and translating explanations into lower levels.
More specifically, I claim that compositionality is strictly necessary for working at scale. In a non-compositional setting, a technique for a solving a problem may be of no use whatsoever for solving the problem one order of magnitude larger. To demonstrate that this worst case scenario can actually happen, consider the theory of differential equations: a technique that is known to be effective for some class of equations will usually be of no use for equations removed from that class by even a small modification. In some sense, differential equations is the ultimate non-compositional theory.
Minor nit: The alien computer is a specific set of physical laws, which shouldn’t be confused with the general case of physics/mathematics, so we only managed to reverse engineer it for 1 universe.
Cute :). Do you mean that we’ve only engineered the alien computer running a single program (the standard model with our universe’s particular coupling constants), or something else?
Yes, I was talking about this:
Do you mean that we've only engineered the alien computer running a single program (the standard model with our universe's particular coupling constants)
but more importantly I was focused on the fact that all plausible future efforts will only reverse engineer the alien computer that runs a single program, which is essentially the analogue of the laws of physics for our physical universe.
On the surprising effectiveness of linear regression as a toy model of generalization.
Another shortform today (since Sunday is the day of rest). This time it’s really a hot take: I’m not confident about the model described here being correct.
Neural networks aren't linear—that's the whole point. They notice interesting, compositional, deep information about reality. So when people use linear regression as a qualitative comparison point for behaviors like generalization and learning, I tend to get suspicious. Nevertheless, the track record of linear regression as a model for "qualitative" asymptotic behaviors is hard to deny. Linear regression models (neatly analyzable using random matrix theory) give surprisingly accurate models of double descent, scaling phenomena, etc. (at least when compared to relatively shallow networks trained on tasks like MNIST or modular addition).
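To make the "plain linear regression already shows this" claim concrete, here is a minimal numpy sketch (my own illustration, not from any particular paper): the test error of the minimum-norm least-squares fit spikes near the interpolation threshold d ≈ n and then drops again as the model becomes heavily overparameterized, i.e. the double-descent shape.

```python
# Minimal double-descent demo with plain linear regression (illustrative sketch only).
# Sweep the feature dimension d past the number of training points n_train and watch
# the test error of the minimum-norm least-squares fit peak near d = n_train.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, noise = 100, 2000, 0.5

for d in [10, 50, 90, 100, 110, 200, 400, 1000]:
    w_true = rng.normal(size=d) / np.sqrt(d)            # fixed-scale "true" signal
    X = rng.normal(size=(n_train, d))
    X_test = rng.normal(size=(n_test, d))
    y = X @ w_true + noise * rng.normal(size=n_train)
    y_test = X_test @ w_true + noise * rng.normal(size=n_test)

    w_hat = np.linalg.pinv(X) @ y                        # minimum-norm least-squares solution
    test_mse = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"d = {d:5d}   d/n = {d / n_train:5.2f}   test MSE = {test_mse:8.3f}")
```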
I recently developed a cartoon internal model for why this may be the case. I’m not sure if it’s correct, but I’ll share it here.
The model assumes a few facts about algorithms implemented by NN’s (all of which I believe in much more strongly than the model of comparing them to linear regression):
The generalized Linear Representation Hypothesis. An NN's internal computation can be locally factorized into extracting a large collection of low-level features living in distinct low-dimensional linear subspaces, and then applying (generally nonlinear) postprocessing to these features independently or in small batches. Note that this is much weaker than stronger versions (such as the one inherent in SAEs) that posit 1-dimensional features. In my experience a version of this hypothesis is almost universally believed by engineers, and it also agrees with all the toy algorithms discovered so far.
Smoothness-ish of the data manifold. Inside the low-dimensional “feature subspace”, the data is kinda smooth—i.e., it’s smooth (i.e., locally approximately linear) in most directions, and in directions where it’s not smooth, it still might behave sorta smoothly in aggregate.
Linearity-ish of the classification signal. Even in cases like MNIST or transformer learning where the training data is discrete (and the algorithm is meant to approximate it by a continuous function), there is a sense in which it's locally well-approximated by a linear function. E.g. perhaps some coarse-graining of the discrete data is continuously linear, or at least the data boundary can be locally well approximated by a linear hyperplane (so that a local linear function can attain 100% accuracy). More generally, we can assume a similar local linearity property on the layer-to-layer forward functions, when restricted to either a single feature space or a small group of interacting feature spaces.
(At least partial) locality of the effect of weight modification. When I read it, this paper left a lasting impression on me. I'm actually not super excited about its main claims (I'll discuss polytopes later), but a very cool piece of the analysis here is locally modelling ReLU learning as building a convex function as a max of linear functions (and explaining why non-ReLU learning should exhibit a softer version of the same behavior). This is a somewhat "shallow" point of view on learning, but probably captures a nontrivial part of what's going on, and this predicts that every new weight update only has local effect—i.e., is felt in a significant way only by a small number of datapoints (the idea being that if you're defining a convex function as the max of a bunch of linear functions, shifting one of the linear functions will only change the values in places where this particular linear function was dominant). The way I think about this phenomenon is that it's a good model for "local learning", i.e., learning closer to memorization on the memorization-generalization spectrum that only updates the behavior on a small cluster of similar datapoints (e.g. the LLM circuit that completes "Barack" with "Obama"). There are possibly also more diffuse phenomena (like "understanding logic", or other forms of grokking "overarching structure"), but most likely both forms of learning occur (and it's more like a spectrum than a dichotomy). (A toy numerical sketch of this locality claim follows just after this list.)
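Point 4 is easy to poke at numerically. Here is a toy sketch (my own, not taken from the cited paper) of the locality claim: if the local picture is "take a max of affine pieces", then nudging the parameters of a single piece only moves the function on the inputs where that piece is (or becomes) the maximizer, i.e. roughly on the small cluster of points it was already responsible for.

```python
# Toy illustration of point 4: updating one affine piece of a max-of-affine (convex,
# piecewise-linear) function only changes the output where that piece is dominant.
import numpy as np

rng = np.random.default_rng(1)
n_pieces, n_points = 8, 5000

slopes = rng.normal(size=n_pieces)
intercepts = rng.normal(size=n_pieces)
x = np.linspace(-3.0, 3.0, n_points)

def max_affine(a, b, x):
    # f(x) = max_i (a_i * x + b_i)
    return np.max(a[:, None] * x[None, :] + b[:, None], axis=0)

before = max_affine(slopes, intercepts, x)

k = int(np.argmax(intercepts))        # a piece that is certainly dominant somewhere (at x = 0)
slopes2, intercepts2 = slopes.copy(), intercepts.copy()
slopes2[k] += 0.2                     # pretend a single gradient step touched only piece k
intercepts2[k] += 0.2
after = max_affine(slopes2, intercepts2, x)

changed = np.abs(after - before) > 1e-12
was_dominant = np.argmax(slopes[:, None] * x[None, :] + intercepts[:, None], axis=0) == k

print(f"inputs whose output moved:            {changed.mean():6.2%}")
print(f"inputs where piece {k} was the argmax: {was_dominant.mean():6.2%}")
# The two sets agree up to a thin boundary band: the update is only felt "locally".
```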
If we buy that these four phenomena occur, or even “occur sometimes” in a way relevant for learning, then it naturally follows that a part of the “shape” of learning and generalization is well described qualitatively by linear regression. Indeed, the model then becomes that (by point 4 above), many weight updates exclusively “focus on a single local batch of input points” in some low-dimensional feature manifold. For this particular weight update, locality of the update and smoothness of the data manifold (#2) together imply that we can model it as learning a function on a linear low-dimensional space (since smooth manifolds are locally well-approximated by a linear space). Finally, local linearity of the classification function (#3) implies that we’re learning a locally linear function on this local batch of datapoints. Thus we see that, under this collection of assumptions, the local learning subproblems essentially boil down to linear regression.
Note that the "low-dimensional feature space" assumption, #1, is necessary for any of this to even make sense. Without making this assumption, the whole picture is a non-starter and the other assumptions, #2-#4, don't make sense, since a sub-exponentially large collection of points on a high-dimensional data manifold with any degree of randomness (something that is true about the data samples in any nontrivial learning problem) will be very far away from each other, and the notion of "locality" becomes meaningless. (Note also that a weaker hypothesis than #1 would suffice—in particular, it's enough that there are low-dimensional "feature mappings" where some clustering occurs at some layer, and these don't a priori have to be linear.)
What is this model predicting? Generally I think abstract models like this aren’t very interesting until they make a falsifiable prediction or at least lead to some qualitative update on the behavior of NN’s. I haven’t thought about this very much, and would be excited if others have better ideas or can think of reasons why this model is incorrect. But one thing this model likely predicts is that a better model for a NN than a single linear regression model is a collection of qualitatively different linear regression models at different levels of granularity. In other words, depending on how sloppily you chop your data manifold up into feature subspaces, and how strongly you use the “locality” magnifying glass on each subspace, you’ll get a collection of different linear regression behaviors; you then predict that at every level of granularity, you will observe some combination of linear and nonlinear learning behaviors.
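As a toy version of this prediction (my own sketch, on made-up data rather than a real network), one can chop a nonlinear regression problem into k regions and fit an ordinary linear model inside each one; sweeping the granularity k gives exactly a "collection of linear regression models at different levels of granularity", with the fraction of the signal explained growing as the pieces shrink.

```python
# Toy "local linear models at different granularities" experiment (illustrative only).
# A smooth but nonlinear target on a low-dimensional feature space is fit by k-means
# regions, each with its own ordinary linear regression; finer granularity fits better.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(-2.0, 2.0, size=(5000, 2))             # stand-in low-dimensional feature space
y = np.sin(3.0 * X[:, 0]) + np.tanh(2.0 * X[:, 1])     # smooth but nonlinear target

for k in [1, 4, 16, 64]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    pred = np.empty_like(y)
    for c in range(k):
        mask = labels == c
        pred[mask] = LinearRegression().fit(X[mask], y[mask]).predict(X[mask])
    r2 = 1.0 - np.mean((y - pred) ** 2) / np.var(y)
    print(f"{k:3d} local linear fits -> in-sample R^2 = {r2:.3f}")
```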
This point of view makes me excited about work by Ari Brill (that, as far as I know, is unpublished—I heard a talk on it at the ILIAD conference—see the Saturday schedule, first talk in Bayes). If I understood the talk correctly, he models a data manifold as a certain stochastic fractal in a low-dimensional space and makes scaling predictions about generalization behavior depending on properties of the fractal, by thinking of the fractal as a hierarchy of smooth but noisy features. Finding similarly-flavored scaling behavior in "linear regression subphenomena" in a real-life machine learning problem would positively update me on my model above being correct.
But one thing this model likely predicts is that a better model for a NN than a single linear regression model is a collection of qualitatively different linear regression models at different levels of granularity. In other words, depending on how sloppily you chop your data manifold up into feature subspaces, and how strongly you use the “locality” magnifying glass on each subspace, you’ll get a collection of different linear regression behaviors; you then predict that at every level of granularity, you will observe some combination of linear and nonlinear learning behaviors.
Epic.
A couple things that come to mind.
Linear features = sufficient statistics of exponential families?
The simplest case is that of Gaussians and their covariance matrix (which comes down to linear regression).
Exponential families are a fairly good class, but not closed under hierarchical structure. A basic example: a mixture of Gaussians is not exponential, i.e. not described in terms of just linear regression.
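For readers who want the identification spelled out, here is the standard definition being gestured at (my paraphrase, not the commenter's): an exponential family exposes the data to the log-likelihood only through the sufficient statistic T(x), which enters linearly in the natural parameter; for a Gaussian, T(x) consists of the first and second moments, so conditioning one coordinate on the rest recovers ordinary linear regression, while a Gaussian mixture has no finite-dimensional sufficient statistic and falls outside the class.

```latex
% Exponential family: the data x enters the log-likelihood only through the
% sufficient statistic T(x), linearly in the natural parameter \theta.
\[
  p(x \mid \theta) \;=\; h(x)\,\exp\!\big(\theta^{\top} T(x) - A(\theta)\big),
  \qquad
  \text{Gaussian: } T(x) = \big(x,\; x x^{\top}\big).
\]
% A two-component Gaussian mixture is not itself an exponential family
% (the "not closed under hierarchical structure" point above):
\[
  p(x) \;=\; \pi\,\mathcal{N}(x;\mu_1,\Sigma_1) \;+\; (1-\pi)\,\mathcal{N}(x;\mu_2,\Sigma_2).
\]
```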
The centrality of ReLU neural networks.
Understanding ReLU neural networks is probably 80-90% of understanding NN architectures. At sufficient scale, pure MLPs have the same or better scaling laws than transformers.
There are several lines of evidence that gradient descent has an inherent bias towards splines / piecewise-linear functions / tropical polynomials; see e.g. here and references therein.
Serious analysis of ReLU neural networks can be done through tropical methods. A key paper is here. You say: "very cool piece of the analysis here is locally modelling ReLU learning as building a convex function as a max of linear functions (and explaining why non-ReLU learning should exhibit a softer version of the same behavior). This is a somewhat "shallow" point of view on learning, but probably captures a nontrivial part of what's going on, and this predicts that every new weight update only has local effect—i.e., is felt in a significant way only by a small number of datapoints (the idea being that if you're defining a convex function as the max of a bunch of linear functions, shifting one of the linear functions will only change the values in places where this particular linear function was dominant). The way I think about this phenomenon is that it's a good model for "local learning", i.e., learning closer to memorization on the memorization-generalization spectrum that only updates the behavior on a small cluster of similar datapoints (e.g. the LLM circuit that completes "Barack" with "Obama")." I suspect the notions one should be looking at are the activation polytope and activation fan in section 5 of the paper. The hypothesis would be something about efficiently learnable features having a 'locality' constraint on these activation polytopes, i.e. they are 'small', 'active on only a few data points'.
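To unpack the tropical framing a bit (standard background stated from memory, not a summary of the linked paper): a ReLU network computes a continuous piecewise-linear function, and any such function can be written as a difference of two convex max-affine functions, i.e. as a tropical rational function. The activation-polytope locality hypothesis above then says, roughly, that an efficiently learnable feature should involve only a few of these affine pieces, each active on only a few data points.

```latex
% A ReLU network is piecewise linear, and every continuous piecewise-linear
% f : R^n -> R admits a "difference of max-affine" (tropical rational) form:
\[
  f(x) \;=\; \max_{i \in I}\big(a_i^{\top} x + b_i\big)
        \;-\; \max_{j \in J}\big(c_j^{\top} x + d_j\big).
\]
% Each input x selects one active affine piece from each max; locality says a
% single feature or weight update should touch only a few of these pieces.
```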
Alignment is not all you need. But that doesn’t mean you don’t need alignment.
One of the fairytales I remember reading from my childhood is the “Three sillies”. The story is about a farmer encountering three episodes of human silliness, but it’s set in one more frame story of silliness: his wife is despondent because there is an axe hanging in their cottage, and she thinks that if they have a son, he will walk underneath the axe and it will fall on his head.
The frame story was much more memorable to me than any of the “body” stories, and I randomly remember this story much more often than any other fairytale I read at the age I read fairytales. I think the reason for this is that the “hanging axe” worry is a vibe very familiar from my family and friend circle, and more generally a particular kind of intellectual neuroticism that I encounter all the time, that is terrified of incomplete control or understanding.
I really like the rationalist/EA ecosphere because of its emphasis on the solvability of problems like this: noticing situations where you can just approach the problem, taking down the axe. However, a baseline of intellectual neuroticism persists (after all you wouldn’t expect otherwise from a group of people who pull smoke alarms on pandemics and existential threats that others don’t notice). Sometimes it’s harmless or even beneficial. But a kind of neuroticism in the community that bothers me, and seems counterproductive, is a certain “do it perfectly or you’re screwed” perfectionism that pervades a lot of discussions. (This is also familiar to me from my time as a mathematician: I’ve had discussions with very intelligent and pragmatic friends who rejected even the most basic experimentally confirmed facts of physics because “they aren’t rigorously proven”.)
A particular train of discussion that annoyed me in this vein was the series of responses to Raemon’s “preplanning and baba is you” post. The initial post I think makes a nice point—it suggests as an experiment trying to solve levels of a move-based logic game by pre-planning every step in advance, and points out that this is hard. Various people tried this experiment and found that it’s hard. This was presented as an issue in solving alignment, in worlds where “we get one shot”. But what annoyed me was the takeaway.
I think a lot of the great things about the intellectual vibe in the (extended) LW and EA communities is that “you have more ways to solve problems than you think”. However, there is a particular kind of virtue-signally class of problems where trying to find shortcuts or alternatives is frowned upon and the only accepted form of approach is “trying harder” (another generalized intellectual current in the LW-osphere that I strongly dislike).
Back to the “Baba is you” experiment. The best takeaway, I think, is that we should avoid situations where we need to solve complex problems in one shot, and we should work towards making sure this situation doesn’t exist (and we should just give up on trying to make progress in worlds where we get absolutely no new insights before the do-or-die step of making AGI). Doing so, at least without superhuman assistance, is basically impossible. Attempts at this tend to be not only silly but counterproductive: the “graveyard” of failed idealistic movements are chock-full of wannabe Hari Seldons who believe that they have found the “perfect solution”, and are willing to sacrifice everything to realize their grand vision.
This doesn’t mean we need to give up, or only work on unambitious, practical applications. But it does mean that we have to admit that things can be useful to work on in expectation before we have a “complete story for how they save the world”.
Note that what is being advocated here is not an “anything goes” mentality. I certainly think that AI safety research can be too abstract, too removed from any realistic application in any world. But there is a large spectrum of possibilities between “fully plan how you will solve a complex logic game before trying anything” and “make random jerky moves because they ‘feel right’”.
I’m writing this in response to Adam Jones’ article on AI safety content.. I like a lot of the suggestions. But I think the section on alignment plans suffers from the “axe” fallacy that I claim is somewhat endemic here. Here’s the relevant quote:
I strongly disagree with this being a good thing to do!
We’re not going to have a good, end-to-end plan about how to save the world from AGI. Even now, with ever more impressive and scary AIs becoming a comonplace, we have very little idea about what AGI will look like, what kinds of misalignment it will have, where the hard bits of checking it for intent and value alignment will be. Trying to make extensive end-to-end plans can be useful, but can also lead to a strong streetlight effect: we’ll be overcommitting to current understanding, current frames of thought (in an alignment community that is growing and integrating new ideas with an exponential rate that can be factored in months, not years).
Don’t get me wrong. I think it’s valuable to try to plan things where our current understanding is likely to at least partially persist: how AI will interface with government, general questions of scaling and rough models of future development. But we should also understand that our map has lots of blanks, especially when we get down to thinking about what we will understand in the future. What kinds of worrying behaviors will turn out to be relevant and which ones will be silly in retrospect? What kinds of guarantees and theoretical foundations will our understanding of AI encompass? We really don’t know, and trying to chart a course through only the parts of the map that are currently filled out is an extremely limited way of looking at things.
So instead of trying to solve the alignment problem end to end what I think we should be doing is:
getting a variety of good, rough frames on how the future of AI might go
thinking about how these will integrate with human systems like government, industry, etc.
understanding more things, to build better models in the future.
I think the last point is crucial, and should be what modern alignment and interpretability is focused on. We really do understand a lot more about AI than we did a few years ago (I’m planning a post on this). And we’ll understand more still. But we don’t know what this understanding will be. We don’t know how it will integrate with existing and emergent actors and incentives. So instead of trying to one-shot the game and write an ab initio plan for how work on quantifying creativity in generative vision models will lead to the world being saved, I think there is a lot of room to just do good research. Fill in the blank patches on that map before routing a definitive course on it. Sure, maybe don’t waste time on the patches in the far corners which are too abstract or speculative or involve too much backchaining. But also don’t try to predict all the axes that will be on the wall in the future before looking more carefully at a specific, potentially interesting, axe.
FYI I think by the time I wrote Optimistic Assumptions, Longterm Planning, and “Cope”, I think I had updated on the things you criticize about it here (but, I had started writing it awhile ago from a different frame and there is something disjointed about it)
But, like, I did mean both halfs of this seriously:
...
Re:
I think in some sense I agree with you – the actual real plans won’t be end-to-end. And I think I agree with you about some kind of neuroticism that unhelpfully bleeds through a lot of rationalist work. (Maybe in particular: actual real solutions to things tend to be a lot messier than the beautiful code/math/coordination-frameworks an autistic idealist dreams up)
But, there’s still something like “plans are worthless, but planning is essential.” I think you should aim for the standard of “you have a clear story for how your plan fits into something that solves the hard parts of the problem.” (or, we need way more people doing that sort of thing, since most people aren’t really doing it at all)
Some ways that I think about End to End Planning (and, metastrategy more generally)
Because there are multiple failure modes, I treat myself as having multiple constraints I have to satisfy:
My plans should backchain from solving the key problems I think we ultimately need to solve
My plans should forward chain through tractable near goals with at least okay-ish feedback loops. (If the okay-ish feedback loops don’t exist yet, try to be inventing them. Although don’t follow that off a cliff either – I was intigued by Wentworth’s recent note that overly focusing on feedback loops led him to predictably waste some time)
Ship something to external people, fairly regularly
Be Wholesome (that is to say, when I look at the whole of what I’m doing it feels healthy, not like I’ve accidentally min-maxed my way into some brittle extreme corner of optimization space)
And for end to end planning, have a plan for...
up through the end of my current OODA loop
(maybe up through a second OODA loop if I have a strong guess for how the first OODA loop goes)
as concrete a plan as I can, assuming no major updates from the first OODA loop, up through the end of the agenda.
as concrete a visualization of the followup steps after my plan ends, for how it goes on to positively impact the world.
End to End plans don’t mean you don’t need to find better feedbackloops or pivot. You should plan that into the plan (And also expect to be surprised about it anyway). But, I think if you don’t concretely visualize how it fits together you’re like to go down some predictably wasteful paths.
Something like this is a big reason why I’m not a fan of MIRI, because I think this sort of neuroticism is at somewhat encouraged by that group.
Also, remember that the current LW community is selected for scrupulosity and neuroticism, which IMO is not that good for solving a lot of problems:
Richard Ngo and John Maxwell illustrate it here:
https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown#g4BJEqLdvzgsjngX2
https://www.lesswrong.com/posts/uM6mENiJi2pNPpdnC/takeaways-from-one-year-of-lockdown#3GqvMFTNdCqgfRNKZ
We have in fact reverse engineered alien datastructures.
I’m trying to keep up a regular schedule of writing for my Nanowrimo project. I’m working on a post that’s not done yet, so for today I’ll write a quick (and not very high-effort) take related to discussions I’ve had with more doomy friends (I’m looking at you Kaarel Hänni). A good (though incomplete) crux on why something like “faithful interpretability” may be hard (faithful here being “good enough to genuinely decompose enough of the internal thought process to notice and avoid deception”) is contained in Tsvi’s post on “alien datastructures”. I think it’s a great piece, though I largely probably agree with it only about 50%. A very reductive explanation of the core parable in the piece is that we imagine we’ve found an ancient alien computer from a civilization with a totally separate parallel development of math and logic and computing. The protagonist of the koan lectures on why it might be hard to “do science” to reverse engineer what the computer does, and it might be better to “do deep thinking” to try to understand what problems the computer is trying to accomplish, and then try to come up with the right abstractions. In the case of AI, the recommendation is that this should in particular be done by introspecting and trying to understand our own thinking at a high level. There’s a great quote that I’ll include because it sounds nice:
I don’t want to debate the claims of the piece, since after all it’s a koan and is meant to be a parable and a vibe. But I think that as a parallel parable, it’s useful to know that humanity has actually reverse engineered a massively alien computer, and we did it extremely well, and via a combination of “science” (i.e., iteratively designing and testing and refining models) and high-level thinking, though with more of an emphasis on science.
The alien computer in this analogy is physics. It’s hard to emphasize how different physics is from what people in the past thought it would be. All our past ideas of reality have been challenged: waves aren’t really waves, particles aren’t really particles, time isn’t really time, space isn’t really (just) space. I think that a lot of people who have learned about math or physics, but don’t have experience doing research in it, have a romantic idea that it’s driven by “deep insight”. That physics sits around waiting for the next Newton or Einstein to get a new idea, and then adapts to that idea with the experimentalists there to confirm the great theoreticist’s theories, and the “non-great” rank and file there to compute out their consequences. And like sure, this kind of flow from theorists out is one thing that exists, but it’s mostly wrong. Mostly the way we understand the wild, alien structure that is physics is through just, well, doing science, which involves people tweaking old paradigms, gathering data, finding things that don’t make sense, and then folding the new information back into the big structure. The key point here (that physicists understand viscerally) is that there’s not a big, concept-centric order dependence to discovery. You can understand quantum mechanics first, you can understand relativity first, you can start with waves or particles; in the end, if you are serious about refining and testing your intuitions, and you do calculations to get stuff that increasingly makes more sense, you are on the right track.
A fun example here comes from the discovery of Heisenberg’s “matrix” model of quantum mechanics. One could imagine a romantic Eureka-moment picture of people suddenly grasping the indeterminacy of reality after thinking deeply about experiments that don’t make sense. But the reality was that after Max Planck’s first idea of quanta of energy levels in 1900 through 1925, people used measurements and arbitrary combinatorics that seemed to mostly agree with experiment to cobble together a weird theory (that the next generation called “Old quantum theory”) of allowed and disallowed orbits that everyone admitted didn’t make sense and fell apart on deep consideration, but sorta worked. After a significant amount of progress was made in this way, Heisenberg and Sommerfeld took a two-parameter table of numbers that went into this theory, made a matrix out of it, and then (in my somewhat cartoonish understanding) noticed that this matrix could be used to measure some other phenomena better. Then Heisenberg realized that just by looking at the matrix, making it complex and viewing it as an evolution operator, you could get a unifying and more satisfying explanation of many of the disparate phenomena in this old theory; it would then take more time for people to develop the more modern concepts of measurement, decoherence, and many worlds (the last of which still meets resistance among the old guard today).
The point is that physics doesn’t (only) grow through great insight. This might seem surprising to someone who grew up in a certain purist philosophy of logic that, I think incorrectly, has been cultivated on lesswrong: roughly, that ideas are either right, or useless. In fact, the experience of a physicist is that sloppy stopgap ideas that explain experiment or theory in a slightly more systematic way are often useful, though for reasons that will often elude the sloppy idea-haver (physics has a perennial problem of the developers of sloppy ideas getting overinvested in their “exact correctness”, and being resistant to future more sensible systematization—but this is a different story). In physics ideas are deep, but they’re weirdly un-path dependent. There are many examples of people arriving at the same place by different sloppy and unsystematic routes (the Feynman path integral vs. other interpretations of quantum field theory being a great example). This uncanny ability to make progress by taking stopgap measures and then slowly refining them surprises physicists themselves, and I think that some very cool physics “ideology” comes from trying to make sense of this. Two meta-explanations that I think were in part a response to physicists trying to figure out “what the hell is going on”, but led to physics that is used today (especially in the first case) are:
effective field theory and the renormalization group flow, coming from ideas of Wilson about how theories at different energy levels can be effectively replaced by simpler theories at other layers (and relating to ideas of Landau and others of complicated concrete theories self-correcting to converge to certain elegant but more abstract “universal” theories at larger scale)
the “bootstrap” idea that complicated theories with unknown moving parts have measurements that satisfy certain standard formulas and inequalities, and one can often get quite far (indeed sometimes, all the way) by ignoring the “physics” and only looking at the formulas as a formal system.
Both of these physics ideas identify a certain shape of “emergent structure”, where getting at something with even a very small amount of “orientational push” from reality, or from known sloppy structure, will tend to lead to the same “correct” theory at scales we care about.
This doesn’t mean that big ideas and deep thinking are useless. But I think that this does point (at least in physics) to taking more seriously ideas that “aren’t fully self-consistent yet” from a more demanding mathematical standpoint. We can play random fun measurement and calculation and systematization games, and tinker (“morally, effectively, funly, cooperatively, creatively, agentically, truth-trackingly, understandingly”) with systematizations of the results of these games, and we’re likely to get there in the end.
At least if the alien computer is physics.
For later. In my posts I’m hoping to talk in more detail about renormalization and the self-correcting nature of universality, and also about why I think that the complexity of (certain areas of) physics is a good match for the complexity of neural nets, and why I think the modern field of interpretability and “theoretical ML” more generally is much further along this game of tinkering and systematizing than many people think (in particular, in some ways we’re beyond the “Old Quantum Theory”-style mess). But this is it for now.
As Sean Carroll likes to say, though, the reason we’ve made so much progress in physics is that it’s way easier than the other sciences :)
I’d argue that the easiness of physics probably comes from the fact that we can get effectively unlimited data, combined with the ability to query our reality as an oracle to test certain ideas and importantly get easy verification of a theory, which helps in 2 ways:
The prior matters very little, because you can update to the right theory from almost all but the most dogmatic priors.
Verifying being easy shortcuts a lot of philosophical debates, and makes it easy to update towards correct theories.
However, I think the main insight that sloppy but directionally correct ideas being useful to build upon, combined with partial progress being important is a very important idea that has applicability beyond physics.
This makes sense, but I’d argue that ML and interpretability has even more of both of these properties. Something that makes it harder is that some of the high-level goals of understanding transformers are inherently pretty complex, and also it’s less susceptible to math/ elegance-based analysis, so is even more messy :)
I think what explains the relative ease of progress in physics has more so to do with its relative compositionality in contrast to other disciplines like biology or economics or the theory of differential equations, in the sense Jules Hedges meant it. To quote that essay:
Minor nit: The alien computer is a specific set of physical laws, which shouldn’t be confused with the general case of physics/mathematics, so we only managed to reverse engineer it for 1 universe.
Cute :). Do you mean that we’ve only engineered the alien computer running a single program (the standard model with our universe’s particular coupling constants), or something else?
Yes, I was talking about this:
but more importantly I was focused on the fact that all plausible future efforts will only reverse engineered the alien computer that runs a single program, which is essentially the analogue of the laws of physics for our physical universe.
On the surprising effectiveness of linear regression as a toy model of generalization.
Another shortform today (since Sunday is the day of rest). This time it’s really a hot take: I’m not confident about the model described here being correct.
Neural networks aren’t linear—that’s the whole point. They notice interesting, compositional, deep information about reality. So when people use linear regression as a qualitative comparison point for behaviors like generalization and learning, I tend to get suspicious. Nevertheless, the track record of linear regression as a model for “qualitative” asymptotic behaviors is hard to deny. Linear regression models (neatly analyzable using random matrix theory) give surprisingly accurate models of double descent, scaling phenomena, etc. (at least when comparing to relatively shallow networks like mnist or modular addition).
I recently developed a cartoon internal model for why this may be the case. I’m not sure if it’s correct, but I’ll share it here.
The model assumes a few facts about the algorithms implemented by NNs (all of which I believe much more strongly than the comparison to linear regression itself):
The generalized Linear Representation Hypothesis. An NN’s internal workings can be locally factorized into a large collection of low-level features living in distinct low-dimensional linear subspaces, with (generally nonlinear) postprocessing then applied to these features independently or in small batches. Note that this is much weaker than stronger versions (such as the one inherent in SAEs) that posit 1-dimensional features. In my experience a version of this hypothesis is almost universally believed by engineers, and it also agrees with all the toy algorithms discovered so far.
Smoothness-ish of the data manifold. Inside the low-dimensional “feature subspace”, the data is kind of smooth: it is smooth (i.e., locally approximately linear) in most directions, and in the directions where it isn’t, it still might behave roughly smoothly in aggregate.
Linearity-ish of the classification signal. Even in cases like MNIST or transformer learning where the training data is discrete (and the algorithm is meant to approximate it by a continuous function), there is a sense in which it is locally well approximated by a linear function. E.g., perhaps some coarse-graining of the discrete data is continuous and linear, or at least the data boundary can be locally well approximated by a linear hyperplane (so that a local linear function can attain 100% accuracy). More generally, we can assume a similar local-linearity property of the layer-to-layer forward functions, when restricted to a single feature space or a small group of interacting feature spaces.
(At least partial) locality of the effect of weight modification. When I read it, this paper left a lasting impression on me. I’m actually not super excited about its main claims (I’ll discuss polytopes later), but a very cool piece of the analysis here is locally modelling ReLU learning as building a convex function as a max of linear functions (and explaining why non-ReLU learning should exhibit a softer version of the same behavior). This is a somewhat “shallow” point of view on learning, but probably captures a nontrivial part of what’s going on, and this predicts that every new weight update only has local effect—i.e., is felt in a significant way only by a small number of datapoints (the idea being that if you’re defining a convex function as the max of a bunch of linear functions, shifting one of the linear functions will only change the values in places where this particular linear function was dominant). The way I think about this phenomenon is that it’s a good model for “local learning”, i.e., learning closer to memorization on the memorization-generalization spectrum that only updates the behavior on a small cluster of similar datapoints (e.g. the LLM circuit that completes “Barack” with “Obama”). There are also possibly more diffuse phenomena (like “understanding logic”, or other forms of grokking “overarching structure”), but most likely both forms of learning occur (and it’s more like a spectrum than a dichotomy). (A tiny numerical sketch of the “shifting one linear piece only has local effect” point follows right after this list.)
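Here is that sketch of point 4 (my own toy, not taken from the paper): build a convex function as the max of a handful of random linear pieces, nudge one piece upward, and check that the function only changes where that piece is (or becomes) the dominant one.

```python
# Locality of a "weight update" for a convex function built as a max of
# linear pieces: shifting one piece changes the max only where that piece
# is (or becomes) dominant.
import numpy as np

rng = np.random.default_rng(1)
K = 8
a, b = rng.normal(size=K), rng.normal(size=K)
x = np.linspace(-3, 3, 601)

vals = a[:, None] * x[None, :] + b[:, None]   # each linear piece on the grid
f_before = vals.max(axis=0)
dominant = vals.argmax(axis=0)                # which piece is on top where

j = 3
b[j] += 0.3                                   # "update" a single piece
f_after = (a[:, None] * x[None, :] + b[:, None]).max(axis=0)

changed = ~np.isclose(f_before, f_after)
print("fraction of grid where f changed:        ", round(changed.mean(), 3))
print("fraction of grid where piece j was on top:", round((dominant == j).mean(), 3))
# The two fractions are close: the update is felt only where piece j was
# already dominant, plus a thin band it newly captures.
```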
If we buy that these four phenomena occur, or even “occur sometimes” in a way relevant for learning, then it naturally follows that a part of the “shape” of learning and generalization is well described qualitatively by linear regression. Indeed, the model then becomes that (by point 4 above), many weight updates exclusively “focus on a single local batch of input points” in some low-dimensional feature manifold. For this particular weight update, locality of the update and smoothness of the data manifold (#2) together imply that we can model it as learning a function on a linear low-dimensional space (since smooth manifolds are locally well-approximated by a linear space). Finally, local linearity of the classification function (#3) implies that we’re learning a locally linear function on this local batch of datapoints. Thus we see that, under this collection of assumptions, the local learning subproblems essentially boil down to linear regression.
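To make the reduction above concrete, here is a cartoon check with an entirely made-up setup: a nonlinear “classification signal” defined on a 2-dimensional feature subspace embedded in a higher-dimensional ambient space is poorly fit by one global linear regression, but well fit by linear regression restricted to a local batch of datapoints.

```python
# Cartoon "local learning = linear regression" check on a toy feature subspace.
import numpy as np

rng = np.random.default_rng(2)
d_feat, d_ambient, n = 2, 16, 5000
proj = rng.normal(size=(d_feat, d_ambient)) / np.sqrt(d_ambient)

z = rng.uniform(-2, 2, size=(n, d_feat))      # low-dimensional feature coordinates
x = z @ proj                                  # datapoints embedded in ambient space
y = np.sin(2 * z[:, 0]) + z[:, 1] ** 2        # nonlinear "classification signal"

def linear_fit_mse(xs, ys):
    A = np.c_[xs, np.ones(len(xs))]           # ordinary least squares with intercept
    w, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return np.mean((A @ w - ys) ** 2)

local = np.linalg.norm(z - z[0], axis=1) < 0.5   # a "local batch" in feature space
print("global linear fit MSE:", round(linear_fit_mse(x, y), 3))
print("local  linear fit MSE:", round(linear_fit_mse(x[local], y[local]), 3))
```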
Note that the “low-dimensional feature space” assumption, #1, is necessary for any of this to even make sense. Without making this assumption, the whole picture is a non-starter and the other assumptions, #2-#4 don’t make sense, since a sub-exponentially large collection of points on a high-dimensional data manifold with any degree of randomness (something that is true about the data samples in any nontrivial learning problem) will be very far away from each other and the notion of “locality” becomes meaningless. (Note also that a weaker hypothesis than #1 would suffice—in particular, it’s enough that there are low-dimensional “feature mappings” where some clustering occurs at some layer, and these don’t a priori have to be linear.)
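A quick illustration of why assumption #1 is load-bearing, with arbitrary made-up sizes: for i.i.d. random points in high dimension, all pairwise distances concentrate around the same value, so “local neighborhoods” stop containing anything.

```python
# Distance concentration: without low-dimensional structure, no pair of random
# points is meaningfully closer than any other pair.
import numpy as np

rng = np.random.default_rng(3)
n = 500
for d in [2, 10, 100, 1000]:
    X = rng.normal(size=(n, d))
    sq = (X ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))
    off = dist[np.triu_indices(n, k=1)]       # all pairwise distances
    print(f"d = {d:4d}   closest pair / mean distance = {off.min() / off.mean():.2f}")
# The ratio climbs towards 1 as d grows: "locality" loses its meaning.
```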
What does this model predict? Generally I think abstract models like this aren’t very interesting until they make a falsifiable prediction or at least lead to some qualitative update on the behavior of NNs. I haven’t thought about this very much, and would be excited if others have better ideas or can think of reasons why this model is incorrect. But one thing this model likely predicts is that a better model for a NN than a single linear regression model is a collection of qualitatively different linear regression models at different levels of granularity. In other words, depending on how sloppily you chop your data manifold up into feature subspaces, and how strongly you use the “locality” magnifying glass on each subspace, you’ll get a collection of different linear regression behaviors; you then predict that at every level of granularity, you will observe some combination of linear and nonlinear learning behaviors.
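One crude way to operationalize this “different granularities” picture (this is my framing, and it assumes scikit-learn is available): split the inputs into k clusters and fit a separate linear regression on each; sweeping k gives a family of piecewise-linear models at different levels of granularity.

```python
# Piecewise-linear fits at different granularities on a toy smooth target.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 4000
x = rng.uniform(-3, 3, size=(n, 2))
y = np.sin(2 * x[:, 0]) * np.cos(x[:, 1])     # smooth but nonlinear target

for k in [1, 4, 16, 64]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
    sse = 0.0
    for c in range(k):
        m = labels == c
        pred = LinearRegression().fit(x[m], y[m]).predict(x[m])
        sse += ((pred - y[m]) ** 2).sum()
    print(f"k = {k:3d}   piecewise-linear fit MSE = {sse / n:.4f}")
# Finer granularity: each local linear model fits better, and every level of
# granularity defines its own family of linear-regression subproblems.
```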
This point of view makes me excited about work by Ari Brill (which, as far as I know, is unpublished—I heard a talk on it at the ILIAD conference—see the Saturday schedule, first talk in Bayes). If I understood the talk correctly, he models a data manifold as a certain stochastic fractal in a low-dimensional space and makes scaling predictions about generalization behavior depending on properties of the fractal, by thinking of the fractal as a hierarchy of smooth but noisy features. Finding similarly flavored scaling behavior for “linear regression subphenomena” in a real-life machine learning problem would positively update me towards the model above being correct.
Ari’s work is on arXiv here
Loving this!
Epic.
A couple things that come to mind.
Linear features = sufficient statistics of exponential families?
The simplest case is that of Gaussians and the covariance matrix (which comes down to linear regression).
formalized by GPD theorem
see generalization by John
Exponential families are a fairly good class, but they are not closed under hierarchical structure. A basic example: a mixture of Gaussians is not an exponential family, i.e. it is not described in terms of just linear regression (a small numerical illustration follows below).
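Here is that illustration (my own toy, assuming scikit-learn is available): two datasets with identical mean and variance, one unimodal and one bimodal, get exactly the same single-Gaussian fit, because that fit only sees the sufficient statistics, while a 2-component Gaussian mixture clearly tells them apart.

```python
# A single Gaussian only sees sufficient statistics; a Gaussian mixture does not.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
n = 4000
unimodal = rng.normal(0.0, np.sqrt(10.0), size=(n, 1))          # one broad Gaussian
bimodal = np.concatenate([rng.normal(-3, 1, size=(n // 2, 1)),  # same mean and
                          rng.normal(+3, 1, size=(n // 2, 1))]) # variance (about 10)

for name, data in [("unimodal", unimodal), ("bimodal", bimodal)]:
    mu, var = data.mean(), data.var()   # the Gaussian fit = the sufficient statistics
    ll1 = GaussianMixture(n_components=1, random_state=0).fit(data).score(data)
    ll2 = GaussianMixture(n_components=2, random_state=0).fit(data).score(data)
    print(f"{name:8s}: Gaussian fit mu={mu:+.2f}, var={var:.2f}; "
          f"mixture log-likelihood gain per point = {ll2 - ll1:.3f}")
# Both datasets get essentially the same Gaussian fit, but the mixture's
# log-likelihood gain is near zero only for the unimodal one.
```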
The centrality of ReLU neural networks.
Understanding ReLU neural networks is probably 80-90% of understanding NN architectures. At sufficient scale, pure MLPs have the same or better scaling laws than transformers.
There are several lines of evidence that gradient descent has an inherent bias towards splines/piecewise-linear functions/tropical polynomials; see e.g. here and references therein (a small sketch of the piecewise-linear structure is at the end of this comment).
Serious analysis of ReLU neural networks can be done through tropical methods. A key paper is here. You say:
“very cool piece of the analysis here is locally modelling ReLU learning as building a convex function as a max of linear functions (and explaining why non-ReLU learning should exhibit a softer version of the same behavior). This is a somewhat “shallow” point of view on learning, but probably captures a nontrivial part of what’s going on, and this predicts that every new weight update only has local effect—i.e., is felt in a significant way only by a small number of datapoints (the idea being that if you’re defining a convex function as the max of a bunch of linear functions, shifting one of the linear functions will only change the values in places where this particular linear function was dominant). The way I think about this phenomenon is that it’s a good model for “local learning”, i.e., learning closer to memorization on the memorization-generalization spectrum that only updates the behavior on a small cluster of similar datapoints (e.g. the LLM circuit that completes “Barack” with “Obama”). “
I suspect the notions one should be looking at are the activation polytope and activation fan in Section 5 of the paper. The hypothesis would be something like: efficiently learnable features have a ‘locality’ constraint on these activation polytopes, i.e. they are ‘small’, ‘active on only a few data points’.
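To make the polytope language concrete, here is a rough sketch (my own toy, with an untrained random network standing in for a trained one): group datapoints by their ReLU activation pattern, which is exactly the partition of input space into activation polytopes, count how many points each polytope contains, and verify that the network is exactly affine on any single polytope (the piecewise-linear/tropical structure mentioned above).

```python
# Activation polytopes of a tiny ReLU network: points sharing an activation
# pattern lie in the same polytope, and the network is exactly affine there.
import numpy as np

rng = np.random.default_rng(6)
n, d, width = 2000, 2, 8
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(width, d)), rng.normal(size=width)
W2, b2 = rng.normal(size=(1, width)), rng.normal(size=1)

pre = X @ W1.T + b1
f = (np.maximum(pre, 0) @ W2.T + b2).ravel()

patterns = pre > 0                                   # activation pattern per datapoint
uniq, inverse, counts = np.unique(patterns, axis=0,
                                  return_inverse=True, return_counts=True)
inverse = inverse.ravel()
print("occupied polytopes:", len(uniq),
      "| largest:", counts.max(), "| median size:", int(np.median(counts)))

# On a fixed activation pattern the network is an affine map, so an ordinary
# linear regression on the points of any one polytope reproduces f exactly.
mask = inverse == np.argmax(counts)                  # the most populated polytope
A = np.c_[X[mask], np.ones(mask.sum())]
w, *_ = np.linalg.lstsq(A, f[mask], rcond=None)
print("max |residual| of a linear fit inside one polytope:",
      float(np.abs(A @ w - f[mask]).max()))
```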