We have in fact reverse engineered alien datastructures.
I’m trying to keep up a regular writing schedule for my Nanowrimo project. The post I’m working on isn’t done yet, so for today I’ll write a quick (and not very high-effort) take related to discussions I’ve had with more doomy friends (I’m looking at you, Kaarel Hänni). A good (though incomplete) crux for why something like “faithful interpretability” may be hard (“faithful” here meaning “good enough to genuinely decompose enough of the internal thought process to notice and avoid deception”) is contained in Tsvi’s post on “alien datastructures”. I think it’s a great piece, though I probably only agree with about 50% of it. A very reductive summary of the core parable: imagine we’ve found an ancient alien computer from a civilization with a totally separate, parallel development of math, logic, and computing. The protagonist of the koan lectures on why it might be hard to “do science” to reverse engineer what the computer does, and why it might be better to “do deep thinking” to try to understand what problems the computer is trying to accomplish, and then to come up with the right abstractions. In the case of AI, the recommendation is that this should in particular be done by introspecting and trying to understand our own thinking at a high level. There’s a great quote that I’ll include because it sounds nice:
Go off and think well——morally, effectively, funly, cooperatively, creatively, agentically, truth-trackingly, understandingly——and observe this thinking——and investigate/modify/design this thinking——and derive principles of mind that explain the core workhorses of the impressive things we do including self-reprogramming, and that explain what determines our values and how we continue caring across ontology shifts, and that continue to apply across mental change and across the human-AGI gap; where those principles of mind are made of ideas that are revealed by the counterfactual structure of possible ways of thinking revealed by our interventions on our thinking, like how car parts make more sense after you take them out and replace them with other analogous parts.
I don’t want to debate the claims of the piece, since after all it’s a koan and is meant to be a parable and a vibe. But I think that as a parallel parable, it’s useful to know that humanity has actually reverse engineered a massively alien computer, and we did it extremely well, and via a combination of “science” (i.e., iteratively designing and testing and refining models) and high-level thinking, though with more of an emphasis on science.
The alien computer in this analogy is physics. It’s hard to overstate how different physics is from what people in the past thought it would be. All our past ideas of reality have been challenged: waves aren’t really waves, particles aren’t really particles, time isn’t really time, space isn’t really (just) space. I think that a lot of people who have learned about math or physics, but don’t have experience doing research in it, have a romantic idea that it’s driven by “deep insight”: that physics sits around waiting for the next Newton or Einstein to get a new idea, and then adapts to that idea, with the experimentalists there to confirm the great theoretician’s theories and the “non-great” rank and file there to compute out their consequences. And like sure, this kind of outward flow from theorists is one thing that exists, but it’s mostly wrong. Mostly the way we understand the wild, alien structure that is physics is through just, well, doing science: people tweaking old paradigms, gathering data, finding things that don’t make sense, and folding the new information back into the big structure. The key point here (which physicists understand viscerally) is that there’s no big, concept-centric order dependence to discovery. You can understand quantum mechanics first, you can understand relativity first, you can start with waves or particles; in the end, if you are serious about refining and testing your intuitions, and you do calculations that produce results that increasingly make sense, you are on the right track.
A fun example here comes from the discovery of Heisenberg’s “matrix” model of quantum mechanics. One could imagine a romantic Eureka-moment picture of people suddenly grasping the indeterminacy of reality after thinking deeply about experiments that don’t make sense. But the reality was that in the years from Max Planck’s first idea of energy quanta in 1900 through 1925, people used measurements and arbitrary combinatorics that seemed to mostly agree with experiment to cobble together a weird theory of allowed and disallowed orbits (which the next generation called the “Old quantum theory”) that everyone admitted didn’t make sense and fell apart on deep consideration, but sorta worked. After a significant amount of progress was made in this way, Heisenberg and Sommerfeld took a two-parameter table of numbers that went into this theory, made a matrix out of it, and then (in my somewhat cartoonish understanding) noticed that this matrix could be used to measure some other phenomena better. Then Heisenberg realized that just by looking at the matrix, making it complex and viewing it as an evolution operator, you could get a unifying and more satisfying explanation of many of the disparate phenomena in the old theory; it would then take more time for people to develop the more modern concepts of measurement, decoherence, and many worlds (the last of which still meets resistance among the old guard today).
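To give a flavor of the move (a toy sketch of my own, not a historical reconstruction): treat a two-index table of transition quantities as a matrix and let matrix algebra carry the physics. For the harmonic oscillator, building the energy operator out of the “ladder” matrices reproduces the quantized levels directly:

```python
# Toy matrix mechanics for the harmonic oscillator, truncated to N levels.
# Everything is plain bookkeeping on a two-index table of numbers.
import math

N = 6  # truncation size of the (in principle infinite) matrices

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Matrix elements of the lowering operator a: a[n][n+1] = sqrt(n+1),
# i.e., nonzero only between neighboring energy levels.
a = [[math.sqrt(j) if j == i + 1 else 0.0 for j in range(N)] for i in range(N)]
# The raising operator is its transpose (entries are real here).
ad = [[a[j][i] for j in range(N)] for i in range(N)]

# H = a_dagger * a + 1/2, in units where h_bar = omega = 1.
H = matmul(ad, a)
for i in range(N):
    H[i][i] += 0.5

# Diagonal entries come out as the quantized energies 1/2, 3/2, 5/2, ...
print([H[i][i] for i in range(N)])
```

The point of the toy: nothing here required a deep conceptual picture up front; the quantization falls out of manipulating the table as a matrix.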
The point is that physics doesn’t (only) grow through great insight. This might seem surprising to someone who grew up with a certain purist philosophy of logic that, I think incorrectly, has been cultivated on LessWrong: roughly, that ideas are either right or useless. In fact, the experience of a physicist is that sloppy stopgap ideas that explain experiment or theory in a slightly more systematic way are often useful, though for reasons that will often elude the sloppy idea-haver (physics has a perennial problem of the developers of sloppy ideas getting overinvested in their “exact correctness” and resisting later, more sensible systematization; but this is a different story). In physics, ideas are deep, but they’re weirdly un-path-dependent. There are many examples of people arriving at the same place by different sloppy and unsystematic routes (the Feynman path integral vs. other formulations of quantum field theory being a great example). This uncanny ability to make progress by taking stopgap measures and then slowly refining them surprises physicists themselves, and I think some very cool physics “ideology” comes from trying to make sense of it. Two meta-explanations that were, I think, in part a response to physicists trying to figure out “what the hell is going on”, but that led to physics still used today (especially in the first case), are:
effective field theory and the renormalization group flow, coming from Wilson’s ideas about how theories at one energy scale can be effectively replaced by simpler theories at other scales (and relating to ideas of Landau and others about complicated concrete theories self-correcting to converge to certain elegant but more abstract “universal” theories at larger scales)
the “bootstrap” idea that complicated theories with unknown moving parts have measurements satisfying certain standard formulas and inequalities, and that one can often get quite far (sometimes, indeed, all the way) by ignoring the “physics” and treating the formulas as a formal system.
Both of these physics ideas identify a certain shape of “emergent structure”, where getting at something with even a very small amount of “orientational push” from reality, or from known sloppy structure, will tend to lead to the same “correct” theory at scales we care about.
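As a concrete toy of the first idea (my own illustration, not from the post): in the 1D Ising chain one can sum out every other spin exactly, and the remaining spins form another Ising chain with a weaker coupling K' = (1/2) ln cosh(2K). Iterating this “zoom out” step drives the theory toward the trivial K = 0 fixed point, a minimal instance of a complicated-looking theory simplifying at larger scales:

```python
# One exact decimation step of the renormalization group for the 1D Ising chain.
import math

def decimate(K):
    # Summing out every other spin leaves an Ising chain with coupling K'.
    return 0.5 * math.log(math.cosh(2.0 * K))

K = 1.0
couplings = [K]
for _ in range(6):
    K = decimate(K)
    couplings.append(K)

# The coupling shrinks at every step, flowing to the trivial fixed point K = 0
# (consistent with the 1D chain having no finite-temperature phase transition).
print(couplings)
```

For small K the recursion is approximately K' ≈ K², so the flow to zero accelerates: the fine-grained details of the microscopic chain simply stop mattering at large scales.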
This doesn’t mean that big ideas and deep thinking are useless. But I think that this does point (at least in physics) to taking more seriously ideas that “aren’t fully self-consistent yet” from a more demanding mathematical standpoint. We can play random fun measurement and calculation and systematization games, and tinker (“morally, effectively, funly, cooperatively, creatively, agentically, truth-trackingly, understandingly”) with systematizations of the results of these games, and we’re likely to get there in the end.
At least if the alien computer is physics.
For later. In my posts I’m hoping to talk in more detail about renormalization and the self-correcting nature of universality, and also about why I think that the complexity of (certain areas of) physics is a good match for the complexity of neural nets, and why I think the modern field of interpretability and “theoretical ML” more generally is much further along this game of tinkering and systematizing than many people think (in particular, in some ways we’re beyond the “Old Quantum Theory”-style mess). But this is it for now.
As Sean Carroll likes to say, though, the reason we’ve made so much progress in physics is that it’s way easier than the other sciences :)

I’d argue that the easiness of physics probably comes from the fact that we can get effectively unlimited data, combined with the ability to query our reality as an oracle to test ideas and, importantly, to get easy verification of a theory, which helps in two ways:
The prior matters very little, because you can update to the right theory from all but the most dogmatic priors.
Easy verification short-circuits a lot of philosophical debate and makes it easy to update towards correct theories.
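The first point can be made concrete with a toy Bayesian calculation (my own illustration, with made-up numbers): given abundant coin-flip data, even a fairly dogmatic conjugate prior lands in nearly the same place as a uniform one:

```python
# Prior-washing under abundant data: estimate a coin's bias from n flips.
# A Beta(alpha, beta) prior plus binomial data gives a
# Beta(alpha + heads, beta + tails) posterior.
heads, n = 7000, 10000

def posterior_mean(alpha, beta):
    return (alpha + heads) / (alpha + beta + n)

print(posterior_mean(1, 1))   # uniform prior: close to 0.70
print(posterior_mean(50, 1))  # dogmatic heads-biased prior: also close to 0.70
```

With only a handful of flips the two priors would disagree substantially; with ten thousand, the data dominates and the choice of prior is nearly irrelevant.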
However, I think the main insight, that sloppy but directionally correct ideas are worth building on and that partial progress matters, is very important and has applicability beyond physics.
This makes sense, but I’d argue that ML and interpretability have even more of both of these properties. Something that makes it harder is that some of the high-level goals of understanding transformers are inherently pretty complex, and the field is less susceptible to math/elegance-based analysis, so it is even messier :)
I think what explains the relative ease of progress in physics has more to do with its relative compositionality, in contrast to other disciplines like biology or economics or the theory of differential equations, in the sense Jules Hedges meant it. To quote that essay:
For examples of non-compositional systems, we look to nature. Generally speaking, the reductionist methodology of science has difficulty with biology, where an understanding of one scale often does not translate to an understanding on a larger scale. … For example, the behaviour of neurons is well-understood, but groups of neurons are not. Similarly in genetics, individual genes can interact in complex ways that block understanding of genomes at a larger scale.
Such behaviour is not confined to biology, though. It is also present in economics: two well-understood markets can interact in complex and unexpected ways. Consider a simple but already important example from game theory. The behaviour of an individual player is fully understood: they choose in a way that maximises their utility. Put two such players together, however, and there are already problems with equilibrium selection, where the actual physical behaviour of the system is very hard to predict.
More generally, I claim that the opposite of compositionality is emergent effects. The common definition of emergence is a system being ‘more than the sum of its parts’, and so it is easy to see that such a system cannot be understood only in terms of its parts, i.e. it is not compositional. Moreover I claim that non-compositionality is a barrier to scientific understanding, because it breaks the reductionist methodology of always dividing a system into smaller components and translating explanations into lower levels.
More specifically, I claim that compositionality is strictly necessary for working at scale. In a non-compositional setting, a technique for solving a problem may be of no use whatsoever for solving the problem one order of magnitude larger. To demonstrate that this worst case scenario can actually happen, consider the theory of differential equations: a technique that is known to be effective for some class of equations will usually be of no use for equations removed from that class by even a small modification. In some sense, differential equations is the ultimate non-compositional theory.
Minor nit: the alien computer is a specific set of physical laws, which shouldn’t be confused with the general case of physics/mathematics, so we only managed to reverse engineer it for one universe.
Cute :). Do you mean that we’ve only engineered the alien computer running a single program (the standard model with our universe’s particular coupling constants), or something else?
Yes, I was talking about this:

Do you mean that we’ve only engineered the alien computer running a single program (the standard model with our universe’s particular coupling constants)

but more importantly I was focused on the fact that all plausible future efforts will only reverse engineer the alien computer that runs a single program, which is essentially the analogue of the laws of physics for our physical universe.