Lucius: I recently held a small talk presenting an idea for how and why deep learning generalises. It tried to reduce concepts from Singular Learning theory back to basic algorithmic information theory to sketch a unified picture that starts with Solomonoff induction and, with a lot of hand waving, derives that under some assumptions, just fitting a big function to your data using a local optimisation method like stochastic gradient descent maybe, sorta, kind of, amounts to a cheap bargain bin approximation of running Solomonoff induction on that data.
Lucius: The parametrised function we fit to the data has to be the sort that can act like a vaguely passable approximation of a compute-limited Universal Turing machine, with parameter configurations playing the role of programs.
Lucius: In a Universal Turing machine, ‘simpler’ hypotheses with less ‘burdensome details’ will be implemented by exponentially more equivalent programs. So, simpler hypotheses take up exponentially more volume in program space. Thus, if we start with a uniform prior over all programs of some fixed length l, we’re effectively implementing an approximation of the Solomonoff prior, giving exponentially more weight to simpler hypotheses.[1]
Lucius: Similarly, in a good parametrised function, ‘simpler’ fits to the data with less ‘burdensome details’ will correspond to exponentially more equivalent parameter settings than complicated ones. So, simple functions will take up an exponentially larger volume in parameter space than complicated functions, and a uniform prior over parameter space will exponentially favour them before we even see any data.
Lucius: Later, I posted the slides for the talk in a group chat because someone asked for them. I did not include a GIANT DISCLAIMER that these slides are NOT RIGOROUS and present an OVERSIMPLIFIED STORY to help the audience follow the thread of the explanation better. They should NOT BE PARSED without ACCOMPANYING COMMENTARY.
Lucius: This was very stupid of me.
Lucius: Kaarel and Dmitry promptly jumped in to critique the slides, and I replied to their critiques. Afterwards, we all realised that the ensuring discussion basically acts as the commentary the slides were missing, and also touched on many other ideas regarding the nature of intelligence that may be of interest to some.
Lucius: So we figured we’d post the discussion here, somewhat edited, along with the slides.
Slides
DISCLAIMER: THESE OVERSIMPLIFIED SLIDES present an OVERSIMPLIFIED STORY because they were just PROPS FOR A TALK. Their contents should be taken IN THE CONTEXT of the CRITIQUE AND DISCUSSION BELOW and probably seasoned with LARGE QUANTITIES OF SALT even then.
Kaarel: The idea of a wheel is simple and humans can think about stuff and invent wheels but local learning (evolution) struggles to invent wheels.
Kaarel: More generally, most mathematical concepts probably could not be invented by SGD (at any scale that fits in this universe) (and mathematical concepts are useful for prediction).
Jake: Indeed more the fool evolution for inventing wheels in some places but not using them everywhere where the logical inference is small. But evolution isn’t taking steps in logical inference space.
Kaarel: Humans invent mathematical concepts largely not by doing gradient descent in concept-space, but by thinking about what concept to make. it’s maybe more like solving a constraint satisfaction problem where you can’t find a solution by local search but can find a solution by thinking.
Kaarel: (i’m not saying that concepts cannot be invented by a thing arising from SGD doing some thinking about which concepts to make — that’s plausible with the right setup, though it could require something beyond what’s currently in common use/consideration. certainly evolution is a pretty dumb pretty local thing which sorta built a thing which can make concepts)
Kaarel: A neural net struggling to learn which subset of inputs was XOR-ed despite it being a simple linear algebra problem to determine which subset was XOR-ed (and probably not too hard for a human to think of trying that to solve the same prediction problem) is a good example to have in mind here.
Lucius: I think you may be parsing these slides differently from what I tried to say with them in the talk. I don’t think humans or AGIs being able to find things evolution and neural network training can’t is evidence against what I’m trying to say. Moving locally in some hypothesis space on a second-to-second basis doesn’t mean you’re well described as moving locally in a fitness landscape over long time horizons.
Lucius: If I want to build a car, but don’t know how, this model does not say I have to think through random car designs and update until I find one that seems to work. That’s sort of what this model suggests happens a lot on a second-by-second basis, but over longer time scales, your current position in search space influences what data you’ll even try to fit next and with what sort of function. So you might move to making car designs in some platonic realm where wheels just turn without engines, and if some design works in that space, you might go back to fitting in the original space with a new starting point and new data to fit based on that platonic design.
Lucius: According to this idea, the rules for how to pick what to fit next might actually be really dumb. At least at first. You might memorise what rule worked well on a previous occasion that pattern matches this occasion, for example (where the pattern match is just a dumb function fit, like everything else).
Lucius: Eventually, your rules for what to think about next might get better. But that might be through normal updating that works just like everything else in this setup. Memorise, then try to find more general rules by fitting the memorisations with some functions.
Lucius: Through this process, you might come to acquire rules for what to think about and fit functions to that are quite advanced indeed. Rules like formal logical thinking, that have your search trajectory move in ways very unlike what we’d usually associate with local updating.
Lucius: I don’t think these very fancy rules necessarily contain some sort of core insight required for generality though. I think your mind works ok and is reasonably general even before it gets the very fancy rules. I’m not sure there’s any kind of core pattern that unites all the fancy rules either. I think they’re maybe just good rules you can use.
Lucius: So I guess the difference in (lightly held) opinion here might be this part
Kaarel: Humans invent mathematical concepts largely not by doing gradient descent in concept-space, but by thinking about what concept to make.
The idea in the slides does say that local search in ‘concept space’ is probably pretty central for doing things. There’s also rules for what kind of concept space to search in when to fit what data, but these are more a collection of heuristics learned over time, and the very basics of general intelligence maybe work ok even when these rules are still quite basic.
Kaarel: Some thoughts about intelligence in sorta-continuation of the above discussion. This isn’t a direct response, but some considerations that i think are probably relevant + true/good:
I think it could be reasonable to say either of the following about intelligence:
“there are very many interesting thinking-structures already in use, and very many more will come to be used in the future, and very many more (even comparably useful ones) will never be invented”
OR “there’s nothing mysterious, just a bunch of stupid stuff”
(I feel more like saying the former than the latter)
But I doubt it makes sense to say “it’s sorta just something like SGD + stupid stuff”. Here are some things that are imo more central than something like SGD:
Tool use, tool-finding, tool-making
Language
Logic
Investigating how to think, and then starting to think that way.
Note one is often pretty much thinking about how to think without being very aware that one is doing that. Some examples:
A mathematician coming up with a definition which ends up figuring in certain neat theorems.
More generally, seeing a need to talk about something and coining a word for it.
A philosopher trying to clarify/re-engineer a term, eg by seeing which definition would fit best into the ways one wants to speak using that term.
Noticing and resolving tensions.
Developing/clarifying/inventing/specifying the scientific method.
Trying to figure out how to improve peer review.
Also central: doing stuff which is more like trying out ways to think (and keeping the good ones) than thinking about how to think. But this process of trying will still be richly structured.
This overlaps with tool-making/finding+use, because concepts are thinking-tools. It overlaps with language because human concepts are significantly linguistic — they fit in a language together, they only let us do big things because of fitting together in a language. (Really, we can’t even properly conceive of them / ‘they aren’t themselves’ except in the context of a language or linguistic practices even more broadly. Note that I’m not claiming that language is all the necessary context for concepts to even make sense — one probably needs a lot of additional context also.)
If I had to pick one thing to say doing very much is all about, then maybe I’d say this.
It’s plausible this deserves to be split into many (sub)categories.
Compression
Facing a curriculum of problems. creating problems for yourself to learn. play.
Keeping problems in mind as you go about your business, thinking about whether something you notice could be applied.
Goal-pursuits. working on projects
Seeing/making/using analogies
Teaching/explaining
Scale
Probably this and this could motivate more things to add to this list.
One might object to some or all the above, saying these aren’t responses to the right question. But what question are we asking? Maybe: “which combinations of structures make a process which goes on to do/understand very much?”. Some issues with this question:
One could ask this with a cognitive system with some very fixed structure in mind. like, maybe you have a particular kind of ‘world model’ and particular other components, and voila, this thing goes on to understand arbitrarily much math/science by just ‘filling in some content’ into these structures. which structures make this possible?
I think this picture is basically confused, because to understand very much decently quickly, you probably have to keep reprogramming yourself on all levels forever. We could maybe imagine a mind which gets very far while ‘keeping the same structure’, but I think such a mind would just be building new structures that are not seen in the explicit structure you forced on it (e.g. imagine a system somehow forced into not using probabilities explicitly which effectively starts using probabilities anyway in some contrived way, because that’s a useful thinking-structure). If we insist on no new structures really being incorporated, then imo the answer to “which structures make fooming possible?” is just “there are no such structures — fooming is not possible this way”
One could ask about this allowing new structures to be formed. Like, literally, which processes go on to do/understand very much?
Are we happy with the answer ‘for example, a sufficiently big evolution’? I feel like we’re not happy with this answer despite it being true, so this question isn’t exactly what we’re meaning to ask. I do think it’s a very interesting question though, about which very many interesting things could be said. I think there will be many very substantively different ways to get such a process, and it’d be cool to understand much more about which processes work.
Okay, maybe we wanted to ask instead: “The formation/invention/discovery of which structure first makes such a process foom decently fast?”
I think there are plausibly many structures that could be formed in various orders here that would increase the speed of fooming. I think all of the following structures have a reasonable claim to having contributed significantly to humanity fooming: replication/evolution (made fooming much faster than physics without evolution), vision, (large gap here because i got bored with adding to this list,) learning stuff from parents, opposable thumbs, manual dexterity, persistence hunting in groups, tool use, tool-making/maintenance/improvement, using fire, language, culture, trade, writing, governance, schooling, mathematical proof, the printing press, the scientific method, democracy, the use of statistics, medicine, first-order logic, computers, programming, the internet (and really basically any technology, way of thinking, way of acting, concept, but these are some central ones)
I think the speed of fooming will be increased by many structures in the future, plausibly arbitrarily many times (until the end of time or whatever), and plausibly even including arbitrarily many new structures as profound as developing primitive language.
This kind of thing doesn’t seem like an entirely satisfying answer either.
Okay, maybe we only meant the structures which are in use when a current individual human paces around, temporarily doesn’t interact with anything external, and tries to make research progress?
I think very many interesting structures are in use here. I’ve mentioned some above.
E.g. each word has some little structures associated with it.
Maybe we’re asking for some higher structure organizing these little structures?
I think there’s a major aspect of: often, the context for each little structure is created by all the other little structures, not something external to these little structures.
Anyway, the external thing would also have a lot of rich structure.
If I had to pick something short to say about a separate higher organizing thing, I’d probably say that (1) effectively, we’re investigating how to think and then thinking in those ways (and this happens across the board, on all levels), and that (2) our concepts have close shadows in language / make a language together. And then maybe next i’d want to say a bunch more about language.
Maybe this is still too high-level? but idk, are we interested in neuronal weight update rules here?
Kaarel: Some sorta meta commentary: I think that “what’s the science algorithm?”/”how does problem-solving/thinking work?” is sometimes asked like one is asking for some sort of final formula for thinking. This is imo like asking “what’s the ultimate technology?” or “what’s the formula for mathematics?”. There’s no such thing as ‘a formula for mathematics’ — there are infinitely many cool things to be figured out about mathematical objects. I think there are also infinitely many cool things to be figured out about how to think well, and infinitely many cool structures to incorporate into our thinking.
Kaarel: I would guess that in our circles, this imo mistake is downstream of something like this mistake by yudkowsky, but idk.
Kaarel: Of course, non-ultimate variants of these questions, as in “let’s figure out more about intelligence”, are imo centrally interesting. I just don’t think these questions are the sorts of things to be ‘solved’ — they are the sorts of things to be indefinitely figured out better (it’s imo probably even appropriate to think of oneself as always only understanding a finite fragment of an infinite thing here, like in math)
Dmitry: I finally read Lucius’ slide show. I like it a lot.
Dmitry: A few comments however:
Dmitry: 1. I would say that SLT isn’t about the Bayesian prior (that’s just “information geometry” or whatever). It’s rather an extra few epicycles: giving more details on what the prior looks like geometrically.
Dmitry: It’s just not true that programs correctly sample the “Bayesian” prior. Findability/SGD is a massive part of what is actually learnable. This (not SLT/Bayesianism) is what explains modularity in particular.
Dmitry: I’m planning to make a post about this, but there are lots of ways to see this. The parity problem and noticing that harder problems show up in LLM programs (which doesn’t mean they’re not solvable, just that they’re not solvable by the most compressed/ “Solomonoff-preferred” way), it’s evident in the algorithm learned by associative groups composition, etc.
Dmitry: I think studying the Bayesian/informational prior is still very valuable and a great correction to what people talk about when they talk about priors (it’s just that SGD and learnability are even more important).
Dmitry: I think Kaarel actually said more or less the same thing. Basically: the SGD prior says that NN’s prefer the “most efficient, including degeneracy, way of solving a problem”. This is not true. In fact, as you both agree, NN’s *locally* prefer the “most efficient, including degeneracy, way of solving the problem in a way that’s a perturbation of a previous solution. This means that (with high probability) there needs to be a “learning story” connecting an instance of the algorithm that’s ultimately learned to a generic starting point. The solution predicted by the “true” Bayesian prior is not in fact learnable in any interesting real-world context.
Dmitry: One way to motivate this is, once again, that the most efficient way of learning parity isn’t in fact learnable (and this is provably the case). It’s still possible to train a NN with a more interesting/complex loss to solve the parity problem: e.g. if the NN has a concept of language and logic, in a suitable sense, it can solve parity in polynomial time. It’s just that, provably (in a certain sense), it’s not going to be finding the most optimal “Bayesian-preferred” solution, but rather routing around the un-findability of this solution and going though some more elaborate SGD path to find a less degenerate/natural solution that has the correct I/O behavior.
Dmitry: I think in practice, a lot of the time this means that the generalizing solution found will just be more modular than the optimal solution (this is the case for associative groups). But more generally, training stories can be more complicated, since you can GD to “edit” or post-process a highly modular but bad algorithm to a better but less parallel algorithm. Trying to model this is this discussion about priors that Kaarel described that we had when he was visiting.
Dmitry: Re: Kaarel’s “structure of intelligence”, I think I largely agree. But I think the difference between us is that I am somewhat strongly in the “interp is possible” camp. In physics, you get complex and interesting behavior from taking simple “interaction” behaviors at different characteristic scales and putting them together in a giant superposition. I think NN’s follow the same general framework, and it’s plausible that we’ll be “done” with interp (at least well enough to do things like guarantee non-deception from low-level behaviors) once we have suitably good notions of “interaction”, “characteristic scale”, and “putting together”.
Dmitry: I also think that there are ways in which we can get useful-for-safety interp while our understanding is still bad, so long as we get some partial completeness: e.g. (mostly) completely characterize what can happen at a particular characteristic scale.
Dmitry: This makes me particularly excited about this Sam Marks paper, since while I don’t think that “find all mechanisms of this shape” is a viable path, it would nevertheless be quite exciting and plausible to say something like “a large component of NN behavior is described by a superposition of diagrams of roughly this complexity” (this could be shown by looking at a random phrase and producing a collection of diagrams that explain it). This would tell us something nontrivial about the complexity and branchiness of LLM intelligence, without actually explaining it in a way that is reverse-engineerable (same as people do for physics).
Lucius:
Dmitry: It’s just not true that programs correctly sample the “Bayesian” prior. Findability/SGD is a massive part of what is actually learnable.
I agree with this. These slides are giving a simplified 0-th order toy story to aid understanding. They’re really not complete without the accompanying talk.
Since this has now happened a second time, no seriously, these are just props from a small Apollo-internal talk that Jake asked for. They’re not an essay.
Lucius:
Kaarel: Some sorta meta commentary: …
I feel like there’s some miscommunication here. Let me back up and try to put my words more in the context that produced them.
In the year 2020, if you had asked me to write down a practical, real general intelligence algorithm, I would have told you that I have no clue how to do that. Not the best possible algorithm. Just any algorithm at all. Part of why I was unable to do that might be lack of knowledge about what you call many interesting thinking-structures. But it felt to me like there was a much graver and deeper problem there than just sufficient lack of knowledge about a vast and rich vista of tools for thinking. I certainly did not understand a lot of the little tricks and ingredients that maybe help with generality to the point that I could formalise them as code. But I felt that there was something even more central there that I was far more fundamentally confused about: How do you make an algorithm that generalises at all, ever, even a little bit, from old data to new data?
Even before deep learning, this felt like maybe the most core question here to me. Though I wouldn’t have put it in those terms back then, I was a teenager without the math vocab to express this. It’s kind of in the name: general intelligence. When observing my own thoughts while solving problems, or thinking about how the brain learned over the long term, I felt like this machinery had to somehow, constantly and at every step, navigate a vast number of potentially correct local fits to the data and think only about the ones that generalised to new data. Learning about Solomonoff induction at least clarified how picking a good hypothesis out of many that fit the data was possible at all even in mathematical principle. But my brain was clearly purposefully picking out simple hypotheses, rather than tracking every hypothesis possible. At every moment of conscious perception, the thoughts coming into my conscious awareness seemed to me to be completely undetermined by previous thoughts. There was some seeming ability there to always continue the line of fuzzy logic such that the overall chain was likely to generalise and reach the target, when exponentially many other continuations would have failed. It felt to me like however that trick worked, it might also be the trick through which I could learn to walk or hold a spoon, since this process seemed to have the same generalisation-shaped hole in the middle.
So it used to feel to me that there was an utterly crucial core insight missing here, about how to get any generality whatsoever. I think I was not alone in this. I very much had the impression that Yudkowsky and probably MIRI more generally thought about it in a similar way, for example. Among many others.
I was so confused the first time I saw an explainer video on deep learning. I had heard that they got computers to learn now, so I thought they must have figured out how to solve the generalisation problem. I kept rewatching the video looking for the part where the generalisation trick happens, and came up empty.
Maybe this doesn’t feel like such a central question to slightly younger people who grew up more used to the observed fact that deep learning works. Or maybe you just never made the mistake I and perhaps many others made for so many years when thinking about function fits and the Solomonoff prior, so you never became deeply confused in the first place. But to me, this question of generalisation was the question. All the other stuff that might or might not go into general intelligence could be rich and vast and poorly understood, but it never felt completely mysterious and utterly indispensable to AGI to me the way the generalisation problem did.
If you never thought this was all that mysterious and that it seemed explicable with the math we already had, Bayes points to you. In that case, I could see how you’d be confused or annoyed by people like me walking around ranting about this basic fact like it’s some grant revelation and we understand AGI now. But if you put yourself in my shoes, can you see how slowly resolving this very basic confusion over the last 2-3 years would feel like progress to me?
Kaarel: Ok i’m on board with there being some progress here :)
Bridging NN SGD and Solomonoff induction (from Oct 2024)
Kaarel: Dmitry and I had a few conversations about the stuff he and Louis Jaburi (et al.) are figuring out about how NNs learn group operations in which we discussed how it would be cool to try to spell out an analogy between the kind of structure-selection seen empirically for learning group operations and Solomonoff induction, and how it might be helpful to have some explicit story like that written up for people to have a crisp example to consider when thinking about how NN learning is or isn’t like Solomonoff induction.
Lucius: I dunno about using that as the specific example, but I agree that spelling out the story as we currently see it more explicitly for everyone might be a good idea.
Lucius: I’d love to compare/contrast:
Solomonoff induction
MLP or transformer trained with Bayesian updating, starting from a uniform or Gaussian prior over parameters.
MLP or transformer trained with some local optimisation algorithm like sgd, starting from gaussian or uniform initialisation.
Lucius: We can’t actually run that empirically for most realistic problems for the first two options, of course. So either we keep those to abstract discussion, or we find a really neat example where you can do it on paper.
Kaarel: To say a bit more about the group operation Solomonoff thing: Dmitry was telling me that it’s looking plausible that one could develop a good description whose broad strokes are: “There’s a prior on circuits. These circuits are of some known kind: mapping both inputs into some ‘small’ algebraic object, multiplying there, and mapping back; each such circuit is only somewhat better than random at getting the right output group element but the signal beats the noise when sufficiently many circuits are added. the output of a trained NN on an input (g,h) is a linear combination of outputs of random circuits on (g,h)drawn from this prior. (these circuits get ‘drawn’ during NN learning.) And by this I of course don’t just mean to state a fact about the input-output behavior, but to also claim that the NN is structured this way.”
Kaarel: Given my surface understanding it seems plausible to me that with some significant work the prior could also be explicitly described here.
Kaarel: I guess the thought is that the trained NN looks a bit like a subroutine of a Solomonoff inductor here: “The NN has a bunch of small computations it knows how to execute. To predict an output, it executes all of them and aggregates their outputs. This is like the part of Solomonoff induction that happens after algorithms which have previously made a wrong prediction are eliminated: you run all remaining algorithms and aggregate their predictions for the next bit.”
Kaarel: And NN training is when the small patterns that are to be used get selected. That’s like the subroutine of a Solomonoff inductor that selects which algorithms get their opinions about the next bit taken into account. So we have some analogy between [NN training + inference] and solomonoff induction here.
Kaarel: Of course, the above story is naive in that really there are interaction terms between the circuits — a more sophisticated picture would talk about the ecology of circuits. concretely, e.g. if one looks at the fraction of circuits which make an error at input (g,h), calculating the variance of this quantity as one varies the input, then it’s plausible this variance will be smaller than what one would predict if the circuits were drawn independently, because ‘the circuits are selected to cover each other’s weaknesses’. But the hope is that this kind of thing can be thought as a higher-order correction on top of the above first-order story.
Kaarel: Anyway, I like the themes of
Training producing many algorithms which are literally added together (instead of producing something that’d feel more like a single unified algorithm) — this feels related to how it would be silly to replace Solomonoff induction’s giving weight to a bunch of algorithms by a variant where one only uses the single simplest Turing machine that fits the data so far to predict the next bit.
The algorithms being from some impoverished/weird repertoire, individually making many mistakes, being individually ill-suited for the problem at hand, but still each having a bit of signal and so law-of-large-numbers-ing up to something sensible.
Acknowledgements
Lucius: Thanks to Alexander Odenziel, David Quarel, Matthias Dellagoand probably a bunch of people my brain failed to trackfordiscussions and insight that helped put the ideas in these slides together. I also don’t claim any originality here. A lot of this is arguably latent in Solomonoff’s original work already, and it would not surprise me much if many people in algorithmic information theory or elsewhere put something like this story together at some point over the years. It’s also not the first time Singular Learning Theory has been linked to algorithmic information theory in some way. As pointed out by Daniel Murfet here, Clift-Murfet-Wallbridge used a version of the padding argument to show that the learning coefficient is (in a sense) a lower bound on the Kolmogorov complexity in the setting of noisy Turing machines, and a follow-up result of that was treated in Thomas Waring’s master thesis.
Kaarel: I should also thank Sam Eisenstat, Tsvi Benson-Tilsen, Jake Mendel, Simon Skade, Jessica Taylor, for discussions + many of the ideas in my messages.
If you’re used to seeing the Solomonoff prior formulated as summing over all programs, with each program exponentially weighted by its length, the way it’s written down here: That’s for prefix Turing machines. There’s an equivalent definition for plain Turing machines where you just take a uniform distribution over all programs of length l, then take the limit l→∞.
Deep Learning is cheap Solomonoff induction?
Background
Lucius: I recently held a small talk presenting an idea for how and why deep learning generalises. It tried to reduce concepts from Singular Learning theory back to basic algorithmic information theory to sketch a unified picture that starts with Solomonoff induction and, with a lot of hand waving, derives that under some assumptions, just fitting a big function to your data using a local optimisation method like stochastic gradient descent maybe, sorta, kind of, amounts to a cheap bargain bin approximation of running Solomonoff induction on that data.
Lucius: The parametrised function we fit to the data has to be the sort that can act like a vaguely passable approximation of a compute-limited Universal Turing machine, with parameter configurations playing the role of programs.
Lucius: In a Universal Turing machine, ‘simpler’ hypotheses with less ‘burdensome details’ will be implemented by exponentially more equivalent programs. So, simpler hypotheses take up exponentially more volume in program space. Thus, if we start with a uniform prior over all programs of some fixed length l, we’re effectively implementing an approximation of the Solomonoff prior, giving exponentially more weight to simpler hypotheses.[1]
Lucius: Similarly, in a good parametrised function, ‘simpler’ fits to the data with less ‘burdensome details’ will correspond to exponentially more equivalent parameter settings than complicated ones. So, simple functions will take up an exponentially larger volume in parameter space than complicated functions, and a uniform prior over parameter space will exponentially favour them before we even see any data.
Lucius: Later, I posted the slides for the talk in a group chat because someone asked for them. I did not include a GIANT DISCLAIMER that these slides are NOT RIGOROUS and present an OVERSIMPLIFIED STORY to help the audience follow the thread of the explanation better. They should NOT BE PARSED without ACCOMPANYING COMMENTARY.
Lucius: This was very stupid of me.
Lucius: Kaarel and Dmitry promptly jumped in to critique the slides, and I replied to their critiques. Afterwards, we all realised that the ensuring discussion basically acts as the commentary the slides were missing, and also touched on many other ideas regarding the nature of intelligence that may be of interest to some.
Lucius: So we figured we’d post the discussion here, somewhat edited, along with the slides.
Slides
DISCLAIMER: THESE OVERSIMPLIFIED SLIDES present an OVERSIMPLIFIED STORY because they were just PROPS FOR A TALK. Their contents should be taken IN THE CONTEXT of the CRITIQUE AND DISCUSSION BELOW and probably seasoned with LARGE QUANTITIES OF SALT even then.
Slides: Deep Learning is cheap Solomonoff Induction
Discussion
Lucius: Slides Jake asked for.
Kaarel: The idea of a wheel is simple and humans can think about stuff and invent wheels but local learning (evolution) struggles to invent wheels.
Kaarel: More generally, most mathematical concepts probably could not be invented by SGD (at any scale that fits in this universe) (and mathematical concepts are useful for prediction).
Jake:
Kaarel: Yes, yes
Jake: Yes yes to u
Jake: I’m being facetious.
Jake: No macro wheels, the point stands.
Jake: Indeed more the fool evolution for inventing wheels in some places but not using them everywhere where the logical inference is small. But evolution isn’t taking steps in logical inference space.
Kaarel: Humans invent mathematical concepts largely not by doing gradient descent in concept-space, but by thinking about what concept to make. it’s maybe more like solving a constraint satisfaction problem where you can’t find a solution by local search but can find a solution by thinking.
Kaarel: (i’m not saying that concepts cannot be invented by a thing arising from SGD doing some thinking about which concepts to make — that’s plausible with the right setup, though it could require something beyond what’s currently in common use/consideration. certainly evolution is a pretty dumb pretty local thing which sorta built a thing which can make concepts)
Kaarel: A neural net struggling to learn which subset of inputs was XOR-ed despite it being a simple linear algebra problem to determine which subset was XOR-ed (and probably not too hard for a human to think of trying that to solve the same prediction problem) is a good example to have in mind here.
Lucius: I think you may be parsing these slides differently from what I tried to say with them in the talk. I don’t think humans or AGIs being able to find things evolution and neural network training can’t is evidence against what I’m trying to say. Moving locally in some hypothesis space on a second-to-second basis doesn’t mean you’re well described as moving locally in a fitness landscape over long time horizons.
Lucius: If I want to build a car, but don’t know how, this model does not say I have to think through random car designs and update until I find one that seems to work. That’s sort of what this model suggests happens a lot on a second-by-second basis, but over longer time scales, your current position in search space influences what data you’ll even try to fit next and with what sort of function. So you might move to making car designs in some platonic realm where wheels just turn without engines, and if some design works in that space, you might go back to fitting in the original space with a new starting point and new data to fit based on that platonic design.
Lucius: According to this idea, the rules for how to pick what to fit next might actually be really dumb. At least at first. You might memorise what rule worked well on a previous occasion that pattern matches this occasion, for example (where the pattern match is just a dumb function fit, like everything else).
Lucius: Eventually, your rules for what to think about next might get better. But that might be through normal updating that works just like everything else in this setup. Memorise, then try to find more general rules by fitting the memorisations with some functions.
Lucius: Through this process, you might come to acquire rules for what to think about and fit functions to that are quite advanced indeed. Rules like formal logical thinking, that have your search trajectory move in ways very unlike what we’d usually associate with local updating.
Lucius: I don’t think these very fancy rules necessarily contain some sort of core insight required for generality though. I think your mind works ok and is reasonably general even before it gets the very fancy rules. I’m not sure there’s any kind of core pattern that unites all the fancy rules either. I think they’re maybe just good rules you can use.
Lucius: So I guess the difference in (lightly held) opinion here might be this part
The idea in the slides does say that local search in ‘concept space’ is probably pretty central for doing things. There’s also rules for what kind of concept space to search in when to fit what data, but these are more a collection of heuristics learned over time, and the very basics of general intelligence maybe work ok even when these rules are still quite basic.
Kaarel: Some thoughts about intelligence in sorta-continuation of the above discussion. This isn’t a direct response, but some considerations that i think are probably relevant + true/good:
I think it could be reasonable to say either of the following about intelligence:
“there are very many interesting thinking-structures already in use, and very many more will come to be used in the future, and very many more (even comparably useful ones) will never be invented”
OR “there’s nothing mysterious, just a bunch of stupid stuff”
(I feel more like saying the former than the latter)
But I doubt it makes sense to say “it’s sorta just something like SGD + stupid stuff”. Here are some things that are imo more central than something like SGD:
Tool use, tool-finding, tool-making
Language
Logic
Investigating how to think, and then starting to think that way.
Note one is often pretty much thinking about how to think without being very aware that one is doing that. Some examples:
A mathematician coming up with a definition which ends up figuring in certain neat theorems.
More generally, seeing a need to talk about something and coining a word for it.
A philosopher trying to clarify/re-engineer a term, eg by seeing which definition would fit best into the ways one wants to speak using that term.
Noticing and resolving tensions.
Developing/clarifying/inventing/specifying the scientific method.
Trying to figure out how to improve peer review.
Also central: doing stuff which is more like trying out ways to think (and keeping the good ones) than thinking about how to think. But this process of trying will still be richly structured.
This overlaps with tool-making/finding+use, because concepts are thinking-tools. It overlaps with language because human concepts are significantly linguistic — they fit in a language together, they only let us do big things because of fitting together in a language. (Really, we can’t even properly conceive of them / ‘they aren’t themselves’ except in the context of a language or linguistic practices even more broadly. Note that I’m not claiming that language is all the necessary context for concepts to even make sense — one probably needs a lot of additional context also.)
If I had to pick one thing to say doing very much is all about, then maybe I’d say this.
It’s plausible this deserves to be split into many (sub)categories.
Compression
Facing a curriculum of problems. creating problems for yourself to learn. play.
Keeping problems in mind as you go about your business, thinking about whether something you notice could be applied.
Goal-pursuits. working on projects
Seeing/making/using analogies
Teaching/explaining
Scale
Probably this and this could motivate more things to add to this list.
One might object to some or all the above, saying these aren’t responses to the right question. But what question are we asking? Maybe: “which combinations of structures make a process which goes on to do/understand very much?”. Some issues with this question:
One could ask this with a cognitive system with some very fixed structure in mind. like, maybe you have a particular kind of ‘world model’ and particular other components, and voila, this thing goes on to understand arbitrarily much math/science by just ‘filling in some content’ into these structures. which structures make this possible?
I think this picture is basically confused, because to understand very much decently quickly, you probably have to keep reprogramming yourself on all levels forever. We could maybe imagine a mind which gets very far while ‘keeping the same structure’, but I think such a mind would just be building new structures that are not seen in the explicit structure you forced on it (e.g. imagine a system somehow forced into not using probabilities explicitly which effectively starts using probabilities anyway in some contrived way, because that’s a useful thinking-structure). If we insist on no new structures really being incorporated, then imo the answer to “which structures make fooming possible?” is just “there are no such structures — fooming is not possible this way”
One could ask about this allowing new structures to be formed. Like, literally, which processes go on to do/understand very much?
Are we happy with the answer ‘for example, a sufficiently big evolution’? I feel like we’re not happy with this answer despite it being true, so this question isn’t exactly what we’re meaning to ask. I do think it’s a very interesting question though, about which very many interesting things could be said. I think there will be many very substantively different ways to get such a process, and it’d be cool to understand much more about which processes work.
Okay, maybe we wanted to ask instead: “The formation/invention/discovery of which structure first makes such a process foom decently fast?”
I think there are plausibly many structures that could be formed in various orders here that would increase the speed of fooming. I think all of the following structures have a reasonable claim to having contributed significantly to humanity fooming: replication/evolution (made fooming much faster than physics without evolution), vision, (large gap here because i got bored with adding to this list,) learning stuff from parents, opposable thumbs, manual dexterity, persistence hunting in groups, tool use, tool-making/maintenance/improvement, using fire, language, culture, trade, writing, governance, schooling, mathematical proof, the printing press, the scientific method, democracy, the use of statistics, medicine, first-order logic, computers, programming, the internet (and really basically any technology, way of thinking, way of acting, concept, but these are some central ones)
I think the speed of fooming will be increased by many structures in the future, plausibly arbitrarily many times (until the end of time or whatever), and plausibly even including arbitrarily many new structures as profound as developing primitive language.
This kind of thing doesn’t seem like an entirely satisfying answer either.
Okay, maybe we only meant the structures which are in use when a current individual human paces around, temporarily doesn’t interact with anything external, and tries to make research progress?
I think very many interesting structures are in use here. I’ve mentioned some above.
E.g. each word has some little structures associated with it.
Maybe we’re asking for some higher structure organizing these little structures?
I think there’s a major aspect of: often, the context for each little structure is created by all the other little structures, not something external to these little structures.
Anyway, the external thing would also have a lot of rich structure.
If I had to pick something short to say about a separate higher organizing thing, I’d probably say that (1) effectively, we’re investigating how to think and then thinking in those ways (and this happens across the board, on all levels), and that (2) our concepts have close shadows in language / make a language together. And then maybe next i’d want to say a bunch more about language.
Maybe this is still too high-level? but idk, are we interested in neuronal weight update rules here?
Kaarel: Some sorta meta commentary: I think that “what’s the science algorithm?”/”how does problem-solving/thinking work?” is sometimes asked like one is asking for some sort of final formula for thinking. This is imo like asking “what’s the ultimate technology?” or “what’s the formula for mathematics?”. There’s no such thing as ‘a formula for mathematics’ — there are infinitely many cool things to be figured out about mathematical objects. I think there are also infinitely many cool things to be figured out about how to think well, and infinitely many cool structures to incorporate into our thinking.
Kaarel: I would guess that in our circles, this imo mistake is downstream of something like this mistake by yudkowsky, but idk.
Kaarel: Of course, non-ultimate variants of these questions, as in “let’s figure out more about intelligence”, are imo centrally interesting. I just don’t think these questions are the sorts of things to be ‘solved’ — they are the sorts of things to be indefinitely figured out better (it’s imo probably even appropriate to think of oneself as always only understanding a finite fragment of an infinite thing here, like in math)
Dmitry: I finally read Lucius’ slide show. I like it a lot.
Dmitry: A few comments however:
Dmitry: 1. I would say that SLT isn’t about the Bayesian prior (that’s just “information geometry” or whatever). It’s rather an extra few epicycles: giving more details on what the prior looks like geometrically.
Dmitry: It’s just not true that programs correctly sample the “Bayesian” prior. Findability/SGD is a massive part of what is actually learnable. This (not SLT/Bayesianism) is what explains modularity in particular.
Dmitry: I’m planning to make a post about this, but there are lots of ways to see this. The parity problem and noticing that harder problems show up in LLM programs (which doesn’t mean they’re not solvable, just that they’re not solvable by the most compressed/ “Solomonoff-preferred” way), it’s evident in the algorithm learned by associative groups composition, etc.
Dmitry: I think studying the Bayesian/informational prior is still very valuable and a great correction to what people talk about when they talk about priors (it’s just that SGD and learnability are even more important).
Dmitry: I think Kaarel actually said more or less the same thing. Basically: the SGD prior says that NN’s prefer the “most efficient, including degeneracy, way of solving a problem”. This is not true. In fact, as you both agree, NN’s *locally* prefer the “most efficient, including degeneracy, way of solving the problem in a way that’s a perturbation of a previous solution. This means that (with high probability) there needs to be a “learning story” connecting an instance of the algorithm that’s ultimately learned to a generic starting point. The solution predicted by the “true” Bayesian prior is not in fact learnable in any interesting real-world context.
Dmitry: One way to motivate this is, once again, that the most efficient way of learning parity isn’t in fact learnable (and this is provably the case). It’s still possible to train a NN with a more interesting/complex loss to solve the parity problem: e.g. if the NN has a concept of language and logic, in a suitable sense, it can solve parity in polynomial time. It’s just that, provably (in a certain sense), it’s not going to be finding the most optimal “Bayesian-preferred” solution, but rather routing around the un-findability of this solution and going though some more elaborate SGD path to find a less degenerate/natural solution that has the correct I/O behavior.
Dmitry: I think in practice, a lot of the time this means that the generalizing solution found will just be more modular than the optimal solution (this is the case for associative groups). But more generally, training stories can be more complicated, since you can GD to “edit” or post-process a highly modular but bad algorithm to a better but less parallel algorithm. Trying to model this is this discussion about priors that Kaarel described that we had when he was visiting.
Dmitry: Re: Kaarel’s “structure of intelligence”, I think I largely agree. But I think the difference between us is that I am somewhat strongly in the “interp is possible” camp. In physics, you get complex and interesting behavior from taking simple “interaction” behaviors at different characteristic scales and putting them together in a giant superposition. I think NN’s follow the same general framework, and it’s plausible that we’ll be “done” with interp (at least well enough to do things like guarantee non-deception from low-level behaviors) once we have suitably good notions of “interaction”, “characteristic scale”, and “putting together”.
Dmitry: I also think that there are ways in which we can get useful-for-safety interp while our understanding is still bad, so long as we get some partial completeness: e.g. (mostly) completely characterize what can happen at a particular characteristic scale.
Dmitry: This makes me particularly excited about this Sam Marks paper, since while I don’t think that “find all mechanisms of this shape” is a viable path, it would nevertheless be quite exciting and plausible to say something like “a large component of NN behavior is described by a superposition of diagrams of roughly this complexity” (this could be shown by looking at a random phrase and producing a collection of diagrams that explain it). This would tell us something nontrivial about the complexity and branchiness of LLM intelligence, without actually explaining it in a way that is reverse-engineerable (same as people do for physics).
Lucius:
I agree with this. These slides are giving a simplified 0-th order toy story to aid understanding. They’re really not complete without the accompanying talk.
Since this has now happened a second time, no seriously, these are just props from a small Apollo-internal talk that Jake asked for. They’re not an essay.
Lucius:
I feel like there’s some miscommunication here. Let me back up and try to put my words more in the context that produced them.
In the year 2020, if you had asked me to write down a practical, real general intelligence algorithm, I would have told you that I have no clue how to do that. Not the best possible algorithm. Just any algorithm at all. Part of why I was unable to do that might be lack of knowledge about what you call many interesting thinking-structures. But it felt to me like there was a much graver and deeper problem there than just sufficient lack of knowledge about a vast and rich vista of tools for thinking. I certainly did not understand a lot of the little tricks and ingredients that maybe help with generality to the point that I could formalise them as code. But I felt that there was something even more central there that I was far more fundamentally confused about: How do you make an algorithm that generalises at all, ever, even a little bit, from old data to new data?
Even before deep learning, this felt like maybe the most core question here to me. Though I wouldn’t have put it in those terms back then, I was a teenager without the math vocab to express this. It’s kind of in the name: general intelligence. When observing my own thoughts while solving problems, or thinking about how the brain learned over the long term, I felt like this machinery had to somehow, constantly and at every step, navigate a vast number of potentially correct local fits to the data and think only about the ones that generalised to new data. Learning about Solomonoff induction at least clarified how picking a good hypothesis out of many that fit the data was possible at all even in mathematical principle. But my brain was clearly purposefully picking out simple hypotheses, rather than tracking every hypothesis possible. At every moment of conscious perception, the thoughts coming into my conscious awareness seemed to me to be completely undetermined by previous thoughts. There was some seeming ability there to always continue the line of fuzzy logic such that the overall chain was likely to generalise and reach the target, when exponentially many other continuations would have failed. It felt to me like however that trick worked, it might also be the trick through which I could learn to walk or hold a spoon, since this process seemed to have the same generalisation-shaped hole in the middle.
So it used to feel to me that there was an utterly crucial core insight missing here, about how to get any generality whatsoever. I think I was not alone in this. I very much had the impression that Yudkowsky and probably MIRI more generally thought about it in a similar way, for example. Among many others.
I was so confused the first time I saw an explainer video on deep learning. I had heard that they got computers to learn now, so I thought they must have figured out how to solve the generalisation problem. I kept rewatching the video looking for the part where the generalisation trick happens, and came up empty.
Maybe this doesn’t feel like such a central question to slightly younger people who grew up more used to the observed fact that deep learning works. Or maybe you just never made the mistake I and perhaps many others made for so many years when thinking about function fits and the Solomonoff prior, so you never became deeply confused in the first place. But to me, this question of generalisation was the question. All the other stuff that might or might not go into general intelligence could be rich and vast and poorly understood, but it never felt completely mysterious and utterly indispensable to AGI to me the way the generalisation problem did.
If you never thought this was all that mysterious and that it seemed explicable with the math we already had, Bayes points to you. In that case, I could see how you’d be confused or annoyed by people like me walking around ranting about this basic fact like it’s some grant revelation and we understand AGI now. But if you put yourself in my shoes, can you see how slowly resolving this very basic confusion over the last 2-3 years would feel like progress to me?
Kaarel: Ok i’m on board with there being some progress here :)
Bridging NN SGD and Solomonoff induction (from Oct 2024)
Kaarel: Dmitry and I had a few conversations about the stuff he and Louis Jaburi (et al.) are figuring out about how NNs learn group operations in which we discussed how it would be cool to try to spell out an analogy between the kind of structure-selection seen empirically for learning group operations and Solomonoff induction, and how it might be helpful to have some explicit story like that written up for people to have a crisp example to consider when thinking about how NN learning is or isn’t like Solomonoff induction.
Lucius: I dunno about using that as the specific example, but I agree that spelling out the story as we currently see it more explicitly for everyone might be a good idea.
Lucius: I’d love to compare/contrast:
Solomonoff induction
MLP or transformer trained with Bayesian updating, starting from a uniform or Gaussian prior over parameters.
MLP or transformer trained with some local optimisation algorithm like sgd, starting from gaussian or uniform initialisation.
Lucius: We can’t actually run that empirically for most realistic problems for the first two options, of course. So either we keep those to abstract discussion, or we find a really neat example where you can do it on paper.
Kaarel: To say a bit more about the group operation Solomonoff thing: Dmitry was telling me that it’s looking plausible that one could develop a good description whose broad strokes are: “There’s a prior on circuits. These circuits are of some known kind: mapping both inputs into some ‘small’ algebraic object, multiplying there, and mapping back; each such circuit is only somewhat better than random at getting the right output group element but the signal beats the noise when sufficiently many circuits are added. the output of a trained NN on an input (g,h) is a linear combination of outputs of random circuits on (g,h)drawn from this prior. (these circuits get ‘drawn’ during NN learning.) And by this I of course don’t just mean to state a fact about the input-output behavior, but to also claim that the NN is structured this way.”
Kaarel: Given my surface understanding it seems plausible to me that with some significant work the prior could also be explicitly described here.
Kaarel: I guess the thought is that the trained NN looks a bit like a subroutine of a Solomonoff inductor here: “The NN has a bunch of small computations it knows how to execute. To predict an output, it executes all of them and aggregates their outputs. This is like the part of Solomonoff induction that happens after algorithms which have previously made a wrong prediction are eliminated: you run all remaining algorithms and aggregate their predictions for the next bit.”
Kaarel: And NN training is when the small patterns that are to be used get selected. That’s like the subroutine of a Solomonoff inductor that selects which algorithms get their opinions about the next bit taken into account. So we have some analogy between [NN training + inference] and solomonoff induction here.
Kaarel: Of course, the above story is naive in that really there are interaction terms between the circuits — a more sophisticated picture would talk about the ecology of circuits. concretely, e.g. if one looks at the fraction of circuits which make an error at input (g,h), calculating the variance of this quantity as one varies the input, then it’s plausible this variance will be smaller than what one would predict if the circuits were drawn independently, because ‘the circuits are selected to cover each other’s weaknesses’. But the hope is that this kind of thing can be thought as a higher-order correction on top of the above first-order story.
Kaarel: Anyway, I like the themes of
Training producing many algorithms which are literally added together (instead of producing something that’d feel more like a single unified algorithm) — this feels related to how it would be silly to replace Solomonoff induction’s giving weight to a bunch of algorithms by a variant where one only uses the single simplest Turing machine that fits the data so far to predict the next bit.
The algorithms being from some impoverished/weird repertoire, individually making many mistakes, being individually ill-suited for the problem at hand, but still each having a bit of signal and so law-of-large-numbers-ing up to something sensible.
Acknowledgements
Lucius: Thanks to Alexander Odenziel, David Quarel, Matthias Dellago and probably a bunch of people my brain failed to track for discussions and insight that helped put the ideas in these slides together. I also don’t claim any originality here. A lot of this is arguably latent in Solomonoff’s original work already, and it would not surprise me much if many people in algorithmic information theory or elsewhere put something like this story together at some point over the years. It’s also not the first time Singular Learning Theory has been linked to algorithmic information theory in some way. As pointed out by Daniel Murfet here, Clift-Murfet-Wallbridge used a version of the padding argument to show that the learning coefficient is (in a sense) a lower bound on the Kolmogorov complexity in the setting of noisy Turing machines, and a follow-up result of that was treated in Thomas Waring’s master thesis.
Kaarel: I should also thank Sam Eisenstat, Tsvi Benson-Tilsen, Jake Mendel, Simon Skade, Jessica Taylor, for discussions + many of the ideas in my messages.
If you’re used to seeing the Solomonoff prior formulated as summing over all programs, with each program exponentially weighted by its length, the way it’s written down here: That’s for prefix Turing machines. There’s an equivalent definition for plain Turing machines where you just take a uniform distribution over all programs of length l, then take the limit l→∞.