Solomonoff Induction (SI) focuses on short code. What’s short in English is not necessarily short in code, and vice versa. Most of our intuition in favor of short, simple explanations is from our experience with English, not code. Is there literature arguing that code and English brevity usually or always correspond to each other? If not, then most of our reasons for accepting Occam’s Razor wouldn’t apply to SI.
Another way to think of the issue may be that coding is a low level or reductionist way to deal with an issue, while English is a high level approach that uses high level tools like explanations. Ideas can be represented in many ways, including at different levels of abstraction. It’s unclear that length or simplicity is consistent for the same idea across different levels of abstraction. That is, if you have two ideas X and Y, and X is simpler and shorter when you compare both ideas at a one level of abstraction, it may be unsafe to assume that X will also be simpler and shorter than Y when you compare them at a different level of abstraction. Is there any literature which addresses this?
The relevant argument is equivalence of SI on different universal Turing machines, up to a constant. Briefly: if we have a short program on machine M1 (e.g. python), then in the worst case we can write an equivalent program on M2 (e.g. LISP) by writing an M1-simulator and then using the M1-program (e.g. writing a python interpreter in LISP and then using the python program). The key thing to notice here is that the M1-simulator may be long, but its length is completely independent what we’re predicting—thus, the M2-Kolmogorov-complexity of a string is at most the M1-Kolmogorov-complexity plus a constant (where the constant is the length of the M1-simulator program).
Applied to English: we could simulate an English-speaking human. This would be a lot more complicated than a python interpreter, but the program length would still be constant with respect to the prediction task. Given the English sentence, the simulated human should then be able to predict anything a physical human could predict given the same English sentence. Thus, if something has a short English description, then there exists a short (up to a constant) code description which contains all the same information (i.e. can be used to predict all the same things).
Two gotchas to emphasize here:
The constant is big—it includes everything an English-speaking human knows, from what-trees-look-like to how-to-drive-a-car. All the hidden complexity of individual words is in that constant (or at least the hidden complexity that a human knows; things a human doesn’t know wouldn’t be in there).
The English sentence is a “program” (or part of a program), not data to be predicted; whatever we’re predicting is separate from the English sentence. (This is implicit in the OP, but somebody will likely be confused by it.)
I agree with this, but I don’t think it answers the question. (i.e. it’s not a relevant argument^([1]))
There’s a large edge case where the overhead constant is ~greater than the program. in those cases it’s not the case that simplicity transitions across layers of abstraction.
That edge case means this doesn’t follow:
[1]: Edit: it could be relevant but not the whole story; but in that case it’s missing a sizable chunk.
The solution to the “large overhead” problem is to amortize the cost of the human simulation over a large number of English sentences and predictions. We only need to specify the simulation once, and then we can use it for any number of prediction problems in conjunction with any number of sentences. A short English sentence then adds only a small amount of marginal complexity to the program—i.e. adding one more sentence (and corresponding predictions) only adds a short string to the program.
That seems a fair approach in general, like how can we use the program efficiently/profitably, but I don’t think it answers the question in OP. I think it actually actually implies the opposite effect: as you go through more layers of abstraction you get more and more complex (i.e. simplicity doesn’t hold across layers of abstraction). That’s why the strategy you mention needs to be over ever larger and larger problem spaces to make sense.
So this would still mean most of our reasoning about Occam’s Razor wouldn’t apply to SI.
I’m not sure we (humanity) know enough to claim only a short string needs to be added. I think GPT-3 hints at a counter-example b/c GTP has been growing geometrically.
Moreover, I don’t think we have any programs or ideas for programs that are anywhere near sophisticated enough to answer meaningful Qs—unless they just regurgitate an answer. So we don’t have a good reason to claim to know what we’ll need to add to extend your solution to handle more and more cases (especially increasingly technical/sophisticated cases).
Intuitively I think there is (physically) a way to do something like what you describe efficiently because humans are an example of this—we have no known limit for understanding new ideas. However, it’s not okay to use this as a hypothetical SI program b/c such a program does other stuff we don’t know how to do with SI programs (like taking into account itself, other actors, and the universe broadly).
If the hypothetical program does stuff we don’t understand and we also don’t understand its data encoding methods, then I don’t think we can make claims about how much data we’d need to add.
I think it’s reasonable there would be no upper limit on the amount of data we’d need to add to such a program as we input increasingly sophisticated questions. I also think it’s intuitive there’s no upper limit on this data requirement (for both people and the hypothetical programs you mention).
One somewhat silly reason: for any simple English hypothesis, we can convert it to code by running a simulation of a human and giving them the hypothesis as input, then asking them to predict what will happen next. Therefore the English and code-complexity can differ by at most a constant.
This gives a very loose bound, since it will probably take a lot of bits to specify a human mind. In practice I think the two complexities will usually not differ by too much, because coding languages were designed to be understandable by humans and have syntax similar to human languages. There are some difficult cases like descriptions of visual objects, but even here neural networks should be able to bridge the gap to an extent(in the limit this just becomes the ‘encode a human brain’ strategy)
Regarding ‘levels of abstraction’, I’m not sure if this is a big obstacle, as most programming languages have built-in mechanisms for changing levels of abstraction. e.g. functional programming languages allow you to treat functions as objects.
Some things are quick for people to do and some things are hard. Some ideas have had multiple people continuously arguing for centuries. I think this either means you can’t apply a simulation of a person like this, or some inputs have unbounded overhead.
You should include all levels of abstraction in your reasoning, like raw bytecode. It’s both low level and can be written by humans. It’s not necessarily fun but it’s possible. What about things people design at a transistor level?
I use Haskell and have no idea what you’re talking about.
I think abstraction in programming is different to what you mean; e.g. dictionary might a simple data structure to use, but a list of tuples is harder to use even though you can implement a dictionary using a list of tuples. the abstraction layer (the implementation) is what makes complex operations on list of tuples into simple operations on dictionaries.
Solomonoff induction is fine with inputs taking unboundedly long to run. There might be cases where the human doesn’t converge to a stable answer even after an indefinite amount of time. But if a “simple” hypothesis can have people debating indefinitely about what it actually predicts, I’m okay with saying that it’s not actually simple(or that it’s too vague to count as a hypothesis), so it’s okay if SI doesn’t return an answer in those cases.
Why do you need to include those things? Solomonoff induction can use any Turing-complete programming language for its definition of simplicity, there’s nothing special about low-level languages.
I mean you can pass functions as arguments to other functions and perform operations on them.
Regarding dictionary/list-of-tuples, the point is that you only have to write the abstraction layer *once*. So if you had one programming language with dictionaries built-in and other without, the one with dictionaries gets at most a constant advantage in code-length. In general two different universal programming languages will have at most a constant difference, as johnswentworth mentioned. This means that SI is relatively insensitive to the choice of programming language: as you see more data, the predictions of 2 versions of Solomonoff induction with different programming languages will converge.
Yeah okay, I think that’s fair.
My issue generally (which is in my reply to johnswentworth) is the overhead is non-negligible if you’re going to invoke a human. In that case we can’t conclude that simplicity would carry over from the english representation to the code representation. So this argument doesn’t answer the question.
You do say it’s a loose bound, but I don’t think it’s useful. One big reason is that the overhead would dwarf any program we’d ever run, and pretty much every program would look identical b/c of the overhead. For simplicity to carry over we need relatively small overhead (even like the entire python runtime is only like 20mb extra via py2exe, much smaller than a mind and definitely not simple).
Maybe it’s worth mentioning the question in OP: I read it as: “why would stuff the simplicity an idea had in one form (code) necessarily correspond to simplicity when it is in another form (english)? or more generally: why would the complexity of an idea stay roughly the same when the idea is expressed through different abstraction layers?” After that there’s implications for Occam’s Razor. Particularly it’s relevant b/c occam’s razor would give different answers when comparing ideas at different levels of abstraction, and if that’s the case we can’t be sure that ideas which are simple in english will be simple in code and we don’t have a reason for Occam’s Razor applying to SI.
Does that line up with what you think OP is about? If not we might be talking cross-purposes.
Ahh okay; first class functions.
Re “perform operations on [functions]”: you can make new functions and partially or fully apply functions, but that’s about it. (that does mean you can partially apply functions and pass them on, though, which is super useful)
I agree with you that the theoretical upper bound on the minimum overhead is the size of a compiler/interpreter.
I think we might disagree on this, though: the compiler/interpreter includes data such as initial conditions (e.g. binary extensions, dynamic libraries, etc). I think this is an issue b/c there’s no upper bound on that. If you invoke a whole person then it’s an issue b/c for that person to solve more and more complex problems (or a wider and wider array) those initial conditions are going to grow correspondingly. Our estimates for the data requirements to store a mind are like 1020 bits. I’d expect the minimum required data to drop as problems got “simpler”, but my intuition is that pattern is not the same pattern as what Occam’s Razor gives us (e.g. minds taking less data can still think about what Thor would do).
I think that the argument about emulating one Turing machine with another is the best you’re going to get in full generality. You’re right that we have no guarantee that the explanation that looks simplest to a human will also look the simplest to a newly-initialized SI, because the ‘constant factor’ needed to specify that human could be very large.
I do think it’s meaningful that there is at most a constant difference between different versions of Solomonoff induction(including “human-SI”). This is because of what happens as the two versions update on incoming data: they will necessarily converge in their predictions, differing at most on a constant number of predictions.
So while SI and humans might have very different notions of simplicity at first, they will eventually come to have the same notion, after they see enough data from the world. If an emulation of a human takes X bits to specify, it means a human can beat SI at binary predictions at most X times(roughly) on a given task before SI wises up. For domains with lots of data, such as sensory prediction, this means you should expect SI to converge to giving answers as good as humans relatively quickly, even if the overhead is quite large*.
The quantity that matters is how many bits it takes to specify the mind, not store it(storage is free for SI just like computation time). For the human brain this shouldn’t be too much more than the length of the human genome, about 3.3 GB. Of course, getting your human brain to understand English and have common sense could take a lot more than that.
*Although, those relatively few times when the predictions differ could cause problems. This is an ongoing area of research.
In that case I especially don’t think that argument answers the question in OP.
I’ve left some details in another reply about why I think the constant overhead argument is flawed.
I don’t think this is true. I do agree some conclusions would be converged on by both systems (SI and humans), but I don’t think simplicity needs to be one of them.
Uhh, I don’t follow this. Could you explain or link to an explanation please?
I don’t think that applies here. I think that data is part of the program.
You would have to raise the program like a human child in that case^1. Can you really make the case you’re predicting something or creating new knowledge via SI if you have to spend (the equiv. of) 20 human years to get it to a useful state?
How would you ask multiple questions? Practically, you’d save the state and load that state in a new SI machine (or whatever). This means the data is part of the program.
Moreover, if you did have to raise the program like any other newborn, you have to use some non-SI process to create all the knowledge in that system (because people don’t use SI, or if they do use SI, they have other system(s) too).
1: at least in terms of knowledge; though if you used the complete human genome arguably you’d need to simulate a mother and other ppl too, but they have to be good simulations after the first few years, which is a regressive problem. So it’s probably easier to instantiate it in a body and raise it like a person b/c human people are already suitable. You also need to worry about it becoming mistaken (intuitively one disagrees with most people on most things we’d use an SI program for).
Intuitive explanation: Say it takes X bits to specify a human, and that the human knows how to correctly predict whatever sequence we’re applying SI to. SI has to find the human among the other 2^X programs of length X. Say SI is trying to predict the next bit. There will be some fraction of those 2^X programs that predict it will be 0, and some fraction predicting 1. There fractions define SI’s probabilities for what the next bit will be. Imagine the next bit will be 0. Then SI is predicting badly if greater than half of those programs predict a 1. But then, all those programs will be eliminated in the update phase. Clearly, this can happen at most X times before most of the weight of SI is on the human hypothesis(or a hypothesis that’s just as good at predicting the sequence in question)
The above is a sketch, not quite how SI really works. Rigorous bounds can be found here, in particular the bottom of page 979(“we observe that Theorem 2 implies the number of errors of the universal predictor is finite if the number of errors of the informed prior is finite...”). In the case where the number of errors is not finite, the universal and informed prior still have the same asymptotic rate of growth of error (error of universal prior is in big-O class of error of informed prior)
When I say the ‘sense of simplicity of SI’, I use ‘simple program’ to mean the programs that SI gives the highest weight to in its predictions(these will by definition be the shortest programs that haven’t been ruled out by data). The above results imply that, if humans use their own sense of simplicity to predict things, and their predictions do well at a given task, SI will be able to learn their sense of simplicity after a bounded number of errors.
I think you can input multiple questions by just feeding a sequence of question/answer pairs. Actually getting SI to act like a question-answering oracle is going to involve various implementation details. The above arguments are just meant to establish that SI won’t do much worse than humans at sequence prediction(of any type) -- so, to the extent that we use simplicity to attempt to predict things, SI will “learn” that sense after at most a finite number of mistakes(in particular, it won’t do any *worse* than ‘human-SI’, hypotheses ranked by the shortness of their English description, then fed to a human predictor)
Solomonoff Induction has a free choice of variable, namely the use of the Universal Turing Machine (UTM) on which to run it, which is equivalent to the choice of language (e.g. whether to choose Python or C or some formalization of the english language). How to choose that variable is indeed kind of an open question, and not super obvious, but so are many other things about how to actually use SI in practice (obviously the same is true for less formal versions of Occam’s Razor).
However, you can still prove a substantial number of things about SI without knowing the UTM with which it is run, because as johnswentworth says, there is at most a fixed constant of difference in the length of the resulting programs and in many inference problems it makes sense to think about the convergent behavior over long periods of time/many pieces of evidence, for which all SI machines will converge to the same beliefs.
I do concretely think that this choice of open variable is a really big one, and not one that we should forget when reasoning about SI.
Brevity of code and english can correspond via abstraction.
I don’t know why brevity in low and high abstraction programs/explanations/ideas would correspond (I suspect they wouldn’t). If brevity in low/high abstraction stuff corresponded; isn’t that like contradictory? If a simple explanation in high abstraction is also simple in low abstraction then abstraction feels broken; typically ideas only become simple after abstraction. Put another way: the reason to use abstraction is to make ideas/thing that are highly complex into things that are less complex.
I think Occam’s Razor makes sense only if you take into account abstractions (note: O.R. itself is still a rule of thumb regardless). Occam’s Razor doesn’t make sense if you think about all the extra stuff an explanation invokes—partially because that body of knowledge grows as we learn more, and good ideas become more consistent with the population of other ideas over time.
When people think of short code they think of doing complex stuff with a few lines of code. e.g.
cat asdf.log | cut -d ',' -f 3 | sort | uniq
. When people think of (good) short ideas they think of ideas which are made of a few well-established concepts that are widely accessible and easy to talk about. e.g. we have seasons because energy from sunlight fluctuates ~sinusoidally through our annual orbit.One of the ways SI can use abstraction is via the abstraction being encoded in both the program, program inputs, and the observation data.
(I think) SI uses an arbitrary alphabet of instructions (for both programs and data), so you can design particular abstractions into your SI instruction/data language. Of course the program would be a bit useless for any other problem than the one you designed it for, in this case.
I don’t know of any.
I think some of the reasoning makes sense in a pointless sort of way. e.g. the hypothesis
1100
corresponds to the program “output 1 and stop”. The input data is from an experiment, and the experiment was “does the observation match our theory?”, and the result was1
. The program1100
gets fed into SI pretty early, and it matches the predicted output. The reason this works is that SI found a program which has info about ‘the observation matching the theory’ already encoded, and we fed in observation data with that encoding. Similarly, the question “does the observation match our theory?” is short and elegant like the program. The whole thing works out because all the real work is done elsewhere (in the abstraction layer).On the literature that addresses your question: here is a classic LW post on this sort of question.
You point out that length of a description in English and length in code don’t necessarily correlate. I think for English sentences that are actually constraining expectations, there is a fairly good correlation between length in English and length in code.
There’s the issue that the high-level concepts we use in English can be short, but if we were writing a program from scratch using those concepts, the expansion of the concepts would be large. When I appeal to the concept of a buffer overflow when explaining how someone knows secrets from my email, the invocatory phrase “buffer overflow” is short, but the expansion out in terms of computers and transistors and semiconductors and solid state physics is rather long.
But I’m in the game of trying to explain all of my observations. I get to have a dictionary of concepts that I pay the cost for, and then reuse the words and phrases in the dictionary in all my explanations nice and cheaply. Similarly, the computer program that I use to explain the world can have definitions or a library of code, as long as I pay the cost for those definitions once.
So, I’m already paying the cost of the expansion of “buffer overflow” in my attempt to come up with simple explanations for the world. When new data has to be explained, I can happily consider explanations using concepts I’ve already paid for as rather simple.
The linked post doesn’t seem to answer it, e.g. in the 4th paragraph EY says:
I also don’t think it fully addresses the question—or even partially in a useful way, e.g. EY says:
But this bakes in knowledge about measuring stuff. Maxwell’s equations are—in part—easier to code because we have a way to describe measurements that’s easy to compute. That representation is via an abstraction layer! It uses labels for concepts too.