Is there a best language in which to express complexity for use in the context of Occam’s razor?
If there is a best language in which to express complexity for use in the context of Occam’s razor, what is that language?
I suspect the answer is no, at least not for the kinds of formal languages that have been suggested so far. The problem is this: as soon as you define a formal language, I can say “the lexicographically first object which can’t be described in less than a million bits in your language”. Given the uniqueness of this object, why should it be a priori as unlikely as a random million-bit string?
Note that in general, this sort of specification is uncomputable. To find the lexicographically first object which can’t be described in less than a million bits, we would have to make a list of all objects which can be described in less than a million bits (and return the first thing that’s not on the list). But we don’t know whether our UTM will halt on any given program of less than a million bits, so we can never know whether our list is complete.
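Here is a minimal Python sketch of that enumeration, assuming a hypothetical run_utm(program, max_steps) that simulates whichever UTM we picked and returns its output, or None if it has not halted within max_steps steps. The step cutoff is exactly where this stops being the real thing: without it we would need a halting oracle to know the list is complete (and of course looping over all 2^n programs is astronomically infeasible for n anywhere near a million).

```python
from itertools import count, product

def short_descriptions(run_utm, max_len_bits, max_steps):
    """Outputs of every program shorter than max_len_bits that halts
    within max_steps steps.  Without the step cutoff this set is not
    computable: a short program might halt only after the cutoff (or
    never), so we could never know the collection is complete."""
    described = set()
    for n in range(max_len_bits):                        # lengths 0 .. max_len_bits - 1
        for bits in product("01", repeat=n):
            result = run_utm("".join(bits), max_steps)   # None if no halt observed
            if result is not None:
                described.add(result)
    return described

def first_undescribed(run_utm, max_len_bits, max_steps):
    """Lexicographically first string not in the (approximate) set of
    objects describable in fewer than max_len_bits bits."""
    described = short_descriptions(run_utm, max_len_bits, max_steps)
    for n in count(0):                                   # strings in length-then-lex order
        for bits in product("01", repeat=n):
            candidate = "".join(bits)
            if candidate not in described:
                return candidate
```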
Yes, and that’s sort of intentional. I was trying to come up with a mathematical model of an agent that can deal with uncomputable physics. The physics of our universe seems likely to be computable, but there is no a priori reason to assume that it must be. We may eventually discover a law of physics that’s not computable, or find out that we are in a simulation running inside a larger universe that has uncomputable physics. Agents using UTM-based priors can’t deal with these scenarios.
So I tried to find a “better”, i.e., more expressive, language for describing objects, but then realized that any fixed formal language has a similar problem. Here’s my current idea for solving this: make the language extensible instead of fixed. That is, define a base language, and a procedure for extending the language. Then, when the agent encounters some object that can’t be described concisely using his current language, he recursively extends it until a short description is possible. What the extension procedure should be is still unclear.
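A bare-bones Python sketch of the shape of that proposal. Everything here is a placeholder made up for illustration: the base_describe callback stands in for the base language, and the extension step just coins a fresh primitive naming the troublesome object, which is the degenerate move; the whole open question is what a principled extension rule would look like.

```python
class ExtensibleLanguage:
    """Toy model of an extensible description language: a fixed base
    language plus a growing table of named primitives."""

    def __init__(self, base_describe):
        # base_describe(obj) returns a description string, or None if the
        # base language has no description for obj.
        self.base_describe = base_describe
        self.primitives = {}                  # name -> the object it denotes

    def describe(self, obj, max_bits):
        """Return a description shorter than max_bits, or None."""
        d = self.base_describe(obj)
        if d is not None and 8 * len(d) < max_bits:
            return d
        for name, val in self.primitives.items():
            if val == obj:
                return name                   # one primitive symbol = short description
        return None

    def extend_for(self, obj):
        """Placeholder extension step: coin a fresh primitive for obj.
        A real proposal needs a principled rule here."""
        name = f"prim{len(self.primitives)}"
        self.primitives[name] = obj
        return name

def describe_or_extend(lang, obj, max_bits=1_000_000):
    d = lang.describe(obj, max_bits)
    return d if d is not None else lang.extend_for(obj)

# Usage with a toy base language that only describes integers:
lang = ExtensibleLanguage(lambda obj: repr(obj) if isinstance(obj, int) else None)
print(describe_or_extend(lang, 42))        # '42'    (base language suffices)
print(describe_or_extend(lang, object()))  # 'prim0' (language got extended)
```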
I agree that there are very interesting questions here. We have quite natural ways of describing uncomputable functions very far up the arithmetical hierarchy, and it seems that they can be described in some kind of recursive language even if the things they describe are not recursive (using recursive in the recursion-theory sense both times). Turing tried something like this in Systems of Logic Based on Ordinals (Turing, 1939), but that was with formal logic: systems where you repeatedly add the Gödel sentence of a system back into the system as an axiom, repeating this into the transfinite. A similar thing could be done describing levels of computability transfinitely far up the arithmetical hierarchy, using recursively represented ordinals to index them. However, then people like you and I will want to use certain well-defined but non-recursive ordinals to do the indexing, and it seems to degenerate in the standard kind of way, just a lot further up the hierarchy than before.
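A rough sketch of both constructions in standard notation, with G_T the Gödel sentence of theory T, X' the Turing jump of X, and the stages indexed by notations for recursive ordinals:

```latex
T_0 = \mathrm{PA}, \qquad
T_{\alpha+1} = T_\alpha + G_{T_\alpha}, \qquad
T_\lambda = \bigcup_{\alpha<\lambda} T_\alpha \ \ (\lambda\ \text{limit});
\qquad
\emptyset^{(0)} = \emptyset, \qquad
\emptyset^{(\alpha+1)} = \bigl(\emptyset^{(\alpha)}\bigr)', \qquad
\emptyset^{(\lambda)} = \bigoplus_{\alpha<\lambda} \emptyset^{(\alpha)}.
```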
And then there are objects that are completely outside the arithmetical hierarchy, but we probably shouldn’t assign zero prior probability to those either. Things like large cardinals, perhaps.
Another mystery is, why did evolution create minds capable of thinking about these issues, given that agents equipped with a fixed UTM-based prior would have done perfectly fine in our place, at least up to now?
We aren’t really Bayesian reasoning machines at all, and it isn’t really accurate to speak of us having a prior. We choose a prior in order to use Bayesian reasoning to analyze a situation, and we seek to bend our natural reasoning to a Bayesian template in order to improve its accuracy, but we cannot wholly succeed in doing so. So the problem you raise should worry someone building AGI, but it’s not realistic to imagine a human agent becoming so Bayesian that they swallow the Solomonoff prior whole and are literally unable to contemplate super-Turing Universes.
I don’t think it’s unreasonable, therefore, to adopt the Solomonoff prior as a useful model to aid reasoning and discussion, and rely on our human ability to make and adopt a new, super-Turing model if some more general prior would have favoured it.
Then how can you deal with these scenarios? Did the idiot God make you better equipped for this task, Oh uncomputable ape-brain?
The idea of agents using UTM-based priors is a human invention, and therefore subject to human error. I’m not claiming to have an uncomputable brain, just that I’ve found such an error.
For a specific example of how human beings might deal with such scenarios, compared to agents using UTM-based priors, see “is induction unformalizable?”.
A model of the environment deals in observations and behaviors, not in statements about “uncomputability” and such. No observation should be left out or declared impossible. If you, as a human, decide to trust in something you label a “halting oracle”, that’s your decision, and it is a decision you’d want any trusted AI to carry through as well.
I suspect that the roots of this confusion are something not unlike the mind projection fallacy, with magical properties attributed to models, but I’m not competent to discuss domain-specific aspects of this question.
Not a serious problem—just consider objects which can’t be described in less than a million bits after a million timesteps instead.
If a problem is so serious that you have to change your prior, then I’d call that a serious problem.
Did you understand why the problem is not serious? Your comment was nit-picking the example. Other similar examples make exactly the same point—but are not vulnerable to such nit-picking.
It doesn’t sound too serious—if all the candidate languages have the same problem.
It is hard to deny that some languages are better than others, and therefore that there is a set of “best” ones. The intent of my first question was more along the lines of: is one formulation of Occam’s razor optimal for all the different situations in which Occam’s razor is used, or should we be using different descriptive languages under different circumstances?
That depends on what you mean by “best”.
Is speed of calculation important? What about suitability for humans? I guess you want one where complexities are as small as possible.
Given two languages, L1 and L2, with complexity measures K1 and K2: if K1(L2) < K2(L1), I take that as a sign that L1 is better for use in the context of Ockham’s razor. It is also a sign that L1 is more complex than L2, but that effect can be removed by doing lots of comparisons like this, so the unnecessarily complex languages lose against those that are actually good. Or one can use three languages in a comparison: K1(L3) < K2(L3) is a sign that L1 is better.
The idea here is that a good language should more easily represent systems, such as other languages.
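A toy Python sketch of that tournament idea. The interpreter lengths below are made-up placeholders, not measured values; in practice one would have to measure (or bound) the size of an actual cross-interpreter for each pair of languages.

```python
# Hypothetical lengths, in bits, of the shortest program in the first
# language that interprets the second language.
INTERP_BITS = {
    ("L1", "L2"): 900,   ("L2", "L1"): 1400,
    ("L1", "L3"): 800,   ("L3", "L1"): 1000,
    ("L2", "L3"): 1100,  ("L3", "L2"): 950,
}

def pairwise_better(a, b):
    """a beats b if describing b in a is cheaper than describing a in b,
    i.e. K_a(b) < K_b(a)."""
    return INTERP_BITS[(a, b)] < INTERP_BITS[(b, a)]

def tournament(langs):
    """Count pairwise wins; doing many comparisons is what washes out the
    advantage a language gets merely from being very simple itself."""
    wins = {lang: 0 for lang in langs}
    for a in langs:
        for b in langs:
            if a != b and pairwise_better(a, b):
                wins[a] += 1
    return wins

print(tournament(["L1", "L2", "L3"]))   # {'L1': 2, 'L2': 0, 'L3': 1}
```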
What sort of languages are best by this measure? Not the simplest ones, but rather the more expressive ones. Programs for Turing machines spend some of their complexity on tape traversal and storage schemes, so I guess they are not so good. Cellular automata have the same problem. Combinatory Logic is very simple, but even simple things require complex combinations of operators; its system for storage and access can be simpler, though, since it is a tree. I guess Lambda Calculus is one of the better languages, as it is both simple and very expressive, because of its more direct tree structure.
I guess the best language would use an even more expressive structure, such as graphs, perhaps bi-graphs, since those are maximally expressive. The Scheme language can be used as a general graph language thanks to the set-car! and set-cdr! procedures.
“Best”: most accurate—i.e. when Occam’s razor says A is a more likely hypothesis than B, then that is actually true.
Then I would go for Turing machines, Lambda calculus, or similar. These languages are very simple, and can easily handle input and output.
Even simpler languages, like the Rule 110 cellular automaton or Combinatory Logic, might be better, but those are quite difficult to get to handle input and output correctly.
The reason simple languages, or simple universal machines, should be better is that, according to algorithmic information theory, the error in the estimated probabilities is bounded by a factor of 2 to the power of the complexity of the program that simulates one language in the other.
So the simpler the language is, the more accurate the relative probabilities it gives.
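In symbols, writing m_i(x) = 2^{-K_i(x)} as a simplified stand-in for the prior based on language Li, and c12 for the length of the shortest L1-program that interprets L2, this is the usual invariance theorem:

```latex
K_1(x) \;\le\; K_2(x) + c_{12} \quad \text{for all } x,
\qquad\text{hence}\qquad
m_1(x) \;=\; 2^{-K_1(x)} \;\ge\; 2^{-c_{12}}\, m_2(x).
```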
The answer I gave before this one was for a different question: maximize the likelihood that the language will produce the output corresponding to the observed events.
You, however, want to compare two hypotheses, get two probabilities, and see which is larger. In that case the absolute sizes of the probabilities do not matter, only their relative size, so you can just go for the simplest language that is practical to use.
What do you want to use this for?
Using simple languages is the conventional approach. However, simple languages typically result in more complex programs. The game of life is very simple—yet try writing a program in it. If you are trying to minimise the size of an emulator of other languages, then highly simple languages don’t seem to fit the bill.
Why would one want a decent formulation of Occam’s razor? To help solve the problem of the priors.
I agree. We seem to have the same goal, so my first advice stands, not my second.
I am currently trying to develop a language that is both simple and expressive, and making some progress. The overall design is finished, and I am now down to what instructions it should have. It is a general bi-graph, but with a sequential program structure, and no separation of program and data.
It is somewhat different from what you want, as I also need something that has measurable use of time and memory, and is provably able to run fast.
Could I have a link, or maybe some more information about this language? It’s something I’m very interested in (merging expressiveness and guarantees about resource usage).
I’m sure that’s not possible to do in general (in an algorithmic fashion), since it would let you solve in a flash certain mathematical problems that aren’t supposed to be solvable in polynomial time, or even solvable at all (like particular halting problems).
You’ll have to hope for something less.
I think Tim’s intended meaning was more along the lines of “a language L such that when A is a more likely hypothesis than B, L’s MDL for A is shorter than its MDL for B more often than for any other language”.
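A toy Python sketch of that criterion. The hypothesis pairs, the “true” more-likely labels, and the per-language description lengths below are all made-up placeholders; the point is only to show what “more often than for any other language” would mean operationally.

```python
# Each case: which hypothesis is actually more likely, and each
# language's minimum description length (in bits) for A and for B.
CASES = [
    {"more_likely": "A", "mdl": {"Lang1": {"A": 120, "B": 150},
                                 "Lang2": {"A": 140, "B": 130}}},
    {"more_likely": "B", "mdl": {"Lang1": {"A": 200, "B": 180},
                                 "Lang2": {"A": 170, "B": 190}}},
    {"more_likely": "A", "mdl": {"Lang1": {"A": 90,  "B": 110},
                                 "Lang2": {"A": 95,  "B": 100}}},
]

def agreement_rate(language):
    """Fraction of cases where the language's MDL ordering matches the
    true likelihood ordering."""
    hits = 0
    for case in CASES:
        lengths = case["mdl"][language]
        predicted = "A" if lengths["A"] < lengths["B"] else "B"
        hits += (predicted == case["more_likely"])
    return hits / len(CASES)

for lang in ("Lang1", "Lang2"):
    print(lang, agreement_rate(lang))   # Lang1 1.0, Lang2 ~0.33 on this toy data
```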