I read it as: “why would the simplicity an idea had in one form (code) necessarily correspond to simplicity when it is in another form (English)? Or more generally: why would the complexity of an idea stay roughly the same when the idea is expressed through different abstraction layers?”
I think that the argument about emulating one Turing machine with another is the best you’re going to get in full generality. You’re right that we have no guarantee that the explanation that looks simplest to a human will also look the simplest to a newly-initialized SI, because the ‘constant factor’ needed to specify that human could be very large.
I do think it’s meaningful that there is at most a constant difference between different versions of Solomonoff induction (including “human-SI”). This is because of what happens as the two versions update on incoming data: they will necessarily converge in their predictions, differing at most on a constant number of predictions.
So while SI and humans might have very different notions of simplicity at first, they will eventually come to have the same notion, after they see enough data from the world. If an emulation of a human takes X bits to specify, it means a human can beat SI at binary predictions at most X times (roughly) on a given task before SI wises up. For domains with lots of data, such as sensory prediction, this means you should expect SI to converge to giving answers as good as humans relatively quickly, even if the overhead is quite large*.
Our estimates for the data requirements to store a mind are like 10^20 bits
The quantity that matters is how many bits it takes to specify the mind, not store it (storage is free for SI, just like computation time). For the human brain this shouldn’t be too much more than the length of the human genome, about 3.3 GB. Of course, getting your human brain to understand English and have common sense could take a lot more than that.
*Although, those relatively few times when the predictions differ could cause problems. This is an ongoing area of research.
I think that the argument about emulating one Turing machine with another is the best you’re going to get in full generality.
In that case I especially don’t think that argument answers the question in OP.
I’ve left some details in another reply about why I think the constant overhead argument is flawed.
So while SI and humans might have very different notions of simplicity at first, they will eventually come to have the same notion, after they see enough data from the world.
I don’t think this is true. I do agree some conclusions would be converged on by both systems (SI and humans), but I don’t think simplicity needs to be one of them.
If an emulation of a human takes X bits to specify, it means a human can beat SI at binary predictions at most X times (roughly) on a given task before SI wises up.
Uhh, I don’t follow this. Could you explain or link to an explanation please?
The quantity that matters is how many bits it takes to specify the mind, not store it (storage is free for SI, just like computation time).
I don’t think that applies here. I think that data is part of the program.
For the human brain this shouldn’t be too much more than the length of the human genome, about 3.3 GB.
You would have to raise the program like a human child in that case^1. Can you really make the case you’re predicting something or creating new knowledge via SI if you have to spend (the equiv. of) 20 human years to get it to a useful state?
How would you ask multiple questions? Practically, you’d save the state and load that state in a new SI machine (or whatever). This means the data is part of the program.
Moreover, if you did have to raise the program like any other newborn, you have to use some non-SI process to create all the knowledge in that system (because people don’t use SI, or if they do use SI, they have other system(s) too).
1: at least in terms of knowledge; though if you used the complete human genome arguably you’d need to simulate a mother and other ppl too, but they have to be good simulations after the first few years, which is a regressive problem. So it’s probably easier to instantiate it in a body and raise it like a person b/c human people are already suitable. You also need to worry about it becoming mistaken (intuitively one disagrees with most people on most things we’d use an SI program for).
Uhh, I don’t follow this. Could you explain or link to an explanation please?
Intuitive explanation: Say it takes X bits to specify a human, and that the human knows how to correctly predict whatever sequence we’re applying SI to. SI has to find the human among the 2^X programs of length X. Say SI is trying to predict the next bit. There will be some fraction of those 2^X programs that predict it will be 0, and some fraction predicting 1. These fractions define SI’s probabilities for what the next bit will be. Imagine the next bit will be 0. Then SI is predicting badly if more than half of those programs predict a 1. But then, all those programs will be eliminated in the update phase, so each bad prediction removes more than half of the programs still in play. Starting from 2^X programs, this can happen at most X times before most of the weight of SI is on the human hypothesis (or a hypothesis that’s just as good at predicting the sequence in question).
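To make the counting argument concrete, here is a minimal Python sketch of the simplified picture above. It is not real SI: it assumes a finite pool of 2^X equally weighted "programs", each just a fixed bit sequence, and predicts by unweighted majority vote. The names `majority_vote_mistakes`, `pool`, and `human` are made up for illustration.

```python
import random

def majority_vote_mistakes(hypotheses, truth):
    """Predict each bit of `truth` by majority vote over the surviving
    hypotheses, then discard every hypothesis that got that bit wrong.
    Returns (number of wrong predictions, number of survivors)."""
    alive = list(hypotheses)
    mistakes = 0
    for t, actual in enumerate(truth):
        ones = sum(h[t] for h in alive)
        # Guess 1 only on a strict majority; ties go to 0.
        guess = 1 if 2 * ones > len(alive) else 0
        if guess != actual:
            mistakes += 1  # a wrong guess means at least half the pool is about to be dropped
        alive = [h for h in alive if h[t] == actual]
    return mistakes, len(alive)

rng = random.Random(0)
X, T = 8, 40  # toy sizes: 2^X hypotheses, sequence of length T
pool = [tuple(rng.randint(0, 1) for _ in range(T)) for _ in range(2 ** X)]
human = pool[rng.randrange(len(pool))]  # the hypothesis that is always right

mistakes, survivors = majority_vote_mistakes(pool, human)
print(f"mistakes={mistakes} (bound X={X}), survivors={survivors}")
```

Because the "human" sequence is in the pool and never eliminated, every wrong vote at least halves the pool, so the mistake count cannot exceed X no matter how long the sequence is.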
The above is a sketch, not quite how SI really works. Rigorous bounds can be found here, in particular the bottom of page 979 (“we observe that Theorem 2 implies the number of errors of the universal predictor is finite if the number of errors of the informed prior is finite...”). In the case where the number of errors is not finite, the universal and informed prior still have the same asymptotic rate of growth of error (error of universal prior is in big-O class of error of informed prior).
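For reference, the usual dominance argument behind bounds of this kind can be written out as below. This is the standard textbook derivation for a Bayesian mixture, not a restatement of the cited Theorem 2, and the symbols are my own notation: $\mu$ is the predictor that encodes the human, $w_\nu$ is the prior weight of hypothesis $\nu$, and $\xi$ is the mixture.

```latex
\[
  \xi(x_{1:n}) \;=\; \sum_{\nu} w_\nu \,\nu(x_{1:n})
  \;\ge\; w_\mu\,\mu(x_{1:n}),
  \qquad w_\mu = 2^{-X},
\]
\[
  \text{so}\qquad
  -\log_2 \xi(x_{1:n}) \;\le\; -\log_2 \mu(x_{1:n}) + X
  \quad\text{for every } n .
\]
```

In words: measured in cumulative log-loss, the mixture never falls more than X bits behind the human-encoding predictor, however long the sequence gets.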
I don’t think this is true. I do agree some conclusions would be converged on by both systems (SI and humans), but I don’t think simplicity needs to be one of them.
When I say the ‘sense of simplicity of SI’, I use ‘simple program’ to mean the programs that SI gives the highest weight to in its predictions (these will by definition be the shortest programs that haven’t been ruled out by data). The above results imply that, if humans use their own sense of simplicity to predict things, and their predictions do well at a given task, SI will be able to learn their sense of simplicity after a bounded number of errors.
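A toy illustration of "the shortest programs that haven’t been ruled out by data", assuming a hand-picked list of three deterministic hypotheses with made-up lengths and a 2^-length prior (all names and numbers here are hypothetical, chosen only to show the mechanics):

```python
from fractions import Fraction

def posterior(hypotheses, observed):
    """hypotheses: list of (name, length_in_bits, predicted_sequence).
    Keep only hypotheses consistent with `observed`, weight each by
    2^-length, and normalise."""
    weights = {}
    for name, length, seq in hypotheses:
        if seq[:len(observed)] == observed:
            weights[name] = Fraction(1, 2 ** length)
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

toy = [
    ("all_zeros", 2, "00000000"),   # shortest, so favoured before any data
    ("alternate", 3, "01010101"),
    ("human_like", 6, "01011010"),  # longest, but matches the data below
]

before = posterior(toy, "")
after = posterior(toy, "01011")
print(before)
print(after)
```

Before any data the shortest hypothesis carries the most weight; once the shorter ones are contradicted, the longer "human_like" hypothesis becomes the simplest one left, which is the sense in which the mixture "learns" a different notion of simplicity.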
How would you ask multiple questions? Practically, you’d save the state and load that state in a new SI machine (or whatever). This means the data is part of the program.
I think you can input multiple questions by just feeding a sequence of question/answer pairs. Actually getting SI to act like a question-answering oracle is going to involve various implementation details. The above arguments are just meant to establish that SI won’t do much worse than humans at sequence prediction (of any type). So, to the extent that we use simplicity to attempt to predict things, SI will “learn” that sense after at most a finite number of mistakes (in particular, it won’t do any *worse* than ‘human-SI’: hypotheses ranked by the shortness of their English description, then fed to a human predictor).