The relevant argument is the equivalence of SI on different universal Turing machines, up to a constant. Briefly: if we have a short program on machine M1 (e.g. Python), then in the worst case we can write an equivalent program on M2 (e.g. LISP) by writing an M1-simulator and then using the M1-program (e.g. writing a Python interpreter in LISP and then using the Python program). The key thing to notice here is that the M1-simulator may be long, but its length is completely independent of what we’re predicting—thus, the M2-Kolmogorov-complexity of a string is at most the M1-Kolmogorov-complexity plus a constant (where the constant is the length of the M1-simulator program).
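In symbols (a standard statement of this invariance argument; the notation is not from the comment itself—K_M(x) is the length of the shortest M-program that outputs x):

```latex
% Invariance of Kolmogorov complexity across universal machines:
% the cost of switching machines is one fixed interpreter, independent of x.
\[
  K_{M_2}(x) \;\le\; K_{M_1}(x) + c_{M_1 \to M_2},
  \qquad c_{M_1 \to M_2} = \text{length of the $M_1$-interpreter written for $M_2$}.
\]
```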
Applied to English: we could simulate an English-speaking human. This would be a lot more complicated than a Python interpreter, but the program length would still be constant with respect to the prediction task. Given the English sentence, the simulated human should then be able to predict anything a physical human could predict given the same English sentence. Thus, if something has a short English description, then there exists a short (up to a constant) code description which contains all the same information (i.e. can be used to predict all the same things).
Two gotchas to emphasize here:
The constant is big—it includes everything an English-speaking human knows, from what-trees-look-like to how-to-drive-a-car. All the hidden complexity of individual words is in that constant (or at least the hidden complexity that a human knows; things a human doesn’t know wouldn’t be in there).
The English sentence is a “program” (or part of a program), not data to be predicted; whatever we’re predicting is separate from the English sentence. (This is implicit in the OP, but somebody will likely be confused by it.)
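To make the “sentence is part of the program, not the data” framing concrete, here is a minimal Python-flavored sketch. The `simulate_english_speaker` function is hypothetical—a stand-in for the huge-but-fixed constant, not a proposal for how to actually build such a simulator:

```python
# Hypothetical sketch (not from the original comment): the human simulator is a
# fixed, enormous constant, and the English sentence is part of the program,
# not part of the data to be predicted.

def simulate_english_speaker(sentence: str):
    """Stand-in for the huge-but-fixed human simulator.

    Its own length does not depend on which sentence we feed it, so it only
    contributes a constant to the total program length.
    """
    def predict(observations_so_far: bytes) -> float:
        # ...everything an English-speaking human knows would live here...
        raise NotImplementedError("placeholder for the big constant")

    return predict


# The "program" in the Solomonoff-induction sense is roughly:
#   len(simulator code) + len(sentence)
sentence = "All ravens are black."      # a short English description
predictor = simulate_english_speaker(sentence)
# predictor(data) would then be used for prediction; the data being predicted
# is separate from the sentence itself (gotcha #2 above).
```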
the M1-simulator may be long, but its length is completely independent of what we’re predicting—thus, the M2-Kolmogorov-complexity of a string is at most the M1-Kolmogorov-complexity plus a constant (where the constant is the length of the M1-simulator program).
I agree with this, but I don’t think it answers the question. (i.e. it’s not a relevant argument^([1]))
Given the English sentence, the simulated human should then be able to predict anything a physical human could predict given the same English sentence.
There’s a large class of edge cases where the overhead constant is roughly as large as, or larger than, the program itself. In those cases it’s not true that simplicity carries over across layers of abstraction.
That edge case means this doesn’t follow:
Thus, if something has a short English description, then there exists a short (up to a constant) code description
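A rough back-of-the-envelope version of that worry; the orders of magnitude below are made up for illustration, not taken from the thread:

```latex
% If the human-simulator constant dwarfs the M1-program, the invariance bound
% is technically true but says almost nothing about any single short description.
\[
  K_{M_2}(x) \;\le\;
  \underbrace{K_{M_1}(x)}_{\approx 10^{3}\ \text{bits}}
  + \underbrace{c}_{\approx 10^{9}\ \text{bits}}
  \;\approx\; c .
\]
```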
[1]: Edit: it could be relevant but not the whole story; in that case, though, it’s missing a sizable chunk.
The solution to the “large overhead” problem is to amortize the cost of the human simulation over a large number of English sentences and predictions. We only need to specify the simulation once, and then we can use it for any number of prediction problems in conjunction with any number of sentences. A short English sentence then adds only a small amount of marginal complexity to the program—i.e. adding one more sentence (and corresponding predictions) only adds a short string to the program.
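Roughly, in the same notation as the invariance sketch above (and glossing over small delimiter/encoding overheads, which is an assumption of this sketch):

```latex
% Pay the human-simulator constant once; each additional English sentence s_i
% then adds only about its own length to the total program.
\[
  \text{total program length} \;\lesssim\; c_{\text{human}} + \sum_{i=1}^{n} |s_i|,
  \qquad \text{marginal cost of } s_{n+1} \;\approx\; |s_{n+1}| .
\]
```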
The solution to the “large overhead” problem is to amortize the cost of the human simulation over a large number of English sentences and predictions.
That seems a fair approach in general (i.e. how can we use the program efficiently/profitably), but I don’t think it answers the question in the OP. I think it actually implies the opposite effect: as you go through more layers of abstraction you get more and more complexity (i.e. simplicity doesn’t hold across layers of abstraction). That’s why the strategy you mention needs to be amortized over ever larger problem spaces to make sense.
So this would still mean most of our reasoning about Occam’s Razor wouldn’t apply to SI.
A short English sentence then adds only a small amount of marginal complexity to the program—i.e. adding one more sentence (and corresponding predictions) only adds a short string to the program.
I’m not sure we (humanity) know enough to claim that only a short string needs to be added. I think GPT-3 hints at a counterexample, because GPT has been growing geometrically.
Moreover, I don’t think we have any programs, or ideas for programs, that are anywhere near sophisticated enough to answer meaningful questions—unless they just regurgitate an answer. So we don’t have a good reason to claim to know what we’ll need to add to extend your solution to handle more and more cases (especially increasingly technical/sophisticated ones).
Intuitively I think there is (physically) a way to do something like what you describe efficiently, because humans are an example of this—we have no known limit on understanding new ideas. However, it’s not okay to use this as a hypothetical SI program, because such a program does other things we don’t know how to do with SI programs (like taking into account itself, other actors, and the universe broadly).
If the hypothetical program does stuff we don’t understand and we also don’t understand its data encoding methods, then I don’t think we can make claims about how much data we’d need to add.
I think it’s reasonable, and intuitive, that there would be no upper limit on the amount of data we’d need to add to such a program as we input increasingly sophisticated questions (and this goes for both people and the hypothetical programs you mention).