Very apposite, thanks!
A lot of your impressions seem to go something like “the workshop was useful because it made me think about X, and that’s more important than the specific answers/techniques it gave for X”. Lately, I’ve been noticing more and more examples of this around me. A particular book would offer a frankly poor argument in favor of Y, but I’d still recommend it to my friends because reading it makes you think about Y and reach your own conclusion. An online community centered around boosting Z in your life may be somewhat cultish and prone to pseudo-scientific explanations about why more Z is awesome, but it’s still worth reading their FAQs because you hadn’t even thought of Z as something that might be adjusted.
This is one of my favorite hammers now, and it finds nails everywhere. So much advice turns out to be helpful indirectly, because it makes you reflect carefully on its domain. The actual direct value of the advice may be almost irrelevant, be it good or bad: the indirect contribution is much greater anyway.
I read in my native language without subvocalizing, and in English with subvocalizing. I can make an effort and read w/o subvocalization in English, but then I get an unpleasant feeling that I’m reading in a very shallow way, understanding and retaining much less than usual. I don’t know for certain that this feeling is actually correct, but the evidence leans that way.
Some of your fake numbers fall out of the common practice of shoehorning a partial order into the number line. Suppose you have some quality Foo by which you can compare things, in a somewhat well-defined manner, at least in some cases. For example, Foo = desirable: X is definitely more desirable to me than Y, or Foo = complex: software project A is definitely more complex than software project B. It’s not necessarily the case that any X and Y are comparable. It’s then tempting to invent a numerical notion of Foo-ness, and assign numerical values of Foo-ness to all things in such a way that your intuitive Foo-wise comparisons hold. The values turn out to be essentially arbitrary on their own; only their relative order is important.
(In mathematical terms, you have a finite (in practice) partially ordered set, which you can always order-embed into the integers; if the set is dynamically growing, it’s even more convenient to order-embed it into the reals, so you can always find intermediate values between existing ones.)
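For concreteness, here is a minimal Python sketch of that embedding; the project names and the “definitely less complex than” relation are invented for illustration. A topological sort hands every element an integer rank that respects all the original comparisons, while incomparable elements end up in some arbitrary relative order:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Invented example: a partial order on software projects by "complexity".
# An entry lo -> {hi, ...} means "lo is definitely less complex than hi".
# "web_ui" and "batch_jobs" are deliberately left incomparable.
less_complex_than = {
    "web_ui": {"compiler"},
    "batch_jobs": {"compiler"},
    "compiler": {"kernel"},
    "kernel": set(),
}

# graphlib expects node -> predecessors (everything that must come earlier).
predecessors = {node: set() for node in less_complex_than}
for lo, highers in less_complex_than.items():
    for hi in highers:
        predecessors[hi].add(lo)

ranks = {node: i for i, node in enumerate(TopologicalSorter(predecessors).static_order())}
print(ranks)
# Every original comparison is preserved (lower rank = less complex), but the
# incomparable pair web_ui / batch_jobs now gets some arbitrary relative order.
```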
After this process, you end up with a linear order, so any X and Y are always comparable. It’s easy to forget that this may not have been the case in your intuition when you started out, because these new comparisons do not contradict any older comparisons that you held. If you had no firm opinion on comparative utility of eating ice-cream versus solving a crossword, it doesn’t seem a huge travesty that both activities now get a specific utility value and one of them outranks the other.
The advantages of this process are that Foo-ness is now a specific thing that can be estimated, used in calculations, reasoned about more flexibly, etc. The disadvantages, as you describe, are that the numbers are “fake”, they’re really just psychologically convenient markers in a linear order; and the enforced linearity may mislead you into oversimplifying the phenomenon and failing to investigate why the real partial order is the way it is.
The answer is “probably not”. Cormen is too comprehensive and dry for self-study; it’s best used as the textbook to back an algorithms course or as a reference to consult later on.
A very good book is Skiena, The Algorithm Design Manual. I usually recommend it to people who want to brush up on their algorithms before programming interviews, but I think it’s accessible enough to a novice as well. Its strengths are an intelligent selection of topics and an emphasis on teaching how to select an algorithm in a real-life situation.
Out of curiosity, did you consider sending this comment via PM, and if so, what made you decide to post it publicly?
Sure, if that’s the way you like it, but for me that just doesn’t work. Occam’s Razor is a principle that is supposed to help me think better here and now; to decide that its justification rests on whether, say, the Universe is discrete at Planck scale or not, when this choice has to be compatible with QM and Newtonian mechanics at larger scales and therefore doesn’t change anything in practical terms in my life here and now—that seems absurd. To me, it’s clear evidence that this is no justification at all.
I don’t understand the question. I thought that’s what we were talking about. Am I missing something?
To be more explicit: setting up a UTM with random bits on the input tape is a natural-seeming way of getting the probability distribution over programs (2^-length(p)) that goes into the Solomonoff prior. But as I’m trying to say in the comment you replied to, I don’t think it’s really natural at all. And of course SI doesn’t need this particular distribution in order to be effective at its job.
You can. But is there any reason to think that this models well the inherent complexity of programs? Do we ever execute programs by choosing randomly bits that constitute them? We have algorithms that utilize randomness, to be sure, but UTMs are not, normally, such algorithms. I appreciate that “choosing program bits randomly” is a simple, short, succinct way of getting to 2^-length(p), but I don’t see a reason to think it’s natural in any interesting sense.
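For what it’s worth, here is roughly what the “random bits on the input tape” picture amounts to, as a toy Python sketch; the four bit strings standing in for programs are my own invention, chosen to be prefix-free so the machine knows when to stop reading:

```python
import random
from collections import Counter

# Invented prefix-free "programs": no string is a prefix of another,
# so a stream of random bits eventually singles out exactly one of them.
PROGRAMS = {"0", "10", "110", "111"}

def run_on_random_tape(rng):
    """Read random bits until they spell out one of the programs."""
    bits = ""
    while bits not in PROGRAMS:
        bits += rng.choice("01")
    return bits

rng = random.Random(0)
counts = Counter(run_on_random_tape(rng) for _ in range(100_000))
for p in sorted(PROGRAMS, key=len):
    print(p, round(counts[p] / 100_000, 3), "vs 2^-length =", 2.0 ** -len(p))
# Each program is hit with frequency close to 2^-length(p); that's the whole
# content of the "randomly chosen program bits" story, nothing deeper.
```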
I didn’t miss your point; instead, I tried to explain your point to you, and perhaps didn’t do a very good job of it. You’re confusing two different things, which are both spelled out on page 1120 of the paper you referenced, but in the comment you linked me to you’ve only recounted the first one, perhaps thinking that the second one is the same thing, which it is not.
The first thing is setting up probabilities over programs of a given fixed length L, so that padding lets us take shorter programs into account as well. In the paper, this is the paragraph starting “This sum is still only over programs of length ℓ...”. This is what you talked about in your other comment, and this is what I talked about in the first paragraph of my reply above.
The second thing is a way of setting up probabilities over programs of any length, without bound, using a UTM-with-random-noise model as justification for the formula. This is an inherently different way of assigning probabilities, where in particular we do not use padding but, on the contrary, take into account only minimal programs. This is what the paper talks about in the next two paragraphs, the second of which, starting “This is a highly technical explanation...”, explains the UTM-with-random-noise model. I talk about this second thing in the third paragraph of my reply above.
Hopefully I made it clearer. If you still think I missed your point, let’s leave it at that.
Err, no, I wouldn’t call it a justification. It’s more like a technical setup of why shorter programs turn out to be more likely (because the padding happens to include them in this way). This is not a profound thing. It’s just one possible way to get where you want to go, which is to assign 2^-length(p) to p. Note that it’s brittle in the sense that you need support from your formalism for it: your notion of what a program is has to allow adding arbitrary bits to the end without affecting its validity or what it does. That’s not a very interesting thing to do: why would you even consider all those programs with dummy padding as real programs, given that whatever’s running them can probably easily rule them out? E.g. modern compilers will usually reject your program if you add some junk at the end.
If you restrict yourself to a finite program length, you could still count only minimal programs, those w/o padding, and give them appropriate weights (the way they do in the paper immediately afterwards for the infinite case, the standard way to set it up), and end up with the same thing. There’s no profound reason I can think of to say “oh, for a fixed bound L on program length we’re only going to look at programs of length exactly L, including padded programs, but for unrestricted length we’re throwing out padded programs and only looking at minimal ones”. The real reason is that you want everything to sum up to 1, and you want to prefer short programs while doing so.
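To spell out the bookkeeping (with L the length bound and ℓ the length of a minimal program p, following the setup described above): under the uniform distribution on the 2^L strings of length L, the minimal program p together with all of its padded variants carries weight

$$
\frac{2^{\,L-\ell}}{2^{L}} \;=\; 2^{-\ell},
$$

which is exactly the weight you would have given p directly while ignoring the padded strings altogether; both ways of counting land on the same 2^-ℓ.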
Now, something closer to a justification is when they point out right afterwards that 2^-length(p) is what you get if you randomly choose program bits one by one. But as justifications go, I think it’s a pretty weak one, because no one really computes anything in any way that looks like that. And more importantly, as I said above, it just doesn’t matter for SI prediction. You don’t have to use 2^-length(p) for SI prediction to work. You just need to have it all sum up to 1.
Yep, you’re confused. You can’t possibly assign equal probability to every program in an infinite space, because then the sum of all probabilities will diverge (go to infinity), and you need it to sum up to 1. What you do is assign probabilities in a way that the infinite sum will converge to 1, for example by lining all programs up in an infinite list: M1, M2, M3… and assigning probabilities 1⁄2, 1⁄4, 1⁄8, 1⁄16, 1⁄32 and so on to infinity. That’s essentially what the universal prior does, using a length-sorted infinite list of all possible programs (I’m skipping a few small technical difficulties).
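A two-line Python illustration of why the probabilities have to decay (the enumeration M1, M2, M3, ... is arbitrary here):

```python
# Give the i-th program on the list probability 2^-i: the partial sums
# approach 1, so this is a legitimate prior over infinitely many programs.
print(sum(2.0 ** -i for i in range(1, 60)))  # ~1.0

# A uniform assignment is impossible: any fixed eps > 0 given to every
# program pushes the total past 1 once there are more than 1/eps programs.
```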
The simpler predictions are typically more likely because of the one shortest program that generates them, not the many longer programs that also do—they do help, but because of the exponential decay they usually don’t amount to much. The shortest program tends to dominate.
Except I just lied a little in what I said. SI prediction doesn’t care if your program is short or long. It cares how early it comes on the infinite list M1,M2,M3… of all programs. When you arrange this list by program length, “earliest in the list” and “shortest” are basically the same. And in the usual infinite case, in a sense you cannot NOT arrange the list by program length. You can give your favourite very long programs some of the first spots on the list, and so high probabilities, but there’s an infinity of long programs out there, and you can’t nudge the infinity to where you want to go. But in the finite thought experiment, you’re free to arrange your list any way you want (and you don’t even need to have probabilities go down exponentially either). So you can just piss on Occam’s Razor and strictly reverse it by having longer programs always be more likely than shorter ones, if you feel like it. And SI prediction will still work.
Because the prediction itself doesn’t use the shortness of programs in any essential way, and the shortness only matters because of the unavoidable nature of the infinite formalism.
A thought experiment: suppose we’ve determined that the Universe is finite (possible) and worked out an upper bound M on its informational capacity, so that we know programs longer than M will never be run in this universe. We might want to rework SI using a very large finite probability space of programs rather than an infinite one. Given a valid prior distribution over this finite space, SI itself, as a prediction procedure, will continue to work, and its promise of correctness will still be there in an updated fashion. But for SI to work there will be no reason whatsoever to assign greater probabilities to shorter programs in this large finite space. A uniform distribution will work just as well, or even giving really large weight to the longest programs and systematically biasing against shorter ones. It will be all the same to SI. Clearly, in this thought experiment, it doesn’t have anything to do with Occam’s Razor. The connection, then, is forced on us by the mere fact that we’re using an infinite set of finite-length algorithms, and by nothing else: nothing about SI itself, about what we predict or how we do it.
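Here is a toy Python sketch of that thought experiment; the “programs” are just four hand-written sequence generators, and the prior is deliberately arranged to favor the two “bigger” hypotheses, i.e. the opposite of an Occam bias. Prediction by plain Bayesian updating over the finite space works regardless:

```python
# Toy finite hypothesis space standing in for "all programs up to length M".
HYPOTHESES = {
    "all_zeros":   lambda n: [0] * n,
    "alternating": lambda n: [i % 2 for i in range(n)],
    "all_ones":    lambda n: [1] * n,
    "period_3":    lambda n: [1 if i % 3 == 0 else 0 for i in range(n)],
}

# An arbitrary, anti-Occam prior: nothing to do with length or simplicity.
PRIOR = {"all_zeros": 0.05, "alternating": 0.05, "all_ones": 0.45, "period_3": 0.45}

def predict_next(observed):
    """Posterior-weighted probability that the next bit is 1."""
    n = len(observed)
    consistent = {h: p for h, p in PRIOR.items() if HYPOTHESES[h](n) == observed}
    total = sum(consistent.values())
    return sum(p for h, p in consistent.items() if HYPOTHESES[h](n + 1)[n] == 1) / total

data = HYPOTHESES["alternating"](8)  # true source: 0, 1, 0, 1, ...
for k in range(1, 8):
    print(k, predict_next(data[:k]))
# After a couple of bits only "alternating" survives and the predictions are
# exact, even though the prior gave it almost no weight to begin with.
```

Swap in a uniform prior, or one that favors the short generators, and (in this toy) the printed predictions after the first couple of bits are identical; the prior only changes how quickly the wrong hypotheses get washed out.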
Yep, that’s correct. Basically, short programs suffer from there only being a finite supply of them. The actual distribution of probabilities to programs doesn’t matter that much.
That’s why I think SI doesn’t work very well as justification/support for Occam’s Razor.
for every Solomonoff-like prior, do shorter programs almost always have higher probability than longer ones?
No, but something weaker is true. Take any probability P, e.g. 1/1000. No more than 1/P (e.g. 1000) programs can carry this or higher probability, and they’re all short in the trivial sense of being shorter than some fixed length L, which depends only on P and your formalism. So in a sense, the likely programs are all short, and all the really long programs are really unlikely.
The one concept from Nietzsche I see everywhere around me in the world is ressentiment. I think much of the master-slave morality stuff was too specific and now feels dated 130 years later, but ressentiment is the important core that’s still true and going to stay with us for a while; it’s like a powerful drug that won’t let humanity go. Ideological convictions and interactions, myths and movements, all tied up with ressentiment or even entirely based on it. And you’re right, I would have everyone read Nietzsche—not for practical advice or predictions, but to be able, hopefully, to understand and detect this illness in others and especially oneself.
I know, I was joking. And it was a good opportunity to link to this (genuinely interesting) paper.
… well, mostly joking. There’s a kernel of truth there. “There are no photons” says more than just banning a word. “Wavepackets of light” don’t exist either. There’s just the electromagnetic field, its intensity changes with time, and the change propagates in space. Looking at it like this may help understand the other responses to the question (which are all correct).
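For reference, the standard textbook statement behind that sentence (nothing specific to this thread): in vacuum each field component obeys the wave equation

$$
\frac{\partial^{2} \mathbf{E}}{\partial t^{2}} \;=\; c^{2}\,\nabla^{2}\mathbf{E},
$$

so what exists is a field value at every point, and what “travels” is a pattern of change in those values, propagating at c.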
When you think of a photon as a particle flying in space, it’s hard to shake off the feeling that you somehow ought to be able to attach yourself to it and come along for the ride, or to imagine how the particle itself “feels” about its existence, how its inner time passes. And then the answer that for a photon, time doesn’t pass at all, feels weird and counter-intuitive. If you tell yourself there’s no particle, just a bunch of numbers everywhere in space (expressing the EM field) and a slight change in those numbers travels down the line, it may be easier to process. A change is not an object to strap yourself to. It doesn’t have “inner time”.
The way I understand it, academic CS reached a state of affairs where peer-reviewed journal publications (almost) don’t matter, and conferences are the primary vehicles for communicating research and acquiring status. Do you have an opinion on this state of affairs? Is it harmful? Beneficial? Why did CS converge on it, and not, say, math/physics?
To an outsider like me, it just seems so weird.
What if lesswrong.org hosted individual users’ blogs? They would live on the user profile page, as a separate tab (perhaps the default one). That (and an RSS feed) would be the primary way to get to them. The public homepage would not aggregate from them, and Main and Discussion would remain as they are.
Has this been discussed before? Pros/cons? Would you use this mechanism if it were available?
(technically, under the hood, they’d be easy to implement as just separate individual subreddits, I guess)
I also read it based on your recommendation (I think—don’t remember clearly) and I really, really liked it. The near-future science is overwhelmingly convincing, in a good way. What’s funny is that I thought the characters were pretty shallow and the hero’s constantly peppy attitude not believable and somewhat grating; usually the quality of characters and their development is a must for me, and their shallowness ruins any book. Somehow that didn’t happen here. There was just so much of this juicy, mind-opening, fascinating, engaging sciency stuff that it kept me on the edge of my chair. I’m really glad I read this book—thanks!