If this post doesn’t get answered, I’ll repost in the next open thread; consider it a test of whether more frequent threads are actually necessary.
I’m trying to construct a prior probability mass function for the length of a binary string, and then generalize it to strings over an alphabet of any size. I’m struggling to find one that has the right properties under the log-odds transformation while still obeying the laws of probability. The one I like most is P(len = x) = 1/(x+2), since under log-odds it takes about log(x)+1 bits of evidence to bring a length of x up to even odds. For a length of 15 (1111 in binary), all 4 bits are used, so its cost is 4 bits.
The problem is that the sum of 1/(x+2) over all lengths diverges, which makes it an improper prior. Are there restrictions under which I can still use this improper prior, or is there a proper prior with similarly desirable qualities?
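For concreteness, a quick sanity check of both the even-odds claim and the divergence (base-2 logs; the helper names are just illustrative):

```python
import math

def prior(x):
    """Proposed prior mass for a string of length x."""
    return 1.0 / (x + 2)

def bits_to_even_odds(x):
    """Evidence, in bits, needed to bring length x from its prior odds up to 1:1."""
    p = prior(x)
    return math.log2((1 - p) / p)           # equals log2(x + 1)

print(bits_to_even_odds(15))                # 4.0, matching the 1111 example

# The normalization sum diverges, which is what makes the prior improper:
print(sum(prior(x) for x in range(10**6)))  # ~13.1 and still growing like log
```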
Here is a different answer to your question, hopefully a better one.
It is no coincidence that the prior requiring log(x)+1 bits of evidence for length x does not converge. The reason is that you cannot specify, in only log(x)+1 bits, that a string has length x. Standard methods of specifying string length have various drawbacks, and each corresponds to a different prior distribution in a natural way. (I will assume 32-bit words and measure length in words, but you can measure length in bits if you like.)
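One way to make that impossibility concrete (my framing, via the Kraft inequality, rather than anything stated above): a prefix-free code that spent only log(x)+1 bits on length x would need its Kraft sum to stay at or below 1, and it cannot.

```python
import math

# Hypothetical code lengths of log2(x) + 1 bits for each length x >= 1.
# A prefix-free code with these lengths would need sum(2 ** -length) <= 1,
# but the sum is 1/(2*1) + 1/(2*2) + ..., i.e. half the harmonic series.
kraft_partial = sum(2 ** -(math.log2(x) + 1) for x in range(1, 10**6))
print(kraft_partial)   # ~7.2 already, far above 1, so no such prefix code exists
```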
Suppose you have a length-prefixed string. Then you pay 32 bits to encode the length, but the length can be at most 2^32-1. This corresponds to the uniform distribution that assigns every length between 0 and 2^32-1 equal probability. (We derive this distribution by supposing that every bit doing length-encoding duty is independently equally likely to be 0 or 1.)
Suppose you have a null-terminated string. Then you are paying a hidden linear cost: the 0 word is reserved for the terminator, so you have only 2^32-1 words to use in your message, which means you only convey log(2^32-1) bits of information per 32 bits of message. The natural distribution here is one in which every bit conveys maximal information, so each word has a 1 in 2^32 chance of being the terminator, and so the length of your string is Geometric with parameter 1/2^32.
A common scheme for big-integer types is to have a flag bit in every word that is 1 if another word follows, and 0 otherwise. This is very similar to the null-terminator scheme, and in fact the natural distribution here is also Geometric, but with parameter 1/2 because each flag bit has a probability of 1/2 of being set to 0, if chosen randomly.
If you are encoding truly enormous strings, you could use a length-prefixed string in which the length is a big integer. This is much more efficient and the natural distribution here is also much more heavy-tailed: it is something like a smoothed-out version of 2^(32 Geometric(1/2)). We have come closer to encoding a length of x in log(x) bits, but it’s more like C log(x) for some constant C. (The constant is actually 32/31, since for every 31 bits of length, we have 1 bit of “length of length”.)
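A quick check of that 32/31 constant, assuming 31 payload bits plus 1 continuation bit per word of the length field (a sketch of the layout, for illustration):

```python
import math

def length_field_bits(n):
    """Bits spent encoding a length of n as a flag-bit big integer:
    31 payload bits plus 1 continuation bit per 32-bit word."""
    if n == 0:
        return 32                        # a single word with the flag cleared
    words = math.ceil(n.bit_length() / 31)
    return 32 * words

for n in (10**6, 10**18, 10**300):
    # cost is (32/31) * log2(n), plus at most one extra word of rounding
    print(length_field_bits(n), round((32 / 31) * math.log2(n), 1))
```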
If we iterate this, we can produce even more efficient schemes, but log(x)+1 is unreachable.
(What I have been calling the natural distribution to use for each of these schemes is something like a max-entropy distribution. The way I am defining it is to assume an infinite sequence of random bits, let the scheme read this infinite sequence until it decides that it’s reached the end of the string, and take that string’s length.)
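That definition is easy to turn into a sampler. A minimal sketch, with the stopping rules as simplified stand-ins for the schemes above rather than any particular wire format:

```python
import random

def natural_length(stops, rng, max_words=10**6):
    """Feed a stopping rule an endless stream of uniformly random 32-bit words
    and return how many words it reads before deciding the string has ended."""
    for i in range(max_words):
        if stops(rng.getrandbits(32)):
            return i
    return max_words

# Null-terminated: stop at the first all-zero word. The number of preceding words
# is Geometric(1/2**32) -- far too long-tailed to simulate, so it is only defined here.
null_terminated = lambda word: word == 0

# Flag-bit scheme: stop at the first word whose top (continuation) bit is 0.
# The number of preceding "continue" words is Geometric(1/2).
flag_bit = lambda word: (word >> 31) == 0

rng = random.Random(0)
lengths = [natural_length(flag_bit, rng) for _ in range(10_000)]
print(sum(lengths) / len(lengths))   # ~1.0, the mean of Geometric(1/2) on {0, 1, 2, ...}
```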
This was very informative, thank you.
What sort of evidence about x do you expect to update on?
The result of some built-in string function length(s), which, depending on the implementation of the string type, either returns the size stored in the header or counts up to the terminator symbol and returns that count.
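For concreteness, a toy version of the two implementations just described (both functions are hypothetical, not any particular language’s built-in):

```python
def length_prefixed_length(buf: bytes) -> int:
    """String stored with a 4-byte little-endian size header: read the header."""
    return int.from_bytes(buf[:4], "little")

def null_terminated_length(buf: bytes) -> int:
    """String stored with a terminator byte: count up to the first zero byte."""
    return buf.index(0)

print(length_prefixed_length(b"\x05\x00\x00\x00hello"))   # 5
print(null_terminated_length(b"hello\x00"))               # 5
```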
That doesn’t sound like something you’d need to do statistics on. Once you learn something about the string length, you basically just know it.
Improper priors are not useful on their own: the point of using them is that you will get a proper distribution after you update on some evidence. In your case, after you update on some evidence, you’ll just have a point distribution, so it doesn’t matter what your prior is.
Not so. I’m trying to figure out how to find the maximum-entropy distribution for simple types, and recursively defined types are part of that. This does not apply only to strings; it applies to sequences of all sorts, and I’m trying to allow for the possibility of error correction in these techniques. What is the point of doing statistics on coin flips? Once you learn something about the flip result, you basically just know it.
Well, in the coin-flip case, the thing you care about learning isn’t the value in {Heads, Tails} of a single flip, but the value in [0,1] of the underlying probability that the coin comes up heads. We can then put an improper prior on that underlying probability, with the idea that after a single coin flip, we update it to a proper distribution.
Similarly, you could define here a family of distributions of string lengths, and have a prior (improper or otherwise) about which distribution in the family you’re working with. For example, you could assume that the length of a string is distributed as a Geometric(p) variable for some unknown parameter p, and then sampling a single string gives you some evidence about what p might be.
Having an improper prior on the length of a single string, on the other hand, only makes sense if you expect to gain (and update on) partial evidence about the length of that string.
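A sketch of the kind of update being suggested (the Beta prior and the parameterization are my choices for illustration, not anything specified above): if len ~ Geometric(p) with P(len = k) = (1-p)^k · p, then a Beta(a, b) prior on p is conjugate, and observing a single string of length k updates it to Beta(a+1, b+k).

```python
def update_beta(a, b, observed_length):
    """Posterior Beta parameters on p after observing one string of the given length,
    assuming P(len = k) = (1 - p)**k * p (k "continue" symbols, then a stop)."""
    return a + 1, b + observed_length

# Start from a vague Beta(1, 1) (uniform) prior on p; even the improper Beta(0, 0)
# prior would become proper after a single observation of a nonzero length.
a, b = 1.0, 1.0
for k in (3, 7, 2):                   # hypothetical observed string lengths
    a, b = update_beta(a, b, k)

print(a / (a + b))                    # posterior mean of p, ~0.24 after lengths 3, 7, 2
```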