If the only way you use the unique data is to feed it into a hash, you might as well be using a random number. If you get different results from the hash than from randomness, you’ve broken the hash.
The randomness is not as good. Even if the randomness is different from agent to agent, we can still get collisions. If we take the unique aspect of the agent, then by definition it isn’t shared by other agents, and so we can avoid any collision with other agents:
We can then come up with an injective mapping from the agent’s number to the indexes of the array, using a hash, for example, or perhaps just taking the number as a straight index.
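For concreteness, roughly what I have in mind, as a Python sketch (the size 20 and the agent_id input are placeholders, not anything canonical; note the final mod step means this is only injective while there are no more agents than slots):

    import hashlib

    ARRAY_SIZE = 20  # stand-in for the number of choices being split

    def index_via_hash(agent_id: bytes) -> int:
        # Map the agent's unique data to an index by hashing it
        # and reducing the digest into the index range.
        digest = hashlib.sha512(agent_id).digest()
        return int.from_bytes(digest, 'big') % ARRAY_SIZE

    def index_via_number(agent_number: int) -> int:
        # Or skip the hash entirely and take the agent's number as a straight index.
        return agent_number % ARRAY_SIZE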
A hash doesn’t have to collide; it only has to have collisions (by the Pigeonhole Principle) if the end hash is ‘smaller’ than the input. If I’m using SHA512 on data that is always less than 512 bits, then I’ll never get any collisions. (Let’s assume SHA512 is a perfect hash; if this bothers you, replace ‘SHA512’ with ‘$FAVORITE_PERFECT_HASH’.) But using the hash isn’t essential: we just need a mapping from agent to index. ‘Hash’ was the first term I thought of.
(And the XOR trick lets us make use of an injective mapping regardless of whether it’s the randomness or the agent-related data that is unique; we XOR them together and get something that is unique if either was.)
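A sketch of that XOR step (the 64-byte length and the secrets call are arbitrary stand-ins for whatever identity data and RNG the agent actually has):

    import secrets

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        # XOR two equal-length byte strings.
        return bytes(x ^ y for x, y in zip(a, b))

    unique_part = b'agent-specific data, assumed unique'.ljust(64, b'\x00')
    random_part = secrets.token_bytes(64)  # the agent's own randomness
    combined = xor_bytes(unique_part, random_part)  # fed into the index mapping above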
A hash doesn’t have to collide; it only has to have collisions (by the Pigeonhole Principle) if the end hash is ‘smaller’ than the input. If I’m using SHA512 on data that is always less than 512 bits, then I’ll never get any collisions.
How do you know which aspect of the agent is unique, without prior communication? If it’s merely that agents have so many degrees of freedom that there’s a negligible probability that any two agents are identical in all aspects, then your hash output is smaller than its input. Also, you can’t use the 2^512 figure for SHA-512 unless you actually want to split a 2^512 size array. If you only have, say, 20 choices to split, then 20 is the size that counts for collision frequency, no matter what hash algorithm you use.
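To put numbers on that, a quick birthday-style calculation (assuming whatever mapping you use spreads agents uniformly over the choices):

    def collision_probability(num_agents: int, num_choices: int) -> float:
        # Chance that at least two agents end up on the same choice.
        p_all_distinct = 1.0
        for i in range(num_agents):
            p_all_distinct *= (num_choices - i) / num_choices
        return 1.0 - p_all_distinct

    collision_probability(10, 20)      # ~0.93: the 20 slots are what bites
    collision_probability(10, 2**512)  # ~0.0: the hash's own range never enters into it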
we XOR them together and get something that is unique if either was.
If hash-of-agent outputs are unique and your RNG is random, then the XOR is just random, not guaranteed-unique.
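A small counterexample: two agents with different unique values whose (perfectly legitimate) random draws happen to cancel the difference:

    u1, u2 = 0b0101, 0b0011  # two distinct 'unique' agent values
    r1 = 0b1100              # one agent's random draw
    r2 = r1 ^ u1 ^ u2        # an equally possible draw for the other agent
    assert u1 != u2 and (u1 ^ r1) == (u2 ^ r2)  # the XORs collide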
How do you know which aspect of the agent is unique, without prior communication? If it’s merely that agents have so many degrees of freedom that there’s a negligible probability that any two agents are identical in all aspects, then your hash output is smaller than its input.
If your array is size 20, say, then why not just take the first x bits of your identity (where x is just big enough that 2^x ≥ 20)? (Why ‘first’, why not ‘last’? This is another Schelling point, like the choice of injective mapping.)
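Something like this, say (the trailing mod is just one arbitrary way of handling 2^x overshooting 20; any agreed-on rule works):

    def index_from_prefix(identity: bytes, num_choices: int = 20) -> int:
        # Take just enough leading bits of the identity to cover the choices:
        # for 20 choices that's 5 bits, since 2**5 = 32 >= 20.
        bits_needed = (num_choices - 1).bit_length()
        prefix = int.from_bytes(identity, 'big') >> (len(identity) * 8 - bits_needed)
        return prefix % num_choices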
If hash-of-agent outputs are unique and your RNG is random, then the XOR is just random, not guaranteed-unique.
This is a good point; I wasn’t sure whether it was true when I was writing it, but since you’ve pointed it out, I’ll assume it is. But this doesn’t destroy my argument: you don’t do any worse by adopting this more complex strategy. You still do just as well as a random pick.
(Come to think of it: so what if you have to use the full bitstring specifying your uniqueness? You’ll still do better on problems the same size as your full bitstring, and if your mapping is good, the collisions will be as ‘random’ as the RNGs and you won’t do worse.)