Approach:
I split the problem into two parts: first, modeling how much noise will be produced by a given Who child with given presents, and second, how to optimize that value.
I declined to use the names of the Who children, since my intuition said that those shouldn’t be predictive of anything. Also, there were Who children with the same name and same ID who lived years apart, which seemed like a bug.
I tried several models (random forest, gradient boosted forest) but got the best cross-validation accuracy when I used a ridge regression with product features. I ended up using the following features:
['Age', 'BlumBlooper__Age', 'BlumBlooper', 'FumFoozler__Age', 'FumFoozler__BlumBlooper', 'FumFoozler', 'GahGinka__Age', 'GahGinka__BlumBlooper', 'GahGinka__FumFoozler', 'GahGinka', 'SlooSlonker__Age', 'SlooSlonker__BlumBlooper', 'SlooSlonker__FumFoozler', 'SlooSlonker__GahGinka', 'SlooSlonker', 'SlooSlonker__GenderDummy_F', 'SlooSlonker__GenderDummy_M', 'TrumTroopa__Age', 'TrumTroopa__BlumBlooper', 'TrumTroopa__FumFoozler', 'TrumTroopa__GahGinka', 'TrumTroopa__SlooSlonker', 'TrumTroopa', 'TrumTroopa__GenderDummy_F', 'TrumTroopa__GenderDummy_M', 'WhoWhonker__Age', 'WhoWhonker__BlumBlooper', 'WhoWhonker__FumFoozler', 'WhoWhonker__GahGinka', 'WhoWhonker__SlooSlonker', 'WhoWhonker__TrumTroopa', 'WhoWhonker', 'WhoWhonker__GenderDummy_F', 'WhoWhonker__GenderDummy_M', 'GenderDummy_F__Age', 'GenderDummy_F__BlumBlooper', 'GenderDummy_F__FumFoozler', 'GenderDummy_F__GahGinka', 'GenderDummy_F', 'GenderDummy_M__Age', 'GenderDummy_M__BlumBlooper', 'GenderDummy_M__FumFoozler', 'GenderDummy_M__GahGinka', 'GenderDummy_M__GenderDummy_F', 'GenderDummy_M']
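For anyone who wants to try something similar, here is a rough sketch of what "ridge regression with product features" can look like. It is not my actual code: the function names, the tiny made-up dataset, the `alpha` value, and the order of the two halves in each `A__B` name are all assumptions for illustration. It uses NumPy and the closed-form ridge solution, with the columns centered so the intercept is not penalized.

```python
from itertools import combinations

import numpy as np


def add_product_features(X, names):
    """Append the elementwise product of every pair of columns to X."""
    columns = [X[:, k] for k in range(X.shape[1])]
    out_names = list(names)
    for i, j in combinations(range(X.shape[1]), 2):
        columns.append(X[:, i] * X[:, j])
        # The "later__earlier" naming order is a guess at the convention.
        out_names.append(f"{names[j]}__{names[i]}")
    return np.column_stack(columns), out_names


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge fit; centering leaves the intercept unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    weights = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    intercept = y_mean - x_mean @ weights
    return weights, intercept


def predict(X, weights, intercept):
    """Predicted noise for each row of X."""
    return X @ weights + intercept


# Made-up rows: Age, and a 0/1 flag for whether a WhoWhonker was given.
base = np.array([[1, 0], [3, 1], [5, 1], [2, 0], [4, 0]], dtype=float)
noise = np.array([12.0, 25.5, 32.5, 14.0, 18.0])  # made-up targets

features, feature_names = add_product_features(base, ["Age", "WhoWhonker"])
weights, intercept = fit_ridge(features, noise, alpha=1.0)
print(feature_names)
```

The `alpha` value would normally be picked by cross-validation; larger values shrink the weights more strongly.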
To optimize the noise, I assigned the presents randomly, checking that each was unique. Then I did a Markov chain optimization procedure where I swapped presents if it improved the score or made it worse by less than a random threshold. This procedure could probably be improved. I’m thinking about applying a quadratic programming library to the optimization procedure, but that seems kind of difficult.
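The swap procedure can be sketched roughly as follows. This is a minimal illustration rather than the code I ran. It assumes a fixed pool with exactly two presents per child and a hypothetical `score` callable that returns the model's predicted noise for an assignment; `steps`, `max_slack`, and `seed` are made-up knobs. It does not enforce any uniqueness constraint, and it remembers the best assignment seen so far, since accepting slightly worse swaps means the final state is not necessarily the best one.

```python
import random


def optimize_by_swaps(children, pool, score, steps=5000, max_slack=1.0, seed=0):
    """Maximize score(assignment) by swapping presents between children.

    children: list of names; pool: list with exactly two presents per child.
    score: callable taking {child: (present, present)} and returning a float.
    """
    if len(pool) != 2 * len(children):
        raise ValueError("pool must hold exactly two presents per child")
    rng = random.Random(seed)
    slots = list(pool)
    rng.shuffle(slots)  # random starting assignment

    def as_assignment(s):
        # Child k holds the presents in slots 2k and 2k + 1.
        return {child: (s[2 * k], s[2 * k + 1]) for k, child in enumerate(children)}

    current = score(as_assignment(slots))
    best_score, best_slots = current, list(slots)
    for _ in range(steps):
        i, j = rng.sample(range(len(slots)), 2)
        if i // 2 == j // 2 or slots[i] == slots[j]:
            continue  # same child or same present: nothing would change
        slots[i], slots[j] = slots[j], slots[i]
        candidate = score(as_assignment(slots))
        threshold = rng.uniform(0.0, max_slack)  # fresh random tolerance
        if candidate > current - threshold:
            current = candidate  # keep the swap
            if current > best_score:
                best_score, best_slots = current, list(slots)
        else:
            slots[i], slots[j] = slots[j], slots[i]  # undo the swap
    return as_assignment(best_slots), best_score
```

To look for the minimum-noise assignment instead, pass a score that returns the negated prediction, for example `lambda a: -predicted_noise(a)` with a hypothetical `predicted_noise` function.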
Maximum noise proposal
Estimated noise: 195.72749659660874
Andy Sue Who WhoWhonker SlooSlonker
Betty Drew Who FumFoozler SlooSlonker
Sally Sue Who FumFoozler SlooSlonker
Phoebe Drew Who BlumBlooper FumFoozler
Freddie Lou Who TrumTroopa WhoWhonker
Eddie Sue Who TrumTroopa WhoWhonker
Cindy Drew Who GahGinka FumFoozler
Mary Lou Who BlumBlooper GahGinka
Ollie Lou Who BlumBlooper WhoWhonker
Johnny Drew Who TrumTroopa BlumBlooper
Minimum noise proposal
Estimated noise: 129.9544674398252
Andy Sue Who TrumTroopa GahGinka
Betty Drew Who BlumBlooper WhoWhonker
Sally Sue Who BlumBlooper WhoWhonker
Phoebe Drew Who BlumBlooper WhoWhonker
Freddie Lou Who FumFoozler GahGinka
Eddie Sue Who FumFoozler TrumTroopa
Cindy Drew Who SlooSlonker WhoWhonker
Mary Lou Who BlumBlooper SlooSlonker
Ollie Lou Who FumFoozler TrumTroopa
Johnny Drew Who FumFoozler SlooSlonker
The same Who Child can appear in several rows while they are growing up (for instance, Who Child 1530, Eddie Lou Who, appears in the dataset 12 times, from 1 year old in year 19 to 12 years old in year 30). This is not a bug: children do in fact have more than one Christmas while growing up.
I don’t think there are any cases where the same ID shows up many years apart; if there were, that would indeed be a bug. Do you have an example you can point me at?
Ah, I got confused by Phoebe Drew Who, who shows up with IDs 1533 and 1553.
After I posted my first post, but before reading the other answers, it occurred to me that I was probably leaving noise on the table by not modeling the individual Who children. Reading the other answers, it seems like doing that is key.
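One plausible way to model individual children in a linear model is sketched below. It is an assumption, not a description of exactly what I did. A plain per-child indicator column only shifts each child's predicted noise by a constant, which would not change which assignment of presents scores best, so the sketch builds child-by-toy indicator columns instead. The function name, the record layout, and the `Toy__Child_ID` naming are hypothetical, and the two records are made up.

```python
def child_toy_features(records, toys):
    """Build 0/1 columns marking which child received which toy.

    records: list of dicts with a 'child_id' and a 'toys' tuple.
    Returns (column names, list of rows).
    """
    child_ids = sorted({r["child_id"] for r in records})
    names = [f"{toy}__Child_{cid}" for cid in child_ids for toy in toys]
    column = {name: k for k, name in enumerate(names)}
    rows = []
    for r in records:
        row = [0.0] * len(names)
        for toy in r["toys"]:
            row[column[f"{toy}__Child_{r['child_id']}"]] = 1.0
        rows.append(row)
    return names, rows


# Made-up records for two child IDs mentioned in this thread.
records = [
    {"child_id": 1530, "toys": ("WhoWhonker", "TrumTroopa")},
    {"child_id": 1533, "toys": ("BlumBlooper", "WhoWhonker")},
]
toys = ["BlumBlooper", "TrumTroopa", "WhoWhonker"]
names, rows = child_toy_features(records, toys)
```

These columns would be appended to the other features before fitting. The ridge penalty matters more here, because each child contributes only a handful of rows.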
Revised results below when taking individual idiosyncrasies into account in the ridge regression:
MIN SOLUTION
Estimated noise: 130.6603587239382
Andy Sue Who TrumTroopa FumFoozler
Betty Drew Who WhoWhonker BlumBlooper
Sally Sue Who BlumBlooper WhoWhonker
Phoebe Drew Who WhoWhonker BlumBlooper
Freddie Lou Who TrumTroopa GahGinka
Eddie Sue Who GahGinka FumFoozler
Cindy Drew Who SlooSlonker BlumBlooper
Mary Lou Who SlooSlonker WhoWhonker
Ollie Lou Who SlooSlonker FumFoozler
Johnny Drew Who TrumTroopa FumFoozler
MAX SOLUTION
Estimated noise: 210.90134871092357
Andy Sue Who SlooSlonker WhoWhonker
Betty Drew Who SlooSlonker FumFoozler
Sally Sue Who TrumTroopa FumFoozler
Phoebe Drew Who SlooSlonker FumFoozler
Freddie Lou Who WhoWhonker BlumBlooper
Eddie Sue Who BlumBlooper WhoWhonker
Cindy Drew Who GahGinka FumFoozler
Mary Lou Who TrumTroopa GahGinka
Ollie Lou Who WhoWhonker BlumBlooper
Johnny Drew Who BlumBlooper TrumTroopa