As hinted in my other comment, I believe that this problem is simply a gaussian mixture model.
I assume there are d species of snarks. Species i is defined by a set of variables:
pi the probability of a snark to belong to this species
μi,σi the parameters of the gaussian of this species
ki the probability for members of this species to kill
Phpi the vector giving, for each phenotype k , the probability for a member of species i to have phenotype k. Remember that there are 675 phenotypes in total, but I believe that there are way less species, thus there must be some phenotypic variation within each species
On top of those variables, there is also a hidden matrix A: Ai,j=1 if and only if snark i belongs to species j. We want to estimate all of those numbers
I must confess I don’t really remember how the approach I implemented is called, but basically it has to do with alternating estimations:
Assuming known A, we can compute the MLE for p,μ,σ,k,Php.
μ is just the weighted average of the times of sights, \(\\)σ is simply the standard deviation
p is the proportion of each species in A
Php is the proportion of each phenotype in each species
k is a bit tricky, I estimated it using only the data for which the crew actually tried to hunt. This is not a problem if nobody is capable of knowing what a Boojan looks like. This might be a problem if, for any reason, the decision to give up or not is not random, and especially if the crew has access to information hidden to us. Anyhow, I had to make an assumption
Assuming known p,μ,σ,k,Php, we can estimate A by computing for each observation x, the likelihood of x belonging to each species, and then making A proportional to this likelihood
By randomly initializing everything, and going back and forth between the two estimates, we should end up with a pretty satisfying result
This approach is super slow, especially because I am trying several numbers of species d, I will reply to this comment when I get results
Hum, about 12 ? Probably at least 8, maybe up to 20, or even more, I don’t really know
One interesting point, it looks like there is one species, the Goojam, (or maybe 2 Goojam species), which always attacks, while all other species never attack. I.e.k only contains ones and zeros. This is not something I anticipated, it appeared when running my code, but it makes sense, so this might confirm the approach
My submission
BPGYHQ should be the set of 6 letters minimizing the probability of death
The letters, in increasing order of danger, are BPGYHQTSVINRDMKLOJUACWXEF, with BPGYHQTSVINRDMKLOJUACW (without XEF) being the solution maximizing the expected return, at least according to my models
If this is true, the pareto frontier is exactly the set of prefixes of this word, but since one can only submit one word for the bonus task, I will submit BPGYHQTS, a word of length 8 with a very low probability of death
It is remarkable to see that I find similar results to Yonge, despite having wildly different approaches!
My approach
As hinted in my other comment, I believe that this problem is simply a gaussian mixture model.
I assume there are d species of snarks. Species i is defined by a set of variables:
pi the probability of a snark to belong to this species
μi,σi the parameters of the gaussian of this species
ki the probability for members of this species to kill
Phpi the vector giving, for each phenotype k , the probability for a member of species i to have phenotype k. Remember that there are 675 phenotypes in total, but I believe that there are way less species, thus there must be some phenotypic variation within each species
On top of those variables, there is also a hidden matrix A: Ai,j=1 if and only if snark i belongs to species j. We want to estimate all of those numbers
I must confess I don’t really remember how the approach I implemented is called, but basically it has to do with alternating estimations:
Assuming known A, we can compute the MLE for p,μ,σ,k,Php.
μ is just the weighted average of the times of sights, \(\\)σ is simply the standard deviation
p is the proportion of each species in A
Php is the proportion of each phenotype in each species
k is a bit tricky, I estimated it using only the data for which the crew actually tried to hunt. This is not a problem if nobody is capable of knowing what a Boojan looks like. This might be a problem if, for any reason, the decision to give up or not is not random, and especially if the crew has access to information hidden to us. Anyhow, I had to make an assumption
Assuming known p,μ,σ,k,Php, we can estimate A by computing for each observation x, the likelihood of x belonging to each species, and then making A proportional to this likelihood
By randomly initializing everything, and going back and forth between the two estimates, we should end up with a pretty satisfying result
This approach is super slow, especially because I am trying several numbers of species d, I will reply to this comment when I get results
So, how many species are there?
Hum, about 12 ? Probably at least 8, maybe up to 20, or even more, I don’t really know
One interesting point, it looks like there is one species, the Goojam, (or maybe 2 Goojam species), which always attacks, while all other species never attack. I.e.k only contains ones and zeros. This is not something I anticipated, it appeared when running my code, but it makes sense, so this might confirm the approach
My submission
BPGYHQ should be the set of 6 letters minimizing the probability of death
The letters, in increasing order of danger, are BPGYHQTSVINRDMKLOJUACWXEF, with BPGYHQTSVINRDMKLOJUACW (without XEF) being the solution maximizing the expected return, at least according to my models
If this is true, the pareto frontier is exactly the set of prefixes of this word, but since one can only submit one word for the bonus task, I will submit BPGYHQTS, a word of length 8 with a very low probability of death
It is remarkable to see that I find similar results to Yonge, despite having wildly different approaches!