I came across this puzzle via aphyer’s post, and got inspired to give it a try.
Here is the fit I was able to get on the existing sites (Performance vs. Predicted Performance). Some notes on it:
Seems good enough to run with. None of the highest predicted existing sites had a large negative residual, and the highest predicted new sites give some buffer.
Three observations I made along the way.
First (which is mostly redundant with what aphyer wound up sharing in his second post):
Almost every variable is predictive of Performance on its own, but none of the continuous variables have a straightforward linear relationship with Performance.
Second:
Modeling the effect of location could be tricky. Imagine, for instance, that on Earth Australia and Mexico were especially good places for Performance, or that on a checkerboard Performance was higher on the black squares.
Third:
The ZPPG Performance variable has a skewed distribution which does not look like what you’d get if you were adding a bunch of variables, but does look like something you might get if you were multiplying several variables. And multiplication seems plausible for this scenario, e.g. perhaps such-and-such a disturbance halves Performance and this other factor cuts performance by a quarter.
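To illustrate the intuition above, here is a minimal simulation (made-up multipliers and probabilities, not fitted to the puzzle data) showing that a baseline scaled by several independent multiplicative disturbances produces a right-skewed distribution, with the mean pulled above the median:

```python
import random
import statistics

random.seed(0)

def simulate_performance(n_sites=10_000, baseline=100.0):
    """Each site starts at a baseline; each hypothetical disturbance
    independently hits about half the sites and multiplies performance
    down (e.g. one halves it, another cuts it by a quarter)."""
    samples = []
    for _ in range(n_sites):
        perf = baseline
        for penalty in (0.5, 0.75, 0.9):   # made-up multiplicative factors
            if random.random() < 0.5:
                perf *= penalty
        perf *= random.uniform(0.9, 1.1)   # small continuous noise
        samples.append(perf)
    return samples

samples = simulate_performance()
# A product-of-factors distribution is right-skewed: mean > median.
print(f"mean={statistics.fmean(samples):.1f} "
      f"median={statistics.median(samples):.1f}")
```

An additive model with the same factors would instead give a roughly symmetric, bell-shaped distribution.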
My updated list after some more work yesterday is
96286, 9344, 107278, 68204, 905, 23565, 8415, 62718, 83512, 16423, 42742, 94304
which I see is the same as simon’s list, with very slight differences in the order.
More on my process:
I initially modeled location with a k-nearest-neighbors calculation, assuming that a site’s location value equals the average residual of its k nearest neighbors (with location transformed to Cartesian coordinates). That, along with a linear regression predicting log(Performance), got me my first list of answers. I figured that list was probably good enough to pass the challenge: the sites’ predicted performance had a decent buffer over the required cutoff; the known sites with large predicted values did mostly have negative residuals, but those were only about 1/3 the size of the buffer; there were some sites with large negative residuals, but none among the sites with high predicted values, and I probably even had a big enough buffer to withstand one of them sneaking in; and the nearest-neighbors approach was likely to err mainly by giving overly middling values to sites near a sharp border (averaging across neighbors on both sides of it), which would cause me to miss some good sites but not to include any bad ones. So it seemed fine to stop my work there.
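A minimal sketch of that location model, assuming hypothetical site data (this is my reconstruction of the described approach, not the author's actual code): map latitude/longitude onto the unit sphere, then estimate a site's location effect as the mean log-performance residual of its k nearest known sites.

```python
import math

def to_cartesian(lat_deg, lon_deg):
    """Map latitude/longitude (degrees) onto the unit sphere so that
    straight-line distance approximates geographic closeness."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def knn_location_value(site, known_sites, residuals, k=5):
    """Estimate a site's location effect as the average regression
    residual of its k nearest known sites. `site` and `known_sites`
    are (lat, lon) pairs; `residuals` are the log-performance
    residuals of the known sites."""
    p = to_cartesian(*site)
    by_distance = sorted(
        (math.dist(p, to_cartesian(*q)), r)
        for q, r in zip(known_sites, residuals)
    )
    nearest = by_distance[:k]
    return sum(r for _, r in nearest) / len(nearest)

# Toy usage: two nearby known sites and one far-away one.
known = [(10.0, 20.0), (11.0, 21.0), (-40.0, 100.0)]
resid = [0.2, 0.3, -0.5]
print(knn_location_value((10.5, 20.5), known, resid, k=2))
```

Note the failure mode mentioned above: near a sharp border, the k nearest neighbors straddle both sides, so the estimate averages toward the middle.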
Yesterday I went back, looked at the residuals, and added some more handcrafted variables to my model to account for the visible patterns. The biggest was the sharp cutoff at Latitude ±36. I also changed my rescaling of Murphy’s Constant (because my previous attempt had negative residuals for low Murphy values), added a quadratic term to my rescaling of Local Value of Pi (because the dropoff from 3.15 isn’t linear), added a Shortitude cutoff at 45, and added a cos(Longitude−50) variable. Still kept the nearest neighbors calculation to account for any other location relevance (there is a little but much less now). That left me with 4 nines of correlation between predicted & actual performance, residuals near zero for the highest predicted sites in the training set, and this new list of sites. My previous lists of sites still seem good enough, but this one looks better.
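The handcrafted variables described above could look roughly like this (the exact transforms aren't shown in the comment, so these shapes are guesses; the column names are hypothetical):

```python
import math

def handcrafted_features(lat, lon, shortitude, local_pi):
    """One row of hypothetical handcrafted features mirroring the
    description above; guesses at the shapes, not the real model."""
    dpi = local_pi - 3.15
    return [
        1.0 if abs(lat) > 36 else 0.0,     # sharp cutoff at Latitude ±36
        1.0 if shortitude > 45 else 0.0,   # Shortitude cutoff at 45
        dpi,                               # linear term in Local Value of Pi
        dpi * dpi,                         # quadratic term: dropoff isn't linear
        math.cos(math.radians(lon - 50)),  # cos(Longitude − 50) periodic term
    ]

row = handcrafted_features(lat=40.0, lon=50.0, shortitude=30.0, local_pi=3.14)
print(row)
```

These rows would then feed the linear regression on log(Performance), with the nearest-neighbors residual average mopping up whatever location structure the explicit features miss.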
> Still kept the nearest neighbors calculation to account for any other location relevance (there is a little but much less now). That left me with 4 nines of correlation between predicted & actual performance,
Interesting, that definitely suggests some additional influences that we haven’t explicitly taken account of, rather than random variation.
> added a quadratic term to my rescaling of Local Value of Pi (because the dropoff from 3.15 isn’t linear)
As did aphyer, but I didn’t see any such effect, which is really confusing me. I’m pretty sure I would have noticed it if it were anywhere near as large as aphyer shows in his post.
edit: on the pi issue, see my reply to my own comment. Did you account for these factors as divisors dividing from a baseline, or as multipliers multiplying a baseline? (I did the latter.) edit: a conversation with aphyer clarified this. I see you are predicting log performance, as aphyer does, so a linear effect on the multiplier would then have a log taken of it, which makes it nonlinear.
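A small numeric check of that last point (the coefficient is made up): if a factor enters Performance as a linear multiplier 1 − a·dpi, then its contribution in log space is log(1 − a·dpi) ≈ −a·dpi − (a·dpi)²/2 − …, so a regression on log(Performance) picks up curvature even though the underlying effect is linear.

```python
import math

a = 2.0  # hypothetical slope of the linear multiplier
for dpi in (0.01, 0.05, 0.10):
    exact = math.log(1 - a * dpi)            # effect as seen in log space
    linear_only = -a * dpi                   # first-order approximation
    with_quadratic = -a * dpi - (a * dpi) ** 2 / 2
    print(f"dpi={dpi:.2f}  exact={exact:.5f}  "
          f"linear={linear_only:.5f}  +quadratic={with_quadratic:.5f}")
```

The quadratic approximation tracks the exact log noticeably better as a·dpi grows, which is one way a quadratic term can show up in a log-space fit of a linear multiplicative effect.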
Did a little robustness check, and I’m going to swap out 3 of these to make it:
96286, 23565, 68204, 905, 93762, 94408, 105880, 9344, 8415, 62718, 80395, 65607