I found this challenge difficult and awkward due to the high number of possible response-predictor pairs (disaster A in province B is predicted by disaster/omen X in province Y with a Z-year delay), low number of rows (if you look at each province separately there are only 1080 records to play with), and probabilistic linkages (if events had predicted each other more reliably, the shortage of data would have been less of an issue).
This isn’t necessarily a criticism—sometimes reality is difficult and awkward, and it’s good to prepare for that—and I get that it’s incongruous to hear “it’s too hard!” from the person who took second place out of a cohort that all did much better than random. Still, I think this problem would have been more approachable if we’d had fewer predictors and/or more data.
Misc other thoughts:
Rain of Fish is random. Sometimes fish just fall out of the sky. This is a thing that happens. It has a 2% chance of happening any year in any province.
Example of the problems caused by too few rows per column: I managed to convince myself there was a weak but solid connection between Rain of Fish and Plague in some provinces. (In my defense, it made intuitive sense that having rapidly-decaying fish all over your territories might make people sick.)
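(For what it's worth, a quick simulation shows how easily a "weak but solid" connection like that can appear by chance. The ~1080 rows per province and the 2% Rain of Fish rate come from this thread; the 12 provinces and 10% Plague rate below are made-up illustrative numbers, and the two series are generated independently, so any apparent link is pure noise.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative only: the province count and Plague rate are invented; Rain of Fish
# uses the 2%-per-province-per-year figure quoted above. The two event series are
# independent by construction.
n_provinces, n_years = 12, 1080
fish = rng.random((n_provinces, n_years)) < 0.02
plague = rng.random((n_provinces, n_years)) < 0.10

for p in range(n_provinces):
    rate_in_fish_years = plague[p][fish[p]].mean() if fish[p].any() else float("nan")
    print(f"Province {p}: Plague rate {rate_in_fish_years:.0%} in Rain-of-Fish years "
          f"vs {plague[p].mean():.0%} overall")

# With only ~20 Rain-of-Fish years per province, a few provinces will look like
# they have a real link purely by chance.
```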
When the Titans rage against the bars of their prisons, two things happen:
There is an Earthquake in the province of their prison.
One or more fragments of them escape (the first into the province where they are imprisoned, additional ones into adjacent provinces). These fragments look to mortals like black doves, but carry fragments of Titanic malice.
. . . so my joke answer of “earthquakeproof every province, including the ones that don’t belong to you” would actually have been a good idea long-term? That’s delightful.
Was the need to use joins to analyze the data too large a barrier to entry?
I did my analysis without using joins. I created a sub-df for each province, reset their indices, then recombined; and for “does this predict that with a lag of N years?” investigations, I shifted one of the sub-dfs back by N before recombining. Joins would have made more sense in retrospect, but not knowing about them wouldn’t have stopped me cold.
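For concreteness, a minimal pandas sketch of that shift-and-recombine approach (column names like 'province', 'year', 'rain_of_fish', and 'plague' are hypothetical stand-ins, not the dataset's actual columns):

```python
import pandas as pd

def lagged_pairs(df: pd.DataFrame, predictor: str, response: str, lag: int) -> pd.DataFrame:
    """Pair each province-year's predictor with the response `lag` years later."""
    pieces = []
    for province, sub in df.groupby("province"):
        sub = sub.sort_values("year").reset_index(drop=True)
        pieces.append(pd.DataFrame({
            "province": province,
            "year": sub["year"],
            predictor: sub[predictor],
            # shift(-lag) lines up year X's predictor with year X+lag's response
            f"{response}_in_{lag}y": sub[response].astype(float).shift(-lag),
        }))
    return pd.concat(pieces, ignore_index=True).dropna()

# e.g. does Rain of Fish in year X predict Plague in year X+1?
# paired = lagged_pairs(events, "rain_of_fish", "plague", lag=1)
# paired.groupby("rain_of_fish")["plague_in_1y"].mean()
```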
due to the high number of possible response-predictor pairs
My hope was that people would figure out the existence of the Population and Wealth sub-variables, at which point I think figuring out what effects omens had would have been much much easier. Sadly it seems I illusion-of-transparencied myself on how hard that would be to work out. People figured out a lot of the intermediate correlations I expected to be useful there (enough to get some very good answers), but no-one seems to have actually drawn the link that would have connected them.
My hope was that you would start with sub-results like:
Famine in Year X means that Famine is unlikely in Year X+1
Plague in Year X also means that Famine is unlikely in Year X+1
Either Famine or Plague in Year X means that you are unlikely to Pillage a neighbor in Year X+1
Omens in Year X that predict a high/low likelihood of Famine in Year X+1 (e.g. Moon Turns Red/Rivers of Blood) also predict a high/low likelihood of you Pillaging a neighbor in Year X+1
and eventually arrive at the conclusion of ‘maybe there is an underlying Population variable that many different things interact with’.
(I even tried to drop a hint about the Population and Wealth variables in the problem statement. I guess it’s just much harder than I expected to make deductions like that.)
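Checking a sub-result like the first one listed above is just a conditional rate against a baseline. A minimal sketch, assuming a hypothetical single-province frame sorted by year with boolean event columns:

```python
import pandas as pd

def followup_rate(sub: pd.DataFrame, given: str, then: str) -> tuple[float, float]:
    """Rate of `then` in year X+1 given `given` in year X, vs. the overall rate.

    `sub` is assumed to be one province's rows, sorted by year, with boolean
    event columns (hypothetical names). A conditional rate well below the
    baseline is the kind of sub-result listed above.
    """
    next_then = sub[then].astype(float).shift(-1)  # the event, one year later
    conditional = next_then[sub[given]].mean()     # rate when the predictor fired
    baseline = next_then.mean()                    # unconditional rate
    return conditional, baseline

# e.g. "Famine in Year X means that Famine is unlikely in Year X+1":
# followup_rate(province_df, given="famine", then="famine")
```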
for “does this predict that with a lag of N years?” investigations, I shifted one of the sub-dfs back by N before recombining

That...is in fact a join?
it’s just much harder than I expected to make deductions like that
This is something I noticed from some earlier .scis! I forget which, now. My hypothesis was that finding underlying unmentioned causes was really hard without explicitly using causal machinery in your exploration process, and I don’t know how to, uh, casually set up causal inference, and it’s something I would love to try learning at some point. Like, my intuition is something akin to “try a bunch of autogenerated causal graphs, see if something about correlations says [these] could work and [those] probably don’t, inspect them visually, notice that all of [these] have a commonality”. No idea if that would actually pan out or if there’s a much better way. There’s a lot of friction in “guess maybe there’s an underlying cause, do a lot of work to check that one specific guess, anticipate you’d go through many false guesses and maybe even there isn’t such a thing on this problem”.
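A cheap first pass at the "notice that all of these have a commonality" step (not the full causal-graph search, just a screen) might look like the following: group event columns that are all mutually correlated, and treat each such group as a candidate for sharing a hidden driver. Column names here are hypothetical, and mutual correlation is of course far from proof of a latent cause:

```python
import pandas as pd

def mutually_correlated_groups(df: pd.DataFrame, cols: list[str], threshold: float = 0.1):
    """Greedily group columns whose pairwise |correlation| all exceed `threshold`.

    A crude screen, not causal inference: a mutually correlated group *might*
    share an underlying driver (e.g. a hidden Population variable), or might
    just reflect direct effects or coincidence.
    """
    corr = df[cols].astype(float).corr().abs()
    groups: list[list[str]] = []
    for col in cols:
        for group in groups:
            if all(corr.loc[col, other] >= threshold for other in group):
                group.append(col)
                break
        else:
            groups.append([col])
    return [g for g in groups if len(g) > 1]

# e.g. mutually_correlated_groups(events, ["famine", "plague", "pillaged_neighbor",
#                                          "moon_turns_red", "rivers_of_blood"])
```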
What I was (haphazardly, inarticulately) getting at is that I never used any built-in functions with ‘join’ in the name, or for that matter thought anything along the lines of “I will Do a Join now”. In other words, I don’t think needing to know about joins was a barrier to entry, because I never explicitly used that information when working on this problem.