I was first exposed to this in the context of baseball batting averages. I’ll relate that example in case it helps someone.
(For those unfamiliar with baseball, a player’s “batting average” is the number of hits the player has made divided by the number of hit attempts, also known as “at-bats.” Ruth and Gehrig were stars in the early 20th century. Ruth did enjoy a drink, and Gehrig never missed a game. The rest is made up.)
In 1927, Gehrig (injured but playing every game) and Ruth (on a months-long drinking tear, therefore sitting out lots of games while staggering through the rest) both performed terribly at the plate during the first half of the season: By the All-Star break, Ruth’s batting average was a pitiful .190, while Gehrig’s was only slightly less anemic at .200.
During the second half of the season, a dried-out Ruth and healthy Gehrig tore up the league, batting a torrid .390 and .400 respectively.
Yet despite the fact that Gehrig’s batting average exceeded Ruth’s in each half of the season, Ruth’s average over the entire season was greater than Gehrig’s. How can this be?
Answer: Remember all of those games that a hung-over Ruth sat out during the first half? The result was that far fewer of Ruth’s at-bats occurred during the dismal first half of the season than during the torrid second half. As a result, Ruth’s overall season average was determined to a greater extent by his second half performance. Gehrig’s season average, by contrast, was midway between his averages for the two halves. Here are the numbers:
Ruth: 19 hits / 100 at-bats = .190 in 1st half, 78⁄200 = .390 in 2nd half, 97⁄300 = .323 overall
Gehrig: 40⁄200 = .200 in 1st half, 80⁄200 = .400 in 2nd half, 120⁄300 = .300 overall
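For anyone who wants to check the arithmetic, here is a minimal Python sketch. It uses only the made-up hits and at-bats for each half given above; the names `halves` and `season_average` are mine, chosen for illustration. The season average is total hits over total at-bats, so each half counts in proportion to its at-bats. Summing Gehrig’s two halves gives 120 hits in 400 at-bats.

```python
# (hits, at-bats) for each half of the season, from the made-up example
halves = {
    "Ruth":   [(19, 100), (78, 200)],
    "Gehrig": [(40, 200), (80, 200)],
}

def season_average(splits):
    """Pool the halves: total hits divided by total at-bats."""
    total_hits = sum(hits for hits, _ in splits)
    total_at_bats = sum(at_bats for _, at_bats in splits)
    return total_hits / total_at_bats

for player, splits in halves.items():
    per_half = [round(hits / at_bats, 3) for hits, at_bats in splits]
    print(player, per_half, round(season_average(splits), 3))

# Prints:
# Ruth [0.19, 0.39] 0.323
# Gehrig [0.2, 0.4] 0.3
```

Gehrig wins each half, but two thirds of Ruth’s at-bats fall in the strong second half, while Gehrig’s are split evenly, so the pooled averages come out in the opposite order.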
1) The total number of Gehrig’s at-bats for the season in my previous comment should have been 400, not 300, so his overall line is 120⁄400 = .300. (Sorry for the duplication.)
2) I haven’t seen many attempts to actually answer the question in the posting. I’ll stick out my neck, after making a couple of simplifying assumptions: ASSUMING that the mortality frequencies are reflective of the true underlying probabilities, and that the assignment of treatments to patients was otherwise random, I’d use treatment A on men regardless of whether they have a history of heart disease, and treatment B on women. (In the real world, where these assumptions don’t necessarily hold, I’d have to think a lot harder about the unreliability of the smaller sample sizes, and of course I’d try to find out all I could about further confounding factors, the rules that were used for treatment selection, potential mechanisms underlying the gender and history effects, etc.) Critical comments invited.