I’ll walk through how I’d analyze this problem; let me know if I haven’t answered your questions by the end.
First, problem structure. You have three unknowns, which you want to estimate from the data: a shape parameter, a scale parameter, and an indicator of which model is correct.
“Which model is correct” is probably easiest to start with. I’m not completely sure that I’ve followed what your spreadsheet is doing, but if I understand correctly, then that’s probably an overly complicated way to tackle the problem. You want P(model|data), which will be determined via Bayes’ Rule by P(data|model) and your prior probability for each model being correct. The prior is unlikely to matter much unless your dataset is tiny, so P(data|model) is the important part. That’s an integral:
$$P(\text{data} \mid \text{model}) \;=\; \int \prod_i P(x_i \mid \mu, \sigma, \text{model})\, dP(\mu, \sigma)$$
In your case, you’re approximating that integral with a grid over μ and σ (dP(μ,σ) is a shorthand for ρ(μ,σ) dμ dσ here). Rather than whatever you’re doing with timesteps, you can probably just take the product of P(x_i | μ, σ, model), where x_i is the lifetime of the i-th component in your dataset, then sum over the grid. (If you are dealing with online data streaming in, then you would need to do the timestep thing.)
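Concretely, a minimal sketch of that grid sum might look like this (Weibull shown for illustration; the grid ranges, the flat prior over the grid, and the placeholder lifetimes are all stand-ins to swap for your own):

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

lifetimes = np.array([8.0, 11.5, 14.2, 21.0, 26.3])  # placeholder lifetimes (weeks)

# Grid over the shape and scale parameters (playing the role of mu and sigma).
shapes = np.linspace(0.5, 5.0, 100)
scales = np.linspace(1.0, 60.0, 200)

log_terms = np.empty((shapes.size, scales.size))
for i, k in enumerate(shapes):
    for j, lam in enumerate(scales):
        # Log of the product of per-point likelihoods at this grid point.
        log_terms[i, j] = stats.weibull_min.logpdf(lifetimes, c=k, scale=lam).sum()

# Flat prior over the grid cells: P(data|model) is approximately the average likelihood.
log_p_data_given_model = logsumexp(log_terms) - np.log(log_terms.size)
```

Doing the same with stats.lognorm and stats.fatiguelife (scipy’s name for Birnbaum-Saunders) gives the three evidences; weight them by your priors and normalise to get P(model|data).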
That takes care of the model part. Once that’s done, you’ll probably find that one model is like a gazillion times more likely than the other two, and you can just throw out the other two.
On to the 95% CI for the worst part in a million.
The distribution you’re interested in here is P(x_worst | model, data). x_worst is an order statistic. Its CDF is basically the CDF for any old point x raised to the power of 1,000,000 (strictly, for the worst, i.e. shortest-lived, component it’s the survival function that gets raised to that power); read up on order statistics to see exactly what expression to use. So if we wanted to do this analytically, we’d first compute P(x | model, data) via Bayes’ Rule:
$$P(x \mid \text{model}, \text{data}) \;=\; \frac{P(\text{data} \cup \{x\} \mid \text{model})}{P(\text{data} \mid \text{model})}$$
… where both pieces on the right would involve our integral from earlier. Basically, you imagine adding one more point x to the dataset and see what that would do to P(data|model). If we had a closed-form expression for that distribution, we could raise its survival function to the millionth power to get a closed-form expression for the relevant order statistic, and from there we’d get a 95% CI in the usual way.
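For reference, the exact expression: if the million lifetimes are treated as independent draws from a fitted distribution with CDF F, then the worst (shortest) of them has CDF

$$P(x_{\text{worst}} \le x) = 1 - \bigl(1 - F(x)\bigr)^{10^6}$$

(the maximum would be F(x) raised to that power instead).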
In practice, that’s probably difficult, so let’s talk about how to approximate it numerically.
First, the order statistic part. As long as we can sample from the posterior distribution P(x|model,data), that part’s easy: generate a million samples of x, take the worst, and you have a sample of x_worst. Repeat that process a bunch of times to compute the 95% CI. (This is not the same as the worst component in 20M, but it’s not any harder to code up.)
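A minimal sketch of that loop, assuming we can already sample x directly (here I just plug in fixed Weibull parameters as placeholders, which is the simple case discussed next):

```python
import numpy as np

rng = np.random.default_rng(0)
shape, scale = 1.8, 20.0     # placeholder Weibull parameters (e.g. MAP estimates)
n_components = 1_000_000     # fleet size we care about
n_reps = 1_000               # Monte Carlo replicates of "worst of a million"

worst = np.empty(n_reps)
for r in range(n_reps):
    lifetimes = scale * rng.weibull(shape, size=n_components)  # numpy's weibull is scale-1
    worst[r] = lifetimes.min()

lo, hi = np.percentile(worst, [2.5, 97.5])  # 95% interval for the worst part in a million
```

(If the inner million-sample draw is too slow, note that the smallest of a million uniform p-values follows a Beta(1, 10^6) distribution, so you can draw the worst p directly and push it through the inverse CDF.)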
Next, the posterior distribution for x. This is going to be driven by two pieces: uncertainty in the parameters μ and σ, and random noise from the distribution itself. If the dataset is large enough, then the uncertainty in μ and σ will be small, so the distribution itself will be the dominant term. In that case, we can just find the best-fit (i.e. maximum a posteriori) estimates of μ and σ, and then declare that P(x|model,data) is approximately the standard distribution (Weibull, log-normal, etc.) with those exact parameters. Presumably we can sample from any of those distributions with known parameters, so we go do the order statistic part and we’re done.
If the uncertainty in μ and σ is not small enough to ignore, then the problem gets more complicated—we’ll need to sample from the posterior distribution P(μ,σ|model,data). At that point we’re in the land of Laplace approximation and MCMC and all that jazz; I’m not going to walk through it here, because this comment is already really long.
So one last thing to wrap it up. I wrote all that out because it’s a great example problem of how to Bayes, but there’s still a big problem at the model level: the lifetime of the millionth-worst component is probably driven by qualitatively different processes than the vast majority of other components. If some weird thing happens one time in 10000, and causes problems in the components, then a best-fit model of the whole dataset probably won’t pick it up at all. Nice-looking distributions like Weibull or log-normal just aren’t good at modelling two qualitatively different behaviors mixed into the same dataset. There’s probably some standard formal way of dealing with this kind of thing—I hear that “Rare Event Modelling” is a thing, although I know nothing about it—but the fundamental problem is just getting any relevant data at all. If we only have a hundred thousand data points, and we think that millionth-worst is driven by qualitatively different processes, then we have zero data on the millionth-worst, full stop. On the other hand, if we have a billion data points, then we can just throw out all but the worst few thousand and analyse only those.
Thanks John, that’s just the kind of thing I was hoping for.
The point about behaviour for the very worst components not following the same distribution as the population which I’m studying is a very good one.
I should have given a little more background to the task.
Firstly, the task is to compare a new component with an old one. Data is available for the old one and data is being collected for the new one. The new one has much higher average life but also much higher variance between samples, which led to the question of how the two compare on worst-one-in-a-million performance.
Secondly, the number of data points is very small. Time cost per data point is ~1 week. I’ll be lucky if I have 10 data points in total. Essentially, I’m tasked to try to approve the component with the fewest possible data points.
(This means that so far I don’t have a favoured model (although Weibull is ranking lower than the other two), and I also don’t have a firm impression of μ and σ.)
I guess all I can achieve with limited data is to say that, for components which fit the typical failure mechanism, the worst one-in-a-million is better for the new component than the old. However, we don’t have any information on Black Swan events caused by a different failure mechanism for either component, and I definitely need to make this clear in any presentation I make on the subject. With such a lack of data points I don’t think I can use rare event modelling to fill in the gaps.
I will have a look at the ideas about:
1. Taking multiple random samples of a million to get a better estimate of the CI. I’m pretty sure this is comfortably within my limited programming skills.
(My limited programming skills are the main reason that my method is a bit long-winded – doing it step by step was my way of trying to stop myself from making a mistake without noticing. I did make a mistake with my application of the scale parameter for Birnbaum-Saunders at one point, which this method helped me spot.)
2. Looking at how adding extra data points affects my result. I think I can use this to compare value of information with cost of information to get a good impression of when we have enough.
Having ~ten data points makes this way more interesting. That’s exactly the kind of problem that I specialize in.
For the log-normal distribution, it should be possible to do the integral for P(data|model) explicitly. The integral is tractable for the normal distribution—it comes out proportional to a power of the sample variance—so just log-transform the data and use that. If you write down the integral for normal-distributed data explicitly and plug it into wolframalpha or something, it should be able to handle it. That would circumvent needing to sample μ and σ.
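For concreteness, here’s what that comes out to under one common reference prior, p(μ,σ) ∝ 1/σ (that prior choice is my assumption; with an improper prior the overall constant is only meaningful relative to a matching choice for the other models). Writing y_i = log x_i, n for the number of points and S = Σ_i (y_i − ȳ)²:

$$P(\text{data} \mid \text{log-normal}) \;=\; \frac{\Gamma\!\left(\frac{n-1}{2}\right)}{2\sqrt{n}\,\bigl(\pi S\bigr)^{(n-1)/2}} \;\times\; \prod_i \frac{1}{x_i}$$

The ∏ 1/x_i factor is the Jacobian of the log transform; it’s what keeps this comparable with the Weibull and Birnbaum-Saunders evidences computed on the original scale.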
I don’t know if there’s a corresponding closed form for Birnbaum-Saunders; I had never even heard of it before this. The problem is still sufficiently low-dimensional that it would definitely be tractable computationally, but it would probably be a fair bit of work to code.
(I thoroughly enjoyed this exchange and think I learned some things from it, and wanted to thank you for having it out in the open.)
Birnbaum-Saunders is an interesting one. For the purposes of fatigue analysis, the assumptions which bring about the three models are:
Weibull—numerous failure modes (of similar failure speed) racing to see which causes the component to fail first
Log-normal—Damage per cycle is proportional to current damage
Birnbaum-Saunders—Damage per cycle is normally distributed and independent of previous damage
My engineering gut says that this component is probably somewhere between log-normal and Birnbaum-Saunders (I think the proportionality will decay as damage increases), which is maybe why I don’t have a clear winner yet.
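For reference, here’s roughly how that last assumption generates the Birnbaum-Saunders form (the symbols are mine: μ_d and σ_d for the mean and standard deviation of the damage per cycle, ω for the failure threshold). Treating damage as effectively monotone, the component has failed by cycle t iff the accumulated damage exceeds ω, so by the central limit theorem

$$P(T \le t) \;\approx\; P\!\left(\sum_{i=1}^{t} Y_i > \omega\right) \;=\; \Phi\!\left(\frac{t\mu_d - \omega}{\sigma_d\sqrt{t}}\right) \;=\; \Phi\!\left(\frac{1}{\alpha}\Bigl(\sqrt{t/\beta} - \sqrt{\beta/t}\Bigr)\right), \qquad \alpha = \frac{\sigma_d}{\sqrt{\mu_d\,\omega}},\quad \beta = \frac{\omega}{\mu_d},$$

which is the Birnbaum-Saunders CDF with shape α and scale β.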
***
I think I understand now where my original reasoning was incorrect when I was calculating the expected worst in a million. I was just calculating the worst in a million for each model and taking a weighted average of the answers. This meant that bad values from the outlying candidate pdfs were massively suppressed.
I’ve done some sampling of the worst in a million by repeatedly creating two random numbers from 0 to 1. I use the first to select a μ,σ combination based on the posterior for each pair. I use the second random number as a p-value. I then push those values through the inverse CDF to get an x.
Is this an ok sampling method? I’m not sure if I’m missing something or should be using MCMC. I definitely need to read up on this stuff!
The worst in a million is currently dominated by those occasions where the probability distribution is an outlier. In those cases the p-value doesn’t need to be particularly extreme to achieve low x.
I think my initial estimates were based either mainly on uncertainty in p or mainly on uncertainty in μ,σ. The sampling method allows me to account for uncertainty in both, which definitely makes more sense. The model seems to react sensibly when I add potential new data, so I think I can assess much better now how many data points I require.
That sampling method sounds like it should work, assuming it’s all implemented correctly (not sure what method you’re using to sample from the posterior distribution of μ, σ).
Worst case in a million being dominated by parameter uncertainty definitely makes sense, given the small sample size and the rate at which those distributions fall off.
For μ,σ I effectively created a quasi-cumulative distribution with the parameter pairs as the x-axis.
$(\mu_1,\sigma_1),\ (\mu_2,\sigma_1),\ (\mu_3,\sigma_1),\ \ldots,\ (\mu_1,\sigma_2),\ (\mu_2,\sigma_2),\ (\mu_3,\sigma_2),\ \ldots,\ (\mu_n,\sigma_m)$
The random number defines the relevant point on the y-axis. From there I get the corresponding μ,σ pair from the x-axis.
If this method works I’ll probably have to code the whole thing instead of using a spreadsheet as I don’t have nearly enough μ,σ values to get a good answer currently.
At what point is the data used?
I use it to determine the relative probabilities of each μ,σ pair, which in turn define the pseudo-CDF.
Ok, that sounds right.
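To wrap up, here’s a minimal sketch of how that whole scheme might look once coded (Weibull again for illustration; the grid, the flat prior over it, and the placeholder data are stand-ins, and I’ve used the Beta(1, 10^6) shortcut for the smallest of a million p-values rather than drawing a million lifetimes per replicate, which may differ from the spreadsheet version):

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(0)
lifetimes = np.array([8.0, 11.5, 14.2, 21.0, 26.3])  # placeholder data (weeks)

# 1. Posterior weight for each (shape, scale) grid pair, flat prior over the grid.
shapes = np.linspace(0.5, 5.0, 80)
scales = np.linspace(1.0, 60.0, 150)
shape_grid, scale_grid = np.meshgrid(shapes, scales, indexing="ij")
log_like = np.array([
    stats.weibull_min.logpdf(lifetimes, c=k, scale=lam).sum()
    for k, lam in zip(shape_grid.ravel(), scale_grid.ravel())
])
post = np.exp(log_like - logsumexp(log_like))
post /= post.sum()  # guard against floating-point drift

# 2. Each replicate: draw a (shape, scale) pair from the posterior (the job of the
#    "first random number" and the pseudo-CDF), draw the smallest of a million
#    p-values, then push it through the inverse CDF to get the worst lifetime.
n_components, n_reps = 1_000_000, 5_000
pair_idx = rng.choice(post.size, size=n_reps, p=post)
p_worst = rng.beta(1, n_components, size=n_reps)   # min of 10^6 uniforms ~ Beta(1, 10^6)
x_worst = stats.weibull_min.ppf(p_worst,
                                c=shape_grid.ravel()[pair_idx],
                                scale=scale_grid.ravel()[pair_idx])

print(np.percentile(x_worst, [2.5, 50.0, 97.5]))   # 95% interval and median for the worst part
```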