This seems like the exact sort of thing expected utility theory is made for.
Entropy is given by E[log 1/P]. Maximizing entropy can be seen as using a password-selection method with the assumption that log 1/P is the utility of a given password. Is that a reasonable utility function? What does it represent?
Well, 1/P seems to represent the rough number of guesses needed to guess a password. Which does seem like a relevant number for evaluating the utility of a password, as the difficulty in guessing it scales directly with the number of tries necessary. But should we take the logarithm of that, or should we just use 1/P directly? If we use it directly, then we are risk-neutral with respect to guessing difficulty. Is that reasonable?
I don’t think so, and here’s a model: You might imagine that you have a distribution of adversaries. Some might just be script kiddies who are going to try 1-3 logical guesses and move on. Most might be using bots or botnets and guessing more, say 1,000,000 guesses or so. And a small upper tail would consist of e.g. state actors who could spend an enormous amount of resources.
Assuming, unrealistically, that each of the adversaries would do equal amounts of harm when guessing your password, it seems like the value of the password difficulty would be in reducing the probability that they guess it, by locking out a subset of the adversaries. For simplicity, we could take the cumulative distribution function of the adversary guess budget distribution as our measure of the probability that they will be defeated by the password.
So this model implies that a proper way of selecting is to maximize E[cdf(1/P)], where cdf is the cumulative distribution function of attacker capabilities. The simple E[1/P] version can be seen as using an (improper) uniform prior where there is equal probability for any positive number of guesses, whether that is 1, 64827266195729462, a googolplex, Graham’s number, or whatever. Meanwhile, E[log 1/P] can be seen as using a (still improper) prior where the order of magnitude of the adversary’s capabilities is uniformly distributed. Both of these are unbounded, as tends to happen with improper priors, but if we used a proper prior, the utilities would be bounded, corresponding to strong risk-aversion.
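To make this concrete, here is a minimal sketch in Python comparing the three criteria on an OP-style scheme (with probability p0 we pick one fixed password, otherwise one of n others uniformly); the proper cdf F(q) = 1 - 1/q used below is just an assumed example, not anything canonical:

```python
import math

# OP-style scheme: with probability p0 pick one fixed password,
# otherwise pick uniformly among n other passwords.
def figures_of_merit(p0, n):
    groups = [(p0, 1), ((1 - p0) / n, n)]  # (per-password probability, count)
    def E(util):  # E[util(1/P)] over the password distribution
        return sum(cnt * p * util(1 / p) for p, cnt in groups if p > 0)
    linear = lambda q: q             # E[1/P]: improper uniform prior over q = 1/P
    logq = lambda q: math.log2(q)    # E[log 1/P]: entropy in bits (log-uniform prior)
    proper = lambda q: 1 - 1 / q     # an assumed proper attacker cdf, F(q) = 1 - 1/q
    return E(linear), E(logq), E(proper)

for p0, n in [(0.0, 1000), (0.5, 1000), (0.5, 10**9)]:
    print(p0, n, figures_of_merit(p0, n))
# Entropy for (0.5, 10**9) beats (0.0, 1000), but under the proper cdf the
# coin-flip schemes stay capped near 0.75, below the uniform scheme's 0.999.
```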
A little more detail seems worth providing. If you have a pdf (over probabilities) f(p), which may or may not actually integrate to 1 or to any finite value, then the corresponding pdf for our measure of attacker-capability q=1/p is f(1/q)/q^2; the relevant CDF here is the integral from 0 to q of this.
Conversely, if we start with a given cdf F(q), its pdf is F’(q), and the corresponding pdf for p is F’(1/p)/p^2.
So a uniform distribution over q (obviously improper) corresponds to F(q) = q and a pdf of 1/p^2. A uniform distribution over log q (again, obviously improper) corresponds to F(q) = log q and a pdf of 1/p.
If we continue this “one step further”, a (no longer improper) uniform prior on p means F’(1/p)/p^2 = const or F’(q)=const/q^2 or F(q) = const/q; obviously the constant should be negative. That is, this would mean trying to minimize the expected probability of the password we pick. Equivalently, trying to minimize sum(p^2) over all our possible passwords. So, in particular, for a fixed set of passwords this says to pick them all with equal probability.
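A quick check of that last claim, on a hypothetical fixed set of four passwords:

```python
# For a fixed password set, sum(p^2) is minimized by the uniform distribution
# (sum of p_i^2 >= 1/n by Cauchy-Schwarz, with equality iff every p_i = 1/n).
for dist in [[0.25, 0.25, 0.25, 0.25],
             [0.40, 0.30, 0.20, 0.10],
             [0.70, 0.10, 0.10, 0.10],
             [0.97, 0.01, 0.01, 0.01]]:
    print(dist, sum(p * p for p in dist))
# 0.25 (uniform) < 0.30 < 0.52 < 0.9412
```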
How does this figure of demerit do for the example in the OP? Let’s generalize it a little, so that with probability p we pick one password and with probability 1-p we pick one of N passwords.
The entropy here is p log 1/p + (1-p) log N/(1-p). If we call this E(p,N) then note that E(1/2, N^2/4) = E(0, N), so setting p=1/2 can be “fixed” by (roughly) squaring N.
The sum-of-squared-probabilities is p^2 + (1-p)^2/N. When p=0 this is 1/N. When p>0, even in the limit of infinite N it’s still p^2. So for any N, if you increase p beyond 1/sqrt(N) you cannot hope to compensate for this by increasing N further.
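Numerically, as a sketch (baseline: the honest uniform scheme with N = 1000, so 1/sqrt(N) is about 0.032 and p = 1/2 is far past it):

```python
import math

def entropy(p, n):  # p log2(1/p) + (1-p) log2(n/(1-p)), taking the p=0 and p=1 limits
    h = p * math.log2(1 / p) if p > 0 else 0.0
    return h + ((1 - p) * math.log2(n / (1 - p)) if p < 1 else 0.0)

def demerit(p, n):  # sum of squared probabilities
    return p * p + (1 - p) ** 2 / n

print(entropy(0, 1000), demerit(0, 1000))   # baseline: ~9.97 bits, 0.001
for n in [10**3, 10**6, 10**12]:
    print(n, entropy(0.5, n), demerit(0.5, n))
# Entropy overtakes the baseline as n grows; the demerit is floored at
# p^2 = 0.25 and never gets back down to the baseline's 0.001.
```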
The implied model of our attackers here is that the probability that an attacker can afford to guess at least N passwords goes down in proportion to 1/N. Not obviously very realistic, but not obviously worse than the improper priors considered above either :-).
tailcalled alluded to the following but didn’t make it explicit: If your prior is improper then you can make that CDF get unboundedly large, which means you can compensate for badness elsewhere that way, which means the pathology OP is about can happen; if your prior is proper then you can’t. So propriety is in some sense exactly the property we want here.
[EDITED to add two things:] 1. After some discussion with tailcalled, there’s a rewritten and hopefully clearer version of most of the above in a comment of mine further downthread. 2. I see that this one has received someone’s strong-downvote. For the avoidance of doubt, I’m not complaining; you should strong-downvote things that you think are very bad; but it would be much more useful information if I understood what the problem is, and right now I don’t. Did I make a mathematical error? Was my writing too unclear to be useful (e.g., because “probabilities” was a bad way to refer to 1/attacker_capacity, or because looking at things as a function of p = 1/attacker_capacity is needlessly confusing)? Did it feel as if I was trying to restate what tailcalled already said and take credit for it? Etc. Thanks!
A little more detail seems worth providing. If you have a pdf (over probabilities) f(p), which may or may not actually integrate to 1 or to any finite value, then the corresponding pdf for our measure of attacker-capability q=1/p is f(1/q)/q^2; the relevant CDF here is the integral from 0 to q of this.
Conversely, if we start with a given cdf F(q), its pdf is F’(q), and the corresponding pdf for p is F’(1/p)/p^2.
I don’t think I understand. What do you mean by a pdf over probabilities? Like I get how one can have a pdf over probabilities in e.g. a Laplace’s law of succession-like situation, but I’m not sure what probabilities you have a pdf over here.
I said “probabilities” but should really have said something like “reciprocals of numbers of trials”. The thing you called P, when you wrote things like “E[1/P]” and “E[log 1/P]”. This has the same “units” as probability; if you wanted to express it in explicitly probabilistic terms it would be something like “least probable password the attacker has the resources to try”.
It’s fair enough to call that probabilities; I guess that just means the thing I was really confused about is how you derive the CDF from the probabilities. That is, this part:
then the corresponding pdf for our measure of attacker-capability q=1/p is f(1/q)/q^2; the relevant CDF here is the integral from 0 to q of this.
Under my framework, the distribution of the number of guesses that an attacker is willing to go through is a separate independent parameter that one can choose. I don’t see where the f(1/q)/q^2 expression comes from.
It’s a separate independent parameter for me too. My intention (though of course I may have screwed up) is that I’m describing the same thing as you are :-).
So, suppose you have a particular probability distribution for the number of guesses an attacker has. I am describing that by saying what its pdf is as a function of 1/#guesses (the thing I’m calling p). I call this f(p). You’re describing it by saying what its cdf is as a function of #guesses (the thing you’re calling 1/P). You call this cdf(1/P). These two formulations are equivalent[1].
[1] Well, maybe there are some weird technical issues if the cdf is sufficiently horribly non-differentiable—a Cantor staircase or something of the sort.
Why start with f(p) rather than cdf(1/P)? No particular reason, I guess :-). It would probably have been cleaner just to define q=1/p (as I did) and then work with either the pdf or the cdf as a function of q.
So, let me try again—I am not saying anything different, or at least not intending to, just cleaning up my notation and language and so forth—and see whether you find the result (1) comprehensible and (2) compatible with what you said before. (Again, to be clear, I am claiming only to explicate what you already said, which I found excellent but a bit terse.)
A little more detail seems worth providing. Let’s say that an attacker has “resources R” if they are willing and able to crack passwords with probabilities at least 1/R. (Is this a realistic model? I think it is, provided R is fairly large, which in practice it usually will be, and provided attackers know the probability distribution we are using. If I am an attacker and know e.g. that passwords of type A are used 10x more often than passwords of type B, then I will use 10x more of my resources on type-A passwords. This falls down if I have enough resources that I literally can’t do that without repeating passwords, but for actually-realistic password distributions and attackers who are likely attacking multiple victims at once this is not a problem.)
So, we can describe the distribution of attackers in terms of the pdf f(R). In some sense, as described above, values of R are reciprocals of password-probabilities; the corresponding pdf over probabilities is f(1/p)/p^2 where p=1/R. If we start with the cdf F(R) instead, the pdf is F’(R), the pdf over probabilities is F’(1/p)/p^2, and of course the cdf for probabilities is just 1-F(1/p).
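As a sanity check of that change of variables, here is a small Monte Carlo sketch using the proper example F(R) = 1 - 1/R for R >= 1, for which the formula predicts a pdf over p of F’(1/p)/p^2 = p^2/p^2 = 1, i.e. p = 1/R should be uniform on (0, 1):

```python
import random

random.seed(0)
# Attacker-resource cdf F(R) = 1 - 1/R for R >= 1 (proper). Inverse-cdf
# sampling: u uniform on [0, 1) gives R = 1/(1 - u).
samples_R = [1 / (1 - random.random()) for _ in range(100_000)]
samples_p = [1 / R for R in samples_R]

# If the predicted pdf over p is the constant 1, each decile of (0, 1)
# should hold ~10% of the samples.
counts = [0] * 10
for p in samples_p:
    counts[min(int(p * 10), 9)] += 1
print([round(c / len(samples_p), 3) for c in counts])  # ~[0.1, 0.1, ..., 0.1]
```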
In each case, tailcalled’s prescription is to take E[cdf(1/p)], the expected fraction of attackers who fail to crack our password, as our figure of merit. The expectation is taken over our passwords with the probability distribution we are using (the probability distribution over passwords, not over attackers) so it equals the sum of p·F(1/p) where F is the cdf for attacker-resources.
These things I’ve called “pdfs” and “cdfs” may not actually integrate/tend to 1; we may leave them unnormalized, or indeed they may be improper (so that the integral of the pdf / limiting value of the cdf is infinite). What we are doing with these is just maximizing/minimizing some expectation involving them; rescaling the values of the pdf/cdf doesn’t change the result (so we needn’t care about normalization) and in many cases the expectation will converge even if the pdf/cdf is improper.
Let’s look more explicitly at the examples tailcalled mentioned. A uniform distribution for R, which of course is very improper indeed, corresponds to F(R)=R or f(R)=1. The corresponding pdf for p=1/R is 1/p^2. A uniform distribution for log R, still improper, corresponds to F(R) = log R or f(R) = 1/R. The corresponding pdf for p=1/R is 1/p. This is exactly the case in which expected utility = entropy.
(As tailcalled implied but didn’t quite say explicitly, the fact that these are improper is what enables the pathology described by OP: as R gets large the cdf → infinity, so a modest probability of picking extremely-low-probability passwords can give an arbitrarily large positive contribution to the expectation we’re trying to maximize, even if most of the time we do very badly.)
We looked at pdf(p) = 1/p^2 and pdf(p) = 1/p; let’s go one step further and take pdf(p)=1, corresponding to a uniform distribution over 1/R; cdf(R) = 1-1/R, and pdf(R) = 1/R^2. Our figure of merit is E[cdf(1/p)] = E[1-p] = 1-E[p], so the expected negative utility we get is (aside from irrelevant constant offsets and scalings) E[p] = sum(p^2).
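A quick numeric check of that identity on the coin-flip scheme (p0 = 1/2, N = 1000):

```python
# With cdf(R) = 1 - 1/R, E[cdf(1/p)] should equal 1 - sum(p^2).
probs = [0.5] + [0.5 / 1000] * 1000
cdf = lambda R: 1 - 1 / R
print(sum(p * cdf(1 / p) for p in probs))   # E[cdf(1/p)]  ~0.74975
print(1 - sum(p * p for p in probs))        # 1 - sum(p^2) ~0.74975
```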
(At this point, continuing from “How does this figure of demerit do …” in the original version of what I wrote seems fine, except that the final paragraph of what I already wrote is effectively incorporated above.)
So, suppose you have a particular probability distribution for the number of guesses an attacker has. I am describing that by saying what its pdf is as a function of 1/#guesses (the thing I’m calling p). I call this f(p). You’re describing it by saying what its cdf is as a function of #guesses (the thing you’re calling 1/P). You call this cdf(1/P). These two formulations are equivalent[1].
The place where I’m getting confused is:
In my model, the E is taking an expectation over different passwords. The P refers to the probability assigned to the selected password, not to anything related to the attacker. So 1/P refers to the number of guesses needed to crack the password.
My description of the distribution of attacker capabilities is just named cdf, not cdf(1/P); cdf(1/P) is the number of attackers who have the capabilities to crack a given password.
… actually I just reread the thread and I think I misunderstood your previous comment:
I said “probabilities” but should really have said something like “reciprocals of numbers of trials”. The thing you called P, when you wrote things like “E[1/P]” and “E[log 1/P]”. This has the same “units” as probability; if you wanted to express it in explicitly probabilistic terms it would be something like “least probable password the attacker has the resources to try”.
Since in my original comment, P is not the least probable password the attacker has the resources to try.
Maybe to clean up the math a bit, let’s restart with my formulation:
Let W be the distribution of passwords you are selecting from, and w∼W be a password. Let P_W be the prior pmf for W, let A be a distribution of adversary capabilities, and let cdf_A be the cdf for A. The utility of a password-selection method is then given by approximately:
U(W) = E_{w∼W}[cdf_A(1/P_W(w))]
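As a direct transcription into code (the concrete pmf and adversary cdf below are placeholders for illustration):

```python
def U(P_W, cdf_A):
    """Utility of the selection scheme: E_{w~W}[cdf_A(1 / P_W(w))].

    P_W   : dict mapping password -> selection probability (the pmf of W)
    cdf_A : cdf of adversary capabilities, as a function of a guess budget
    """
    return sum(p * cdf_A(1 / p) for p in P_W.values() if p > 0)

# Placeholder example: three passwords, adversary cdf_A(R) = 1 - 1/R.
print(U({"a": 0.5, "b": 0.25, "c": 0.25}, lambda R: 1 - 1 / R))  # 0.625
```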
The function is cdf. The way it’s used in the expected-utility calculation is that it’s applied to 1/p where p is the probability of a given password. My original use of the term “probability” for the reciprocal of the thing fed to the cdf function was needlessly confusing, which is why I dropped it in the rewrite.
since in my original comment, P is not the least probable password the attacker has the resources to try.
In your original comment, P is the probability of a particular password. (I say this just to confirm that I do, and did, understand that.)
But if we are going to explain what the cdf function actually is, we need to say something of the form “cdf(R) is the fraction—or, in the case of improper not-exactly-probability-distributions, something more like the total number—of attackers for whom …”. And I think the correct way to fill in that ”...” is something like “when they crack passwords, we expect that the least probable password they’re likely to crack has probability 1/R”. (Right?)
In other words, I’m trying to be more explicit about what “adversary capabilities” actually cashes out to, and I think that’s what it is.
Your more-explicit formalization of the calculation agrees with my understanding; to whatever extent you feel that what I’m describing is different from what you’re describing, I am pretty confident the cause is not that we have different understandings of the mathematics at that point. I think it’s you misunderstanding me / me communicating badly, not me misunderstanding you / you communicating badly.
(It is a lamentable misfeature of our language that—so far as I can tell—there is no good way to say “what’s going on here is that what A is trying to say is not what B is interpreting it as” that doesn’t tend to assign blame to one or other party. You have to call it misunderstanding (implicitly blaming B) or miscommunicating (implicitly blaming A). But it takes two to tango and communication failures often involve suboptimality at both ends, and even in cases where it doesn’t assigning/taking blame is often an irrelevant distraction.)
In your original comment, P is the probability of a particular password. (I say this just to confirm that I do, and did, understand that.)
Yes.
But if we are going to explain what the cdf function actually is, we need to say something of the form “cdf(R) is the fraction—or, in the case of improper not-exactly-probability-distributions, something more like the total number—of attackers for whom …”. And I think the correct way to fill in that ”...” is something like “when they crack passwords, we expect that the least probable password they’re likely to crack has probability 1/R”. (Right?)
Yes, something like that. I’d probably fill it in with “who will probably succeed at cracking passwords w where P(w) is less than or equal to 1/R”, but it’s a similar point.
(It is a lamentable misfeature of our language that—so far as I can tell—there is no good way to say “what’s going on here is that what A is trying to say is not what B is interpreting it as” that doesn’t tend to assign blame to one or other party. You have to call it misunderstanding (implicitly blaming B) or miscommunicating (implicitly blaming A). But it takes two to tango and communication failures often involve suboptimality at both ends, and even in cases where it doesn’t assigning/taking blame is often an irrelevant distraction.)
Yes.