jsteinhardt comments on A Fervent Defense of Frequentist Statistics

jsteinhardt 19 Feb 2014 7:06 UTC
13 points
Eliezer,

The point of Bayesianism is to provide a coherent background epistemology which underlies everything; when a frequentist algorithm works, there’s supposed to be a Bayesian explanation of why it works. I have said this before many times but it seems to be a “resistant concept” which simply cannot sink in for many people.

First, I object to the labeling of Bayesian explanations as a “resistant concept”. I think it’s not only uncharitable but also wrong. I started out with exactly the viewpoint that everything should be explained in terms of Bayes (see one of my earliest and most-viewed blog posts if you don’t believe me). I moved away from this viewpoint slowly as the result of accumulated evidence that this is not the most productive lens through which to view the world.

More to the point: why is it that you think that everything should have a Bayesian explanation? One of the most-cited reasons why Bayes should be an empistemic ideal is the various “optimality” / Dutch book theorems, which I’ve already argued against in this post. Do you accept the rebuttals I gave, or disagree with them?

My guess is that you would still be in favor of Bayes as a normative standard of epistemology even if you rejected Dutch book arguments, and the reason why you like it is because you feel like it has been useful for solving a large number of problems. But frequentist statistics (not to mention pretty much any successful paradigm) has also been useful for solving a large number of problems, some of which Bayesian statistics cannot solve, as I have demonstrated in this post. The mere fact that a tool is extremely useful does not mean that it should be elevated to a universal normative standard.

but found that the exact original problem specified may be NP-hard according to Wikipedia, much as my instincts said it should be

We’ve already discussed this in one of the other threads, but I’ll just repeat here that this isn’t correct. With overwhelmingly high probability a Gaussian matrix will satisfy the restricted isometry property, which implies that appropriately L1-regularized least squares will return the exact solution.

I could go on about how for any given solution I can compute its Bayesian likelihood assuming Gaussian noise, and so again Bayes functions well as a background epistemology

The point of this example was to give a problem that, from a modeling perspective, was as convenient for Bayes as possible, but that was computationally intractable to solve using Bayesian techniques. I gave other examples (such as in Myth 5) that demonstrate situations where Bayes breaks down. And I argued indirectly in Myths 1, 4, and 8 that the prior is actually a pretty big deal and has the capacity to cause problems in ways that frequentists have ways of dealing with.

I should very much like to see explained concretely how Jacob’s favorite algorithm would handle the case of “You have a self-improving AI which turns out to maximize smiles, in all previous cases it produced smiles by making people happy, but once it became smart enough it realized that it ought to preserve your bad generalization and faked its evidence, and now that it has nanotech it’s going to tile the universe with tiny smileyfaces.”

I think this is a very bad testing ground for how good a technique is, because it’s impossible to say whether something would solve this problem without going through a lot of hand-waving. I think your “notion of how to solve it” is interesting but has a lot of details to fill in, and it’s extremely unclear how it would work, especially given that even for concrete problems that people work on now, an issue with Bayesian methods is overconfidence in a particular model. I should also note that, as we’ve registered earlier, I don’t think that what you call the Context Change Problem is actually a problem that an intelligent agent would face: any agent that is intelligent enough to behave at all functionally close to the level of a human would be robust to context changes.

However, even given all these caveats, I’ll still try to answer your question on your own terms. Short answer: do online learning with an additional action called “query programmer” that is guaranteed to always have some small negative utility, say −0.001, that is enough to outweigh any non-trivial amount of uncertainty but will eventually encourage the AI to act autonomously. We would need some way of upper-bounding the regret of other possible actions, and of incorporating this utility constraint into the algorithm, but I don’t think the amount of fleshing out is any more or less than that required by your proposal.

[WARNING: The rest of this comment is mostly meaningless rambling.]

I want to stress again that the above paragraph is only a (sketch of) an answer to the question as you posed it. But I’d rather sidestep the question completely and say something like: “OK, if we make literally no assumptions, then we’re completely screwed, because moving any speck of dust might cause the universe to explode. Being Bayesian doesn’t make this issue go away, it just ignores it.

So, what assumptions can we be reasonably okay with making that would help us solve the problem? Maybe I’d be okay assuming that the mechanism that takes in my past actions and returns a utility is a Turing machine of description length less than 10^15. But unfortunately that doesn’t help me much, because for every Turing machine M, there’s one of not that much longer description length that behaves identically to M up until I’m about to make my current decision, and then penalizes my current decision with some extraordinary large amount of disutility. Note that, again, being Bayesian doesn’t deal with this issue, it just assigns it low prior probability.

I think the question of exactly what assumptions one would be willing to make, that would allow one to confidently reason about actions with potentially extremely discontinuous effects, is an important and interesting one, and I think one of the drawbacks of “thinking like a Bayesian” is that it draws attention away from this issue by treating it as mostly solved (via assigning a prior).”
- Eliezer Yudkowsky 19 Feb 2014 18:08 UTC
  36 points
  Parent
  
  My guess is that you would still be in favor of Bayes as a normative standard of epistemology even if you rejected Dutch book arguments, and the reason why you like it is because you feel like it has been useful for solving a large number of problems.
  
  Um, nope. What it would really take to change my mind about Bayes is seeing a refutation of Dutch Book and Cox’s Theorem and Von Neumann-Morgenstern and the complete class theorem , combined with seeing some alternative epistemology (e.g. Dempster-Shafer) not turn out to completely blow up when subjected to the same kind of scrutiny as Bayesianism (the way DS brackets almost immediately go to [0-1] and fuzzy logic turned out to be useless etc.)
  
  Neural nets have been useful for solving a large number of problems. It doesn’t make them good epistemology. It doesn’t make them a plausible candidate for “Yes, this is how you need to organize your thinking about your AI’s thinking and if you don’t your AI will explode”.
  
  some of which Bayesian statistics cannot solve, as I have demonstrated in this post.
  
  I am afraid that your demonstration was not stated sufficiently precisely for me to criticize. This seems like the sort of thing for which there ought to be a standard reference, if there were such a thing as a well-known problem which Bayesian epistemology could not handle. For example, we have well-known critiques and literature claiming that nonconglomerability is a problem for Bayesianism, and we have a chapter of Jaynes which neatly shows that they all arise from misuse of limits on infinite problems. Is there a corresponding literature for your alleged reductio of Bayesianism which I can consult? Now, I am a great believer in civilizational inadequacy and the fact that the incompetence of academia is increasing, so perhaps if this problem was recently invented there is no more literature about it. I don’t want to be a hypocrite about the fact that sometimes something is true and nobody has written it up anyway, heaven knows that’s true all the time in my world. But the fact remains that I am accustomed to somewhat more detailed math when it comes to providing an alleged reductio of the standard edifice of decision theory. I know your time is limited, but the real fact is that I really do need more detail to think that I’ve seen a criticism and be convinced that no response to that criticism exists. Should your flat assertion that Bayesian methods can’t handle something and fall flat so badly as to constitute a critique of Bayesian epistemology, be something that I find convincing?
  
  We’ve already discussed this in one of the other threads, but I’ll just repeat here that this isn’t correct. With overwhelmingly high probability a Gaussian matrix will satisfy the restricted isometry property, which implies that appropriately L1-regularized least squares will return the exact solution.
  
  Okay. Though I note that you haven’t actually said that my intuitions (and/or my reading of Wikipedia) were wrong; many NP-hard problems will be easy to solve for a randomly generated case.
  
  Anyway, suppose a standard L1-penalty algorithm solves a random case of this problem. Why do you think that’s a reductio of Bayesian epistemology? Because the randomly generated weights mean that a Bayesian viewpoint says the credibility is going as the L2 norm on the non-zero weights, but we used an L1 algorithm to find which weights were non-zero? I am unable to parse this into the justifications I am accustomed to hearing for rejecting an epistemology. It seems like you’re saying that one algorithm is more effective at finding the maximum of a Bayesian probability landscape than another algorithm; in a case where we both agree that the unbounded form of the Bayesian algorithm would work.
  
  What destroys an epistemology’s credibility is a case where even in the limit of unbounded computing power and well-calibrated prior knowledge, a set of rules just returns the wrong answer. The inherent subjectivity of p-values as described in http://lesswrong.com/lw/1gc/frequentist_statistics_are_frequently_subjective/ is not something you can make go away with a better-calibrated prior, correct use of limits, or unlimited computing power; it’s the result of bad epistemology. This is the kind of smoking gun it would take to make me stop yammering about probability theory and Bayes’s rule. Showing me algorithms which don’t on the surface seem Bayesian but find good points on a Bayesian fitness landscape isn’t going to cut it!
  - jsteinhardt 20 Feb 2014 4:16 UTC
    6 points
    Parent
    Eliezer, I included a criticism of both complete class and Dutch book right at the very beginning, in Myth 1. If you find them unsatisfactory, can you at least indicate why?
    - Eliezer Yudkowsky 20 Feb 2014 4:34 UTC
      17 points
      Parent
      Your criticism of Dutch Book is that it doesn’t seem to you useful to add anti-Dutch-book checkers to your toolbox. My support of Dutch Book is that if something inherently produces Dutch Books then it can’t be the right epistemological principle because clearly some of its answers must be wrong even in the limit of well-calibrated prior knowledge and unbounded computing power.
      
      The complete class theorem I understand least of the set, and it’s probably not very much entwined with my true rejection so it would be logically rude to lead you on here. Again, though, the point that every local optimum is Bayesian tells us something about non-Bayesian rules producing intrinsically wrong answers. If I believed your criticism, I think it would be forceful; I could accept a world in which for every pair of a rational plan with a world, there is an irrational plan which does better in that world, but no plausible way for a cognitive algorithm to output that irrational plan—the plans which are equivalent of “Just buy the winning lottery ticket, and you’ll make more money!” I can imagine being shown that the complete class theorem demonstrates only an “unfair” superiority of this sort, and that only frequentist methods can produce actual outputs for realistic situations even in the limit of unbounded computing power. But I do not believe that you have leveled such a criticism. And it doesn’t square very much with my current understanding that the decision rules being considered are computable rules from observations to actions. You didn’t actually tell me about a frequentist algorithm which is supposed to be realistic and show why the Bayesian rule which beats it is beating it unfairly.
      
      If you want to hit me square in the true rejection I suggest starting with VNM. The fact that our epistemology has to plug into our actions is one reason why I roll my eyes at the likes of Dempster-Shafer or frequentist confidence intervals that don’t convert to credibility distributions.
      - Luke_A_Somers 21 Feb 2014 22:24 UTC
        1 point
        Parent
        
        I could accept a world in which for every pair of a rational plan with a world, there is an irrational plan which does better in that world, but no plausible way for a cognitive algorithm to output that irrational plan
        
        We already live in that world.
        
        (The following is not evidence, just an illustrative analogy) Ever seen Groundhog Day? Imagine him skipping the bulk of the movie and going straight to the last day. It is straight wall to wall WTF but it’s very optimal.
      - jsteinhardt 20 Feb 2014 4:44 UTC
        1 point
        Parent
        One of the criticisms I raised is that merely being able to point to all the local optima is not a particularly impressive property of an epistemological theory. Many of those local optima will be horrible! (My criticism of VNM is essentially the same.)
        
        Many frequentist methods, such as minimax, also provide local optima, but they provide local optima which actually have certain nice properties. And minimax provides a complete decision rule, not just a probability distribution, so it plugs directly into actions.
  - thomascolthurst 12 Mar 2014 15:05 UTC
    3 points
    Parent
    FYI, there are published counterexamples to Cox’s theorem. See for example Joseph Halpern’s at http://arxiv.org/pdf/1105.5450.pdf.
    - Vaniver 12 Mar 2014 17:32 UTC
      0 points
      Parent
      You need to not include the period in your link, like so.
- Wei Dai 1 Jul 2015 23:35 UTC
  1 point
  Parent
  
  Short answer: do online learning with an additional action called “query programmer” that is guaranteed to always have some small negative utility, say −0.001, that is enough to outweigh any non-trivial amount of uncertainty but will eventually encourage the AI to act autonomously.
  
  This short answer is too short for me to understand, unfortunately. Do you think there is enough potential merit in this idea to try to understand it better or further develop it? (I’ve been learning about online learning recently in an effort to understand/evaluate Paul Christiano’s recent “AI control” ideas. If you have your own ideas also based on online learning, I’d love to try to understand them while the online learning stuff is fresh in my mind.)
  - John_Maxwell 6 Aug 2018 7:31 UTC
    2 points
    Parent
    Here is a control idea based on online learning—I think I independently generated something similar to what Jacob describes.
- Vaniver 19 Feb 2014 18:03 UTC
  1 point
  Parent
  
  We’ve already discussed this in one of the other threads, but I’ll just repeat here that this isn’t correct. With overwhelmingly high probability a Gaussian matrix will satisfy the restricted isometry property, which implies that appropriately L1-regularized least squares will return the exact solution.
  
  I do wonder if it would have been better to include something along the lines of “with probability 1” to the claim that non-Bayesian methods can solve it easily. Compressed sensing isn’t magic, even though it’s very close.
  
  any agent that is intelligent enough to behave at all functionally close to the level of a human would be robust to context changes.
  
  Humans get tripped up by context changes very frequently. It’s not obvious to me where you think this robustness would come from.
  - Dentin 19 Feb 2014 21:14 UTC
    −1 points
    Parent
    Compressed sensing isn’t even magic, if you’re halfway versed in signal processing. I understood compressed sensing within 30 seconds of hearing a general overview of it, and there are many related analogs in many fields.
    - Vaniver 19 Feb 2014 21:40 UTC
      0 points
      Parent
      
      Compressed sensing isn’t even magic
      
      The convex optimization guys I know are all rather impressed by compressed sensing- but that may be because they specialize in doing L1 and L2 problems, and so compressed sensing makes the things they’re good at even more important.