[deleted] comments on MIRI’s Approach

[deleted]Jul 30, 2015, 3:38 PM
13 points
0

It appears you are making the problem unnecessarily difficult.

No, not really. In fact, I expect that given the right way of modelling, formal verification of learning systems up to epsilon-delta bounds (in the style of PAC-learning, for instance) should be quite doable. Why? Because, as mentioned regarding PAC learning, it’s the existing foundation for machine learning.

I do agree that this post reflects an “Old Computer Science” worldview, but to be fair, that’s not Nate’s personal fault, or MIRI’s organizational fault. It’s the fault of the entire subfield of AGI that still has not bloody learned the basic lessons of statistical machine learning: that real cognition just is about probably approximately correct statistical modelling.

So as you mention, for instance, there’s an immense amount of foundational theory behind modern neural networks. Hell, if I could find the paper showing that deep networks form a “funnel” in the model’s free-energy landscape—where local minima are concentrated in that funnel and all yield more-or-less as-good test error, while the global minimum reliably overfits—I’d be posting the link myself.

The problem with deep neural networks is not that they lack theoretical foundations. It’s that most of the people going “WOW SO COOL” at deep neural networks can’t be bothered to understand the theoretical foundations. The “deep learning cabal” of researchers (out of Toronto, IIRC), and the Switzerland Cabal of Schmidhuber-Hutter-and-Legg fame, all know damn well what they are doing on an analytical level.

(And to cheer for my favorite approach, the probabilistic programming cabal has even more analytical backing, since they can throw Bayesian statistics, traditional machine learning, and programming-languages theory at their problems.)

Sure, it does all require an unusual breadth of background knowledge, but they, this is how real science proceeds, people: shut up and read the textbooks and literature. Sorry, but if we (as in, this community) go around claiming that important problems can be tackled without background knowledge and active literature, or with as little as the “AGI” field seems to generate, then we are not being instrumentally rational. Period. Shut up and PhD.

Why not test safety long before the system is superintelligent?

Because that requires a way to state and demonstrate safety properties such that safety guarantees obtained with small amounts of resources remain strong when the system gets more resources. More on that below.

This again reflects the old ‘hard’ computer science worldview, and obsession with exact solutions.

If it seems really really really impossibly hard to solve a problem even with the ‘simplification’ of lots of computing power, perhaps the underlying assumptions are wrong. For example—perhaps using lots and lots of computing power makes the problem harder instead of easier.

You’re not really being fair to Nate here, but let’s be charitable to you: this is fundamentally a dispute between the heuristics-and-biases school of thought about cognition and the bounded/resource-rational school of thought.

In the heuristics-and-biases school of thought, the human mind uses heuristics or biases when it believes it doesn’t have the computing power on hand to use generally intelligent inference, or sometimes the general intelligence is even construed as an emergent computational behavior of an array of heuristics and biases that happened to get thrown together by evolution in the right way. Computationally, this is saying, “When we have enough resources that only asymptotic complexity matters, we use the Old Computer Science way of just running the damn algorithm that implements optimal behavior and optimal asymptotic complexity.” Trying to extend this approach into statistical inference gets you basic Bayesianism and AIXI, which appear to have nice “optimality” guarantees, but are computationally intractable and are only optimal up to the training data you give them.

In the bounded-rationality school of thought, computing power is considered a strictly (not asymptotically) finite resource, which must be exploited in an optimal way. I’ve seen a very nice paper on how thermodynamics actually yields a formal theory for how to do this. Cognition is then analyzed as a algorithmic ways to tractably build and evaluate models that deal well with the data. This approach yields increasingly fruitful analyses of such cognitive activities as causal learning, concept learning, and planning in arbitrary environments as probabilistic inference enriched with causal/logical structure.

In terms of LW posts, the former alternative is embodied in Eliezer’s Sequences, and the latter in jacob_cannell’s post on The Brain as a Universal Learning Machine and my book review of Plato’s Camera.

The kinds of steps needed to get both “AI” as such, and “Friendliness” as such, are substantively different in the “possible worlds” where the two different schools of thought apply. Or, perhaps, both are true in certain ways, and what we’re really talking about is just two different ways of building minds. Personally, I think the one true distinction is that Calude’s work on measuring nonhalting computations gives us a definitive way to deal with the kinds of self-reference scenarios that Old AGI’s “any finite computation” approach generates paradoxes in.

But time will tell and I am not a PhD, so everything I say should be taken with substantial sprinklings of salt. On the other hand, to wit, while you shouldn’t think for a second that I am one of them, I am certainly on the side of the PhDs.

(Nate: sorry for squabbling on your post. All these sorts of qualms with the research program were things I was going to bring up in person, in a much more constructive way. Still looking forward to meeting you in September!)
- jacob_cannell Jul 30, 2015, 7:34 PM
  11 points
  Parent
  
  The problem with deep neural networks is not that they lack theoretical foundations. It’s that most of the people going “WOW SO COOL” at deep neural networks can’t be bothered to understand the theoretical foundations. The “deep learning cabal” of researchers (out of Toronto, IIRC), and the Switzerland Cabal of Schmidhuber-Hutter-and-Legg fame, all know damn well what they are doing on an analytical level.
  
  This isn’t really a problem, because—as you point out—the formidable researchers all “know damn well what they are doing on an analytical level”.
  
  Thus the argument that there are people using DL without understanding it—and moreover that this is dangerous—is specious and weak because these people are not the ones actually likely to develop AGI let alone superintelligence.
  
  Why not test safety long before the system is superintelligent?
  
  Because that requires a way to state and demonstrate safety properties such that safety guarantees obtained with small amounts of resources remain strong when the system gets more resources. More on that below.
  
  Ah—the use of guarantees belies the viewpoint problem. Instead of thinking of ‘safety’ or ‘alignment’ as some absolute binary property we can guarantee, it is more profitable to think of a complex distribution over the relative amounts of ‘safety’ or ‘alignment’ in an AI population (and any realistic AI project will necessarily involve a population due to scaling constraints). Strong guarantees may be impossible, but we can at least influence or steer the distribution by selecting for agent types that are more safe/altruistic. We can develop a scaling theory of if, how, and when these desirable properties change as agents grow in capability.
  
  In other words—these issues are so incredibly complex that we can’t really develop any good kind of theory without alot of experimental data to back it up.
  
  Also—I should point out that one potential likely result of ANN based AGI is the creation of partial uploads through imitation and reverse reinforcement learning—agents which are intentionally close in mindspace to their human ‘parent’ or ‘model’.
  - [deleted]Jul 30, 2015, 11:24 PM
    3 points
    Parent
    
    Thus the argument that there are people using DL without understanding it—and moreover that this is dangerous—is specious and weak because these people are not the ones actually likely to develop AGI let alone superintelligence.
    
    Yes, but I don’t think that’s an argument anyone has actually made. Nobody, to my knowledge, sincerely believes that we are right around the corner from superintelligent, self-improving AGI built out of deep neural networks, such that any old machine-learning professor experimenting with how to get a lower error rate in classification tasks is going to suddenly get the Earth covered in paper-clips.
    
    Actually, no, I can think of one person who believed that: a radically underinformed layperson on reddit who, for some strange reason, believed that LessWrong is the only site with people doing “real AI” and that “[machine-learning researchers] build optimizers! They’ll destroy us all!”
    
    Hopefully he was messing with me. Nobody else has ever made such ridiculous claims.
    
    Sorry, wait, I’m forgetting to count sensationalistic journalists as people again. But that’s normal.
    
    Instead of thinking of ‘safety’ or ‘alignment’ as some absolute binary property we can guarantee, it is more profitable to think of a complex distribution over the relative amounts of ‘safety’ or ‘alignment’ in an AI population
    
    No, “guarantees” in this context meant PAC-style guarantees: “We guarantee that with probability 1-\delta, the system will only ‘go wrong’ from what its sample data taught it 1-\epsilon fraction of the time.” You then need to plug in the epsilons and deltas you want and solve for how much sample data you need to feed the learner. The links for intro PAC lectures in the other comment given to you were quite good, by the way, although I do recommend taking a rigorous introductory machine learning class (new grad-student level should be enough to inflict the PAC foundations on you).
    
    we can at least influence or steer the distribution by selecting for agent types that are more safe/altruistic
    
    “Altruistic” is already a social behavior, requiring the agent to have a theory of mind and care about the minds it believes it observes in its environment. It also assumes that we can build in some way to learn what the hypothesized minds want, learn how they (ie: human beings) think, and separate the map (of other minds) from the territory (of actual people).
    
    Note that “don’t disturb this system over there (eg: a human being) because you need to receive data from it untainted by your own causal intervention in any way” is a constraint that at least I, personally, do not know how to state in computational terms.
    - YVLIAZ Sep 7, 2015, 6:14 AM
      0 points
      Parent
      I think you are overhyping the PAC model. It surely is an important foundation for probabilistic guarantees in machine learning, but there are some serious limitations when you want to use it to constrain something like an AGI:
      
      It only deals with supervised learning
      
      Simple things like finite automata are not learnable, but in practice it seems like humans pick them up fairly easily.
      
      It doesn’t deal with temporal aspects of learning.
      
      However, there are some modification of the PAC model that can ameliorate these problems, like learning with membership queries (item 2).
      
      It’s also perhaps a bit optimistic to say that PAC-style bounds on a possibly very complex system like an AGI would be “quite doable”. We don’t even know, for example, whether DNF is learnable in polynomial time under the distribution free assumption.
      - [deleted]Sep 7, 2015, 2:06 PM
        0 points
        Parent
        I would definitely call it an open research problem to provide PAC-style bounds for more complicated hypothesis spaces and learning settings. But that doesn’t mean it’s impossible or un-doable, just that it’s an open research problem. I want a limitary theorem proved before I go calling things impossible.
- jacob_cannell Jul 30, 2015, 5:26 PM
  6 points
  Parent
  
  In fact, I expect that given the right way of modelling, formal verification of learning systems up to epsilon-delta bounds (in the style of PAC-learning, for instance) should be quite doable. Why?
  
  Dropping the ‘formal verification’ part and replacing it with approximate error bound variance reduction this is potentially interesting—although it also seems to be a general technique that would—if it worked well—be useful for practical training, safety aside.
  
  Why? Because, as mentioned regarding PAC learning, it’s the existing foundation for machine learning.
  
  Machine learning is an eclectic field with many mostly independent ‘foundations’ - bayesian statistics of course, optimization methods (hessian free, natural, etc), geometric methods and NLDR, statistical physics …
  
  That being said—I’m not very familiar with the PAC learning literature yet—do you have a link to a good intro/summary/review?
  
  Hell, if I could find the paper showing that deep networks form a “funnel” in the model’s free-energy landscape—where local minima are concentrated in that funnel and all yield more-or-less as-good test error, while the global minimum reliably overfits—I’d be posting the link myself.
  
  That sounds kind of like the saddle point paper. It’s easy to show that in complex networks there are a large number of equivalent minima due to various symmetries and redundancies. Thus finding the actual technical ‘global optimum’ quickly becomes suboptimal when you discount for resource costs.
  
  If it seems really really really impossibly hard to solve a problem even with the ‘simplification’ of lots of computing power, perhaps the underlying assumptions are wrong. For example—perhaps using lots and lots of computing power makes the problem harder instead of easier.
  
  You’re not really being fair to Nate here, but let’s be charitable to you: this is fundamentally a dispute between the heuristics-and-biases school of thought about cognition and the bounded/resource-rational school of thought.
  
  Yes that is the source of disagreement, but how am I not being fair? I said ‘perhaps’ - as in have you considered this? Not ‘here is why you are certainly wrong’.
  
  Computationally, this is saying, “When we have enough resources that only asymptotic complexity matters, we use the Old Computer Science way of just running the damn algorithm that implements optimal behavior and optimal asymptotic complexity.” Trying to extend this approach into statistical inference gets you basic Bayesianism and AIXI, which appear to have nice “optimality” guarantees, but are computationally intractable and are only optimal up to the training data you give them.
  
  Solonomoff/AIXI and more generally ‘full Bayesianism’ is useful as a thought model, but is perhaps over valued on this site compared to the machine learning field. Compare the number of references/hits to AIXI on this site (tons) to the number on r/MachineLearning (1!). Compare the number of references for AIXI papers (~100) to other ML papers and you will see that the ML community sees AIXI and related work as minor.
  
  The important question is what does the optimal practical approximation of Solonomoff/Bayesian look like? And how different is that from what the brain does? By optimal I of course I mean optimal in terms of all that really matters, which is intelligence per unit resources.
  
  Human intelligence—including that of Turing or Einstein, only requires 10 watts of energy and more surprisingly only around 10^14 switches/second or less—which is basically miraculous. A modern GPU uses more than 10^18 switches/second. You’d have to go back to a pentium or something to get down to 10^14 switches per second. Of course the difference is that switch events in an ANN are much more powerful because they are more like memory ops, but still.
  
  It is really really hard to make any sort of case that actual computer tech is going to become significantly more efficient than the brain anytime in the near future (at least in terms of switch events/second). There is a very strong case that all the H&B stuff is just what actual practical intelligence looks like. There is no such thing as intelligence that is not resource efficient—or alternatively we could say that any useful definition of intelligence must be resource normalized (ie utility/cost).
  - LawrenceC Jul 30, 2015, 9:23 PM
    5 points
    Parent
    I’m not sure what you’re looking for in terms of the PAC-learning summary, but for a quick intro, there’s this set of slides or these two lectures notes from Scott Aaronson. For a more detailed review of the literature in all the field up until the mid 1990s, there’s this paper by David Haussler, though given its length you might as well read up Kearns and Vazirani’s 1994 textbook on the subject. I haven’t been able to find a more recent review of the literature though—if anyone had a link that’d be great.
  - [deleted]Jul 30, 2015, 11:53 PM
    3 points
    0
    Parent
    
    Human intelligence—including that of Turing or Einstein, only requires 10 watts of energy and more surprisingly only around 10^14 switches/second or less—which is basically miraculous. A modern GPU uses more than 10^18 switches/second. You’d have to go back to a pentium or something to get down to 10^14 switches per second. Of course the difference is that switch events in an ANN are much more powerful because they are more like memory ops, but still.
    
    It’s not that amazing when you understand PAC-learning or Markov processes well. A natively probabilistic (analogously: “natively neuromorphic”) computer can actually afford to sacrifice precision “cheaply”, in the sense that sizeable sacrifices of hardware precision actually entail fairly small injections of entropy into the distribution being modelled. Since what costs all that energy in modern computers is precision, that is, exactitude, a machine that simply expects to get things a little wrong all the time can still actually perform well, provided it is performing a fundamentally statistical task in the first place—which a mind is!
    - jacob_cannell Jul 31, 2015, 2:01 AM
      1 point
      Parent
      Eli this doesn’t make sense—the fact that digital logic switches are higher precision and more powerful and thus require more minimal energy makes the brain/mind more impressive, not less.
      
      The energy efficiency per op in the brain is rather poor in one sense—perhaps 10^5 larger than the minimum imposed by physics for a low SNR analog op, but essentially all of this cost is wire energy.
      
      The miraculous thing is how much intelligence the brain/mind achieves for such a tiny amount of computation in terms of low level equivalent bit ops/second. It suggests that brain-like ANNs will absolutely dominate the long term future of AI.
      - [deleted]Jul 31, 2015, 4:00 AM
        1 point
        Parent
        
        Eli this doesn’t make sense—the fact that digital logic switches are higher precision and more powerful and thus require more minimal energy makes the brain/mind more impressive, not less.
        
        Nuh-uh :-p. The issue is that the brain’s calculations are probabilistic. When doing probabilistic calculations, you can either use very, very precise representations of computable real numbers to represent the probabilities, or you can use various lower-precision but natively stochastic representations, whose distribution over computation outcomes is the distribution being inferred.
        
        Hence why the brain is, on the one hand, very impressive for extracting inferential power from energy and mass, but on the other hand, “not that amazing” in the sense that it, too, begins to add up to normality once you learn a little about how it works.
        jacob_cannell Jul 31, 2015, 4:10 AM
        1 point
        Parent
        
        When doing probabilistic calculations, you can either use very, very precise representations of computable real numbers to represent the probabilities, or you can use various lower-precision but natively stochastic representations, whose distribution over computation outcomes is the distribution being inferred.
        
        Of course—and using say a flop to implement a low precision synaptic op is inefficient by six orders of magnitude or so—but this just strengthens my point. Neuromorphic brain-like AGI thus has huge potential performance improvement to look forward to, even without Moore’s Law.
        [deleted]Jul 31, 2015, 4:17 AM
        1 point
        Parent
        
        Neuromorphic brain-like AGI thus has huge potential performance improvement to look forward to, even without Moore’s Law.
        
        Yes, if you could but dissolve your concept of “brain-like”/”neuromorphic” into actual principles about what calculations different neural nets embody.
  - V_V Jul 31, 2015, 8:32 AM
    2 points
    Parent
    
    Human intelligence—including that of Turing or Einstein, only requires 10 watts of energy and more surprisingly only around 10^14 switches/second or less—which is basically miraculous. A modern GPU uses more than 10^18 switches/second.
    
    I don’t think that “switches” per second is a relevant metric here. The computation performed by a single neuron in a single firing cycle is much more complex than the computation performed by a logic gate in a single switching cycle.
    
    The amount of computational power required to simulate a human brain in real time is estimated in the petaflops range. Only the largest supercomputer operate in that range, certainly not common GPUs.
    - jacob_cannell Jul 31, 2015, 4:29 PM
      0 points
      Parent
      You misunderstood me—the biological switch events I was referring to are synaptic ops, and they are comparable to transistor/gate switch ops in terms of minimum fundemental energy cost in Landauer analysis.
      
      The amount of computational power required to simulate a human brain in real time is estimated in the petaflops range.
      
      That is a tad too high, the more accurate figure is 10^14 ops/second (10^14 synapses * avg 1 hz spike rate). The minimal computation required to simulate a single GPU in real time is 10,000 times higher.
      - V_V Aug 1, 2015, 6:29 AM
        1 point
        Parent
        
        That is a tad too high, the more accurate figure is 10^14 ops/second (10^14 synapses * avg 1 hz spike rate).
        
        I’ve seen various people give estimates in the order of 10^16 flops by considering the maximum firing rate of a typical neuron (~10^2 Hz) rather than the average firing rate, as you do.
        
        On one hand, a neuron must do some computation whether it fires or not, and a “naive” simulation would necessarily use a cycle frequency of the order of 10^2 Hz or more, on the other hand, if the result of a computation is almost always “do not fire”, then as a random variable the result has little information entropy and this may perhaps be exploited to optimize the computation. I don’t have a strong intuition about this.
        
        The minimal computation required to simulate a single GPU in real time is 10,000 times higher.
        
        On a traditional CPU perhaps, on another GPU I don’t think so.
- V_V Aug 1, 2015, 6:54 AM
  1 point
  Parent
  
  This approach yields increasingly fruitful analyses of such cognitive activities as causal learning, concept learning, and planning in arbitrary environments as probabilistic inference enriched with causal/logical structure.
  
  It’s not obvious to me that the Church programming language and execution model is based on bounded rationality theory.
  
  I mean, the idea of using MCMC to sample the executions of probabilistic programs is certainly neat, and you can trade off bias with computing time by varying the burn-in and samples lag parameters, but this trade-off is not provably optimal.
  
  If I understand correctly, provably optimal bounded rationality is marred by unsolved theoretical questions such as the one-way functions conjecture and P != NP. Even assuming that these conjectures are true, the fact that we can’t prove them implies that we can’t often prove anything interesting about the optimality of many AI algorithms.
  - [deleted]Aug 3, 2015, 3:40 AM
    0 points
    Parent
    
    It’s not obvious to me that the Church programming language and execution model is based on bounded rationality theory.
    
    That’s because it’s not. The probabilistic models of cognition (title drop!) implemented using Church tend to deal with what the authors call the resource-rational school of thought about cognition.
    
    If I understand correctly, provably optimal bounded rationality is marred by unsolved theoretical questions such as the one-way functions conjecture and P != NP.
    
    The paper about it that I read was actually using statistical thermodynamics to form its theory of bounded-optimal inference. These conjectures are irrelevant, in that we would be building reasoning systems that would make use of their own knowledge about these facts, such as it might be.
    - V_V Aug 9, 2015, 7:42 PM
      −1 points
      Parent
      
      The paper about it that I read was actually using statistical thermodynamics to form its theory of bounded-optimal inference.
      
      Sounds interesting, do you have a reference?
      - [deleted]Aug 9, 2015, 7:52 PM
        1 point
        Parent
        Sure. If you know statistical mechanics/thermodynamics, I’d be happy to hear your view on the paper, since I don’t know those fields.
        V_V Aug 11, 2015, 12:40 PM
        −1 points
        Parent
        Thanks, I’ll read it, though I’m not an expert in statistical mechanics and thermodynamics.