Thanks again, Jacob. I don’t have time to reply to all of this, but let me reply to one part:
Once one acknowledges that the bit-exact ‘best’ solution either does not exist or cannot be found, then there is an enormous (infinite, really) space of potential solutions which have different tradeoffs in their expected utility in different scenarios/environments, along with different cost structures. The most interesting solutions are often so complex that they are too difficult to analyze formally.
I don’t buy this. Consider the “expert systems” of the seventies, which used curated databases of logical sentences and reasoned from those using a whole lot of ad-hoc rules. Their designers could just as easily have said “Well, we need to build systems that deal with lots of special cases, and you can never be certain about the world. We cannot get exact solutions, and so we are doomed to the zone of heuristics and tradeoffs where the only interesting solutions are too complex to analyze formally.” But they would have been wrong. There were tools and concepts and data structures that they were missing. Judea Pearl (and a whole host of others) showed up, formalized probabilistic graphical models, related them to Bayesian inference, and suddenly a whole class of ad-hoc solutions were superseded.
So I don’t buy that “we can’t get exact solutions” implies “we’re consigned to complex heuristics.” People were using complicated ad-hoc rules to approximate logic, and then later they were using complex heuristics to approximate Bayesian inference, and this was progress.
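To make the Pearl example concrete, here is a minimal sketch of what “relating a graphical model to Bayesian inference” buys you: once the joint distribution is factorised according to a graph, a query can be answered mechanically rather than by ad-hoc rules. The network, numbers, and variable names below are invented for illustration, and real systems use far more efficient algorithms than brute-force enumeration.

```python
# Minimal sketch: exact inference in a tiny Bayesian network by enumeration.
# The graph (Rain -> Sprinkler, Rain -> WetGrass <- Sprinkler) and all the
# probabilities are made up for illustration.
from itertools import product

P_rain = {True: 0.2, False: 0.8}
P_sprinkler_given_rain = {True: {True: 0.01, False: 0.99},
                          False: {True: 0.4, False: 0.6}}
P_wet_given = {  # P(WetGrass=True | Sprinkler, Rain)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.0,
}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet), factorised according to the graph."""
    p_wet = P_wet_given[(sprinkler, rain)]
    return (P_rain[rain]
            * P_sprinkler_given_rain[rain][sprinkler]
            * (p_wet if wet else 1 - p_wet))

# Query: P(Rain = True | WetGrass = True), summing out the sprinkler.
numerator = sum(joint(True, s, True) for s in (True, False))
evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(numerator / evidence)  # ~0.36 with these made-up numbers
```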
My claim is that there are other steps such as those that haven’t been made yet, that there are tools on the order of “causal graphical models” that we are missing.
Imagine encountering a programmer from the future who knows how to program an AGI and asking them “How do you do that whole multi-level world-modeling thing? Can you show me the algorithm?” I strongly expect that they’d say something along the lines of “oh, well, you set up a system like this and then have it take percepts like that, and then you can see how if we run this for a while on lots of data it starts building multi-level descriptions of the universe. Here, let me walk you through what it looks like for the system to discover general relativity.”
Since I don’t know of a way to set up a system such that it would knowably and reliably start modeling the universe in this sense, I suspect that we’re missing some tools.
I’m not sure whether your view is of the form “actually the programmer of the future would say ‘I don’t know how it’s building a model of the world either, it’s just a big neural net that I trained for a long time’”, or whether it’s of the form “actually we do know how to set up that system already”, or whether it’s something else entirely. But if it’s the second one, then please tell! :-)
My claim is that there are other steps such as those that haven’t been made yet, that there are tools on the order of “causal graphical models” that we are missing.
I thought you hired Jessica for exactly that. I even have slides that I was sad I wouldn’t get to show you, because I figured you’d already know all about probabilistic programming after hiring Jessica.
Thanks for the clarifications—I’ll make this short.
Judea Pearl (and a whole host of others) showed up, formalized probabilistic graphical models, related them to Bayesian inference, and suddenly a whole class of ad-hoc solutions were superseded.
Probabilistic graphical models were definitely a key theoretical development, but they hardly swept the field of expert systems. From what I remember, in terms of practical applications they immediately replaced or supplemented expert systems in only a few domains, such as medical diagnostic systems. Complex ad-hoc expert systems continued to dominate unchallenged in most fields for decades: robotics, computer vision, speech recognition, game AI, fighter jets, and so on—basically everything important. As far as I am aware, the current ANN revolution is truly unique in that it is finally replacing expert systems across most of the board—although there are still holdouts (as far as I know most robotic controllers are still expert systems, as are fighter jets, and most Go AI systems).
The ANN solutions are more complex than the manually crafted expert systems they replace—but the complexity is automatically generated. The code the developers actually need to implement and manage is vastly simpler—this is the great power and promise of machine learning.
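As a generic illustration of that asymmetry (not any particular production system): the code a developer writes and maintains can be a dozen lines, while the learned object it produces contains tens of thousands of automatically generated parameters. The model, data, and sizes below are arbitrary.

```python
# Generic sketch: a few lines of hand-written code produce a model whose
# complexity (~92,000 learned weights here) is generated automatically.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=100, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=50, random_state=0)
model.fit(X, y)  # essentially all the developer-maintained "logic" there is

n_params = sum(w.size for w in model.coefs_) + sum(b.size for b in model.intercepts_)
print(f"developer code: ~10 lines; learned parameters: {n_params}")
```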
Here is a simple general truth—the Occam simplicity prior does imply that simpler hypotheses/models are more likely, but for any simple model there is an infinite family of approximations to that model of escalating complexity. Thus more efficient approximations naturally tend to have greater code complexity, even though they approximate a much simpler model.
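In symbols (a paraphrase of the point above; the notation is chosen here, not taken from the original comment): if ℓ(M) is the description length of a model M, an Occam-style prior weights models as

```latex
% Occam-style simplicity prior: weight falls off exponentially in description length.
P(M) \propto 2^{-\ell(M)}

% A single short model M can have a family of ever-faster approximations
% \hat{M}_1, \hat{M}_2, \ldots whose own description lengths grow without bound:
\ell(\hat{M}_1) < \ell(\hat{M}_2) < \cdots
```

so the prior rewards the simplicity of M itself, not of the increasingly elaborate code we actually run to approximate it.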
My claim is that there are other steps such as those that haven’t been made yet, that there are tools on the order of “causal graphical models” that we are missing.
Well, that would be interesting.
I’m not sure whether your view is of the form “actually the programmer of the future would say ‘I don’t know how it’s building a model of the world either, it’s just a big neural net that I trained for a long time’”, or whether it’s of the form “actually we do know how to set up that system [multi-level model] already”, or whether it’s something else entirely. But if it’s the second one, then by all means, please tell :-)
Anyone who has spent serious time working in graphics has also spent serious time thinking about how to create the Matrix—if given enough computing power. If you took, say, a thousand of the brightest engineers across the various simulation-related fields, from physics to graphics, and got them all working on a large megaproject with huge funds, it could probably be implemented today. You’d start with a hierarchical/multi-resolution modelling graph—using, say, octrees or kd-trees over voxel cells—and a general set of hierarchical bidirectional inference operators for tracing paths and interactions.
To make it efficient, you need a huge army of local approximation models for different phenomena at different scales—low-level quantum codes just in case, particle-level codes, molecular-bio codes, fluid dynamics, rigid body, and so on. It’s a sea of codes, with decision-tree-like code to decide which models to use where and when.
Of course with machine learning we could automatically learn most of those codes—which suddenly makes it more tractable. And then you could use that big engine as your predictive world model, once it was trained.
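Here is a bare-bones sketch of the kind of structure being described—an octree of voxel cells where each cell is advanced by whichever local approximation model its scale and contents call for. The class names, solver stubs, and dispatch rule are all invented for illustration; a real engine would be enormously more involved.

```python
# Illustrative sketch only: a multi-resolution octree world model with a
# "sea of codes" of local solvers and decision-tree-like dispatch between them.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, float]  # e.g. {"density": ..., "temperature": ...}

@dataclass
class OctreeNode:
    center: Tuple[float, float, float]
    size: float                                     # edge length of the cubic cell
    state: State
    children: Optional[List["OctreeNode"]] = None   # 8 children once refined

    def refine(self) -> None:
        """Split this cell into 8 sub-cells (multi-resolution refinement)."""
        half, q = self.size / 2, self.size / 4
        self.children = [
            OctreeNode((self.center[0] + dx, self.center[1] + dy, self.center[2] + dz),
                       half, dict(self.state))
            for dx in (-q, q) for dy in (-q, q) for dz in (-q, q)
        ]

# Local approximation models for different phenomena/scales (stubs here).
def rigid_body_step(state: State, dt: float) -> State: return state
def fluid_step(state: State, dt: float) -> State: return state
def particle_step(state: State, dt: float) -> State: return state

MODELS: Dict[str, Callable[[State, float], State]] = {
    "rigid_body": rigid_body_step, "fluid": fluid_step, "particle": particle_step,
}

def choose_model(node: OctreeNode) -> str:
    """Decision-tree-like dispatch: pick a solver by cell scale and contents."""
    if node.size < 1e-3:
        return "particle"
    return "fluid" if node.state.get("density", 0.0) < 1.0 else "rigid_body"

def step(node: OctreeNode, dt: float) -> None:
    """Advance the world model one tick, recursing into refined cells."""
    if node.children:
        for child in node.children:
            step(child, dt)
    else:
        node.state = MODELS[choose_model(node)](node.state, dt)
```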
The problem is that to plan anything worthwhile you need to simulate human minds reasonably well, which means that to be useful the sim engine would basically need to infer copies of everyone’s minds...
And if you can do that, then you already have brain-based AGI!
So I expect that the programmer from the future will say: yes, at the low level we use various brain-like neural nets, and various non-brain-like neural nets or learned virtual circuits, some operating over explicit space-time graphs. In all cases we have pretty detailed knowledge of what the circuits are doing—here, take a look at that last goal update that just propagated in your left anterior prefrontal cortex...
While the methods currently used in Machine Learning to find a solution to a well-formed problem are relatively well understood, the solutions they find are not.
And that is what really matters from a safety perspective. We can and do make some headway in understanding the solutions, as well, but the trend is towards more autonomy for the learning algorithm, and correspondingly more opaqueness.
As you mentioned, the solutions found are extremely complex. So I don’t think it makes sense to view them only in terms of approximations to some conceptually simple (but expensive) ideal solution.
If we want to understand their behaviour, which is what actually matters for safety, we will have to grapple with this complexity somehow.
Personally, I’m not optimistic about experimentation (as it is currently practiced in the ML community) being a good enough solution. There is, at least, the problem of the treacherous turn. If we’re lucky, the AI jumps the gun, and society wakes up to the possibility of an AI trying to take over. If we’re unlucky, we don’t get any warning, and the AI only behaves for long enough to gain our trust and discover a nearly fail-proof strategy. VR could help here, but I think it’s rather far from a complete solution.
I just want to point out some nuances.
1) The divide between your so-called “old CS” and “new CS” is more of a divide (or perhaps a continuum) between engineers and theorists. The former are concerned with on-the-ground systems, where quadratic-time algorithms are costly and statistics is the better weapon for dealing with real-world complexities. The latter are concerned with abstracted models where polynomial time is good enough and logical deduction is the only tool. These models will probably never be applied literally by engineers, but they provide human understanding of engineering problems, and because of their generality, they will last longer. The idea of a Turing machine will last centuries if not millennia, but a Pascal programmer might not find a job today and a Python programmer might not find a job in 20 years. Machine learning techniques constantly come in and out of vogue, but something like the PAC model will be here to stay for a long time. But of course, at the end of the day it’s engineers who realize new inventions and technologies.
Theorists’ ideas can transform an entire engineering field, and engineering problems inspire new theories. We need both types of people (or rather, people across the spectrum from engineers to theorists).
2) With neural networks increasing in complexity, making the learning converge is no longer as simple as just running gradient descent. In particular, something like a K-12 curriculum will probably emerge to guide the AGI past local optima. For example, the recent paper on neural Turing machines already employed curriculum learning, as the authors couldn’t get good performance otherwise (a minimal sketch of this sort of curriculum follows after point 3). So there is a nontrivial maintenance cost (in designing a curriculum) to keeping a neural network adapted to a changing environment, and that cost will not lessen unless we improve our understanding of these systems.
Of course expert systems also have maintenance costs, of a different type. But my point is that neural networks are not free lunches.
3) What caused the AI winter was that AI researchers didn’t realize how difficult it was to do what seems so natural to us—motion, language, vision, etc. They were overly optimistic because they succeeded at what is difficult for humans—chess, math, etc. I think it’s fair to say the ANNs have “swept the board” in the former category, the category of lower-level functions (machine translation, machine vision, etc), but the high-level stuff is still predominantly logical systems (formal verification, operations research, knowledge representation, etc). It’s unfortunate that the neural camp and logical camp don’t interact much, but I think it is a major objective to combine the flexibility of neural systems with the power and precision of logical systems.
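As promised under point 2, here is a minimal, generic sketch of curriculum learning: train on easy instances first and move to harder ones while warm-starting the model. The toy task, noise schedule, and model choice are invented for illustration and are not the setup from the neural Turing machine paper.

```python
# Generic curriculum-learning sketch: progressively harder "lessons",
# with the model warm-started from one lesson to the next.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def make_batch(n: int, noise: float):
    """Toy task: label = sign of a fixed linear score, corrupted by noise."""
    X = rng.normal(size=(n, 20))
    y = (X[:, :5].sum(axis=1) + noise * rng.normal(size=n) > 0).astype(int)
    return X, y

model = SGDClassifier(random_state=0)
curriculum = [0.5, 1.0, 2.0, 4.0]   # lessons of increasing difficulty (noise level)

for lesson, noise in enumerate(curriculum):
    for _ in range(20):                              # a few passes per lesson
        X, y = make_batch(256, noise)
        model.partial_fit(X, y, classes=[0, 1])      # warm-start from previous lesson
    X_hard, y_hard = make_batch(2000, curriculum[-1])
    print(f"after lesson {lesson} (noise={noise}): "
          f"accuracy on hardest data = {model.score(X_hard, y_hard):.2f}")
```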
Here is a simple general truth—the Occam simplicity prior does imply that simpler hypotheses/models are more likely, but for any simple model there is an infinite family of approximations to that model of escalating complexity. Thus more efficient approximations naturally tend to have greater code complexity, even though they approximate a much simpler model.
Schmidhuber invented something called the speed prior, which weights an algorithm according to how fast it generates the observation, not just how simple it is. He makes some ridiculous claims about our (physical) universe under the speed prior. Ostensibly one could also fold the accuracy of an approximation into the weighting to produce yet another variant of prior. (But of course all of these lose the universality enjoyed by the Occam prior.)
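For contrast, an informal gloss (a paraphrase, not Schmidhuber's exact construction): a Solomonoff/Occam-style prior weights a program p that outputs x only by its length ℓ(p), while a prior in the spirit of the speed prior additionally discounts p by (roughly) the time t(p) it needs to produce the output:

```latex
% Informal gloss only; see Schmidhuber's speed-prior papers for the precise definition.
M(x) \propto \sum_{p \,:\, U(p)=x} 2^{-\ell(p)}
\qquad\text{vs.}\qquad
S(x) \propto \sum_{p \,:\, U(p)=x} \frac{2^{-\ell(p)}}{t(p)}
```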
BTW, SOTA for Computer Go uses ConvNets (before that, it was Monte-Carlo Tree Search, IIRC): http://machinelearning.wustl.edu/mlpapers/paper_files/icml2015_clark15.pdf ;)