What this indicates is not that deep learning in particular is going to be the Game Over algorithm. Rather, the background variables are looking more like “Human neural intelligence is not that complicated and current algorithms are touching on keystone, foundational aspects of it.” What’s alarming is not this particular breakthrough, but what it implies about the general background settings of the computational universe.
You could easily transpose it for the time when Checkers or Chess programs beat professional players: back then the “keystone, foundational aspect” of intelligence was thought to be the ability to do combinatorial search in large solution spaces, and scaling up to AGI was “just” a matter of engineering better heuristics. Sure, it didn’t work on Go yet, but Go players were not using a different cortical algorithm than Chess players, were they?
Or you could transpose it for the time when MCTS Go programs reached “dan” (advanced amateur) level. They still couldn’t beat professional players, but professional players were not using a different cortical algorithm than advanced amateur players, were they?
AlphaGo succeeded at the current achievement by using artificial neural networks in a regime where they are known to do well. But this regime, and the type of games like Go, Chess, Checkers, Othello, etc., represents a small part of the range of human cognitive tasks. In fact, we probably find these kinds of board games fascinating precisely because they are very different from the usual cognitive stimuli we deal with in everyday life.
It’s tempting to assume that the “keystone, foundational aspect” of intelligence is learning essentially the same way that artificial neural networks learn. But humans can do things like “one-shot” learning, learning from weak supervision, learning in non-stationary environments, etc., which no current neural network can do, and not merely because of scale or architectural “details”. Researchers generally don’t know how to make neural networks, or really any other kind of machine learning algorithm, do these things, except with massive task-specific engineering. Thus I think it’s fair to say that we still don’t know what the foundational aspects of intelligence are.
In the brain, the same circuitry that is used to solve vision is used to solve most of the rest of cognition—vision is 10% of the cortex. Going from superhuman vision to superhuman Go suggests superhuman anything/everything is getting near.
The reason being that strong Go requires both deep slow inference over huge data/time (which DL excels in, similar to what the cortex/cerebellum specialize in), combined with fast/low data inference (the MCTS part here). There is still much room for improvement in generalizing beyond current MCTS techniques, and better integration into larger scale ANNs, but that is increasingly looking straightforward.
It’s tempting to assume that the “keystone, foundational aspect” of intelligence is learning essentially the same way that artificial neural networks learn.
Yes, but only because “ANN” is enormously broad (tensor/linear algebra program space), and basically includes all possible routes to AGI (all possible approximations of bayesian inference).
But humans can do things like “one-shot” learning, learning from weak supervision, learning in non-stationary environments, etc., which no current neural network can do, and not merely because of scale or architectural “details”.
Bayesian methods excel at one-shot learning, and are steadily integrating themselves into ANN techniques (providing the foundation needed to derive new learning and inference rules). Transfer and semi-supervised learning are also progressing rapidly and the theory is all there. I don’t know as much about the non-stationary case, but I’d be pretty surprised if there wasn’t progress there as well.
Thus I think it’s fair to say that we still don’t know what the foundational aspects of intelligence are.
LOL. Generalized DL + MCTS is—rather obviously—a practical approximation of universal intelligence like AIXI. I doubt MCTS scales to all domains well enough, but the obvious next step is for DL to eat MCTS techniques (so that new, more complex heuristic search techniques can be learned automatically).
In the brain, the same circuitry that is used to solve vision is used to solve most of the rest of cognition
And in a laptop the same circuitry that is used to run a spreadsheet is used to play a video game.
Systems that are Turing-complete (in the limit of infinite resources) tend to have an independence between hardware and possibly many layers of software (program running on VM running on VM running on VM and so on). Things that look similar at some levels may have lots of difference at other levels, and thus things that look simple at some levels can have lots of hidden complexity at other levels.
Going from superhuman vision
Human-level (perhaps weakly superhuman) vision is achieved only in very specific tasks where large supervised datasets are available. This is not very surprising, since even traditional “hand-coded” computer vision could achieve superhuman performance in some narrow and clearly specified tasks.
Yes, but only because “ANN” is enormously broad (tensor/linear algebra program space), and basically includes all possible routes to AGI (all possible approximations of bayesian inference).
Again, ANNs are Turing-complete, therefore in principle they include literally everything, but so does the brute-force search of C programs.
In practice if you try to generate C programs by brute-force search you will get stuck pretty fast, while ANNs with gradient descent training empirically work well on various kinds of practical problems, but not on all kinds of practical problems that humans are good at, and how to make them work on these problems, if it is even efficiently possible, is a whole open research field.
Bayesian methods excel at one-shot learning
With lots of task-specific engineering.
Generalized DL + MCTS is—rather obviously—a practical approximation of universal intelligence like AIXI.
So are things like AIXI-tl, Hutter-search, Gödel machine, and so on. Yet I would not consider any of them as the “foundational aspect” of intelligence.
And in a laptop the same circuitry that is used to run a spreadsheet is used to play a video game.
Exactly, and this is a good analogy to illustrate my point. Discovering that the cortical circuitry is universal vs task-specific (like an ASIC) was a key discovery.
Human-level (perhaps weakly superhuman) vision is achieved only in very specific tasks where large supervised datasets are available.
Note that I didn’t say we have solved vision to a superhuman level. But the quoted claim is simply not true: current SOTA nets can achieve human-level performance in at least some domains using modest amounts of unsupervised data combined with small amounts of supervised data.
Human vision builds on enormous amounts of unsupervised data—much larger than ImageNet. Learning in the brain is complex and multi-objective, but perhaps best described as self-supervised (unsupervised meta-learning of sub-objective functions which then can be used for supervised learning).
A five year old will have experienced perhaps 50 million seconds worth of video data. ImageNet consists of 1 million images, which is vaguely equivalent to 1 million seconds of video if we include 30x amplification for small translations/rotations.
The brain’s vision system is about 100x larger than current ‘large’ vision ANNs. But if DeepMind decided to spend the cash on that and make it a huge one-off research priority, do you really doubt that they could build a superhuman general vision system that learns with a similar dataset and training duration?
So are things like AIXI-tl, Hutter-search, Gödel machine, and so on. Yet I would not consider any of them as the “foundational aspect” of intelligence.
The foundation of intelligence is just inference—simply because universal inference is sufficient to solve any other problem. AIXI is already simple, but you can make it even simpler by replacing the planning component with inference over high EV actions, or even just inference over program space to learn approx planning.
So it all boils down to efficient inference. The new exciting progress in DL—for me at least—is in understanding how successful empirical optimization techniques can be derived as approx inference update schemes with various types of priors. This is what I referred to as new and upcoming “Bayesian methods”—bayesian grounded DL.
Yes, but only because “ANN” is enormously broad (tensor/linear algebra program space), and basically includes all possible routes to AGI (all possible approximations of bayesian inference).
“Enormously broad” is just another way of saying “not very useful”. We don’t even know in which sense (if any) the “deep networks” that are used in practice may be said to approximate Bayesian inference; the best we can do, AIUI, is make up a hand-wavy story about how they must be some “hierarchical” variation of single-layer networks, i.e. generalized linear models.
Specifically I meant approx bayesian inference over the tensor program space to learn the ANN, not that the ANN itself needs to implement bayesian inference (although they will naturally tend to learn that, as we see in all the evidence for various bayesian ops in the brain).
I agree. I don’t find this result to be any more or less indicative of near-term AI than Google’s success on ImageNet in 2012. The algorithm learns to map positions to moves and values using CNNs, just as CNNs can be used to learn mappings from images to 350 classes of dog breeds and more. It turns out that Go really is a game about pattern recognition and that with a lot of data you can replicate the pattern detection for good moves in very supervised ways (one could call their reinforcement learning actually supervised because the nature of the problem gives you credit assignment for free).
I think what this result says is thus: “Any tasks humans can do, an AI can now learn to do better, given a sufficient source of training data.”
Games lend themselves to auto-generation of training data, in the sense that the AI can at the very least play against itself. No matter how complex the game, a deep neural net will find the structure in it, and find a deeper structure than human players can find.
We have now answered the question of, “Are deep neural nets going to be sufficient to match or exceed task-specific human performance at any well-specified task?” with “Yes, they can, and they can do it better and faster than we suspected.” The next hurdle—which all the major companies are working on—is to create architectures that can find structure in smaller datasets, less well-tailored training data, and less well-specified tasks.
I included the word “sufficient” as an ass-covering move, because one facet of the problem is we don’t really know what will serve as a “sufficient” amount of training data in what context.
But, what specific types of tasks do you think machines still can’t do, given sufficient training data? If your answer is something like “physics research,” my rejoinder would be that if you could generate training data for that job, a machine could do it.
Grand pronouncements with an ass-covering move look silly :-)
One obvious problem is that you are assuming stability. Consider modeling something that changes (in complex ways) with time—like the economy of the United States. Is “training data” from the 1950s relevant to the current situation?
Generally speaking, the speed at which your “training data” gets stale puts an upper limit on the relevant data that you can possibly have and that, in turn, puts an upper limit on the complexity of the model (NNs included) that you can build on its basis.
I don’t see how we know anything like “deep NNs with ‘sufficient training data’ would be sufficient for all problems.” We’ve seen them be sufficient for many different problems and can expect them to be sufficient for many more, but all?
I think what this result says is thus: “Any tasks humans can do, an AI can now learn to do better, given a sufficient source of training data.”
Yes, but that would likely require an extremely large amount of training data because to prepare actions for many kinds of situations you’d have an exponential blow-up to cover many combinations of many possibilities, and hence the model would need to be huge as well. It also would require high-quality data sets with simple correction signals in order to work, which are expensive to produce.
I think, above all, for building a real-time AI you need reuse of concepts so that abstractions can be recombined and adapted to new situations; and for concept-based predictions (reasoning) you need one-shot learning so that trains of thought can be memorized and built upon. In addition, the entire network needs to learn somehow to determine which parts of the network in the past were responsible for current reward signals which are delayed and noisy. If there is a simple and fast solution to this, then AGI could be right around the corner. If not, it could take several decades of research.
In addition, the entire network needs to learn somehow to determine which parts of the network in the past were responsible for current reward signals which are delayed and noisy.
This is a well-known problem, called reinforcement learning. It is a significant component in the reported results. (What happens in practice is that a network’s ability to assign “credit” or “blame” for reward signals falls off exponentially with increasing delay. This is a significant limitation, but reinforcement learning is nevertheless very helpful given tight feedback loops.)
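To make that exponential fall-off concrete, here is a minimal sketch in Python (the discount factor and reward sequence are illustrative assumptions, not anything from the paper):

```python
def discounted_returns(rewards, gamma=0.99):
    """For each timestep, the discounted sum of future rewards.

    Credit for a late reward decays as gamma**delay, which is the
    exponential fall-off with delay described above.
    """
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# A reward of +1 at the end of a 100-step episode contributes only
# gamma**99 (about 0.37) to the credit of the very first action.
print(discounted_returns([0.0] * 99 + [1.0])[0])
```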
Yes, but as I wrote above, the problems of credit assignment, reward delay and noise are non-existent in this setting, and hence their work does not contribute at all to solving AI.
Reward delay is not very significant in this task, since the task is episodic and fully observable, and there is no time preference, thus you can just play a game to completion without updating and then assign the final reward to all the positions.
In more general reinforcement learning settings, where you want to update your policy during the execution, you have to use some kind of temporal difference learning method, which is further complicated if the world states are not fully observable.
Credit assignment is taken care of by backpropagation, as usual in neural networks. I don’t know why RaelwayScot brought it up, unless they meant something else.
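Concretely, the episodic setup described above amounts to something like the following sketch (Python; `play_self_play_game` and `update_network` are hypothetical placeholders, not DeepMind’s actual training code):

```python
def train_on_one_episode(policy_net, play_self_play_game, update_network):
    # 1. Play one game to completion without any intermediate updates.
    positions, moves, z = play_self_play_game(policy_net)  # z is +1 for a win, -1 for a loss

    # 2. Because the task is episodic, fully observable, and has no time
    #    preference, every (position, move) pair simply receives the final
    #    outcome z as its reward.
    for position, move in zip(positions, moves):
        update_network(policy_net, position, move, reward=z)
```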
I meant that for AI we will possibly require high-level credit assignment, e.g. experiences of regret like “I should be more careful in these kinds of situations”, or the realization that one particular strategy out of the entire sequence of moves worked out really nicely. Instead it penalizes/reinforces all moves of one game equally, which is potentially a much slower learning process. It turns out playing Go can be solved without much structure for the credit assignment processes, hence I said the problem is non-existent, i.e. there wasn’t even a need to consider it and further our understanding of RL techniques.
thus you can just play a game to completion without updating and then assign the final reward to all the positions.
Agreed, with the caveat that this is a stochastic object, and thus not a fully simple problem. (Even if I knew all possible branches of the game tree that originated in a particular state, I would need to know how likely any of those branches are to be realized in order to determine the current value of that state.)
Even if I knew all possible branches of the game tree that originated in a particular state, I would need to know how likely any of those branches are to be realized in order to determine the current value of that state.
Well, the value of a state is defined assuming that the optimal policy is used for all the following actions. For tabular RL you can actually prove that the updates converge to the optimal value function/policy function (under some conditions). If NN are used you don’t have any convergence guarantees, but in practice the people at DeepMind are able to make it work, and this particular scenario (perfect observability, determinism and short episodes) is simpler than, for instance that of the Atari DQN agent.
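For the tabular case the convergence is easy to see in a toy example (the tiny made-up MDP below is purely illustrative):

```python
# Value iteration on a made-up 3-state MDP; repeated sweeps contract
# toward the unique optimal value function.
transitions = {
    0: {"a": (1, 0.0), "b": (2, 0.5)},   # state 0: action -> (next_state, reward)
    1: {"a": (2, 1.0), "b": (0, 0.0)},
}                                        # state 2 is terminal
gamma = 0.9
V = {0: 0.0, 1: 0.0, 2: 0.0}

for _ in range(100):
    for s, actions in transitions.items():
        V[s] = max(r + gamma * V[s_next] for (s_next, r) in actions.values())

print(V)  # converges to V[0] = 0.9, V[1] = 1.0, V[2] = 0.0
```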
“Nonexistent problems” was meant as a hyperbole to say that they weren’t solved in interesting ways and are extremely simple in this setting because the states and rewards are noise-free. I am not sure what you mean by the second question. They just apply gradient descent on the entire history of moves of the current game such that expected reward is maximized.
It seems to me that the problem of value assignment to boards—”What’s the edge for W or B if the game state looks like this?” is basically a solution to that problem, since it gives you the counterfactual information you need (how much would placing a stone here improve my edge?) to answer those questions.
I agree that it’s a much simpler problem here than it is in a more complicated world, but I don’t think it’s trivial.
There are other big deals. The MS ImageNet win also contained frightening progress on the training meta level.
The other issue is that constructing this kind of mega-neural net is tremendously difficult. Landing on a particular set of algorithms—determining how each layer should operate and how it should talk to the next layer—is an almost epic task. But Microsoft has a trick here, too. It has designed a computing system that can help build these networks.
As Jian Sun explains it, researchers can identify a promising arrangement for massive neural networks, and then the system can cycle through a range of similar possibilities until it settles on this best one. “In most cases, after a number of tries, the researchers learn [something], reflect, and make a new decision on the next try,” he says. “You can view this as ‘human-assisted search.’”
Going by that description, it is much much less important than residual learning, because hyperparameter optimization is not new. There are a lot of approaches: grid search, random search, Gaussian processes. Some hyperparameter optimizations baked into MSR’s deep learning framework would save some researcher time and effort, certainly, but I don’t know that it would’ve made any big difference unless they have something quite unusual going on.
(I liked one paper which took a Bayesian multi-armed bandit approach and treated error curves as partial information about final performance, and it would switch between different networks being trained based on performance, regularly ‘freezing’ and ‘thawing’ networks as the probability each network would become the best performer changed with information from additional mini-batches/epochs.) Probably the single coolest one is last year some researchers showed that it is possible to somewhat efficiently backpropagate on hyperparameters! So hyperparameters just become more parameters to learn, and you can load up on all sorts of stuff without worrying about it making your hyperparameter optimization futile or having to train a billion times, and it would both save people a lot of time (for using vanilla networks) and allow exploring extremely complicated and heavily parameterized families of architectures, and would be a big deal. Unfortunately, it’s still not efficient enough for the giant networks we want to train. :(
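For reference, plain random search—one of the standard approaches mentioned above—is only a few lines; `train_and_score` here is a hypothetical stand-in for training a network with a given configuration and returning its validation score, and the parameter ranges are made up:

```python
import random

def random_search(train_and_score, n_trials=20):
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        # Sample hyperparameters from broad, roughly log-uniform ranges.
        config = {
            "learning_rate": 10 ** random.uniform(-5, -1),
            "depth": random.randint(10, 150),
            "weight_decay": 10 ** random.uniform(-6, -2),
        }
        score = train_and_score(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```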
A step which was taken a long time ago and does not seem to have played much of a role in recent developments; for the most part, people don’t bother with extensive hyperparameter tuning. Better initialization, better algorithms like dropout or residual learning, better architectures, but not hyperparameters.
It’s a big deal for Go, but I don’t think it’s a very big deal for AGI.
Conceptually Go is like Chess or Checkers: fully deterministic, perfect information two-player games.
Go is more challenging for computers because the search space (and in particular the average branching factor) is larger and known position evaluation heuristics are not as good, so traditional alpha-beta minimax search becomes infeasible.
The first big innovation, already put into use by most Go programs for a decade (although the idea is older) was Monte Carlo tree search, which addresses the high branching factor issue: while traditional search either does not expand a node or expands it and recursively evaluates all its children, MCTS stochastically evaluates nodes with a probability that depends on how promising they look, according to some heuristic.
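A bare-bones sketch of the idea (Python; the `expand` and `rollout` functions are generic placeholders, not a real Go engine, and real programs add playout policies, priors, and many refinements):

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value_sum = [], 0, 0.0

def uct_select(node, c=1.4):
    # Favor children that look promising (high mean value) but give a
    # bonus to rarely visited ones, instead of expanding everything.
    return max(
        node.children,
        key=lambda ch: ch.value_sum / (ch.visits + 1e-9)
        + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)),
    )

def mcts_iteration(root, expand, rollout):
    node = root
    while node.children:                       # 1. selection down the tree
        node = uct_select(node)
    for child_state in expand(node.state):     # 2. expansion of the leaf
        node.children.append(Node(child_state, parent=node))
    leaf = random.choice(node.children) if node.children else node
    reward = rollout(leaf.state)               # 3. stochastic evaluation (playout or heuristic)
    while leaf is not None:                    # 4. back up the result
        leaf.visits += 1
        leaf.value_sum += reward
        leaf = leaf.parent
```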
DeepMind’s innovation consists in using a NN to learn a good position evaluation heuristic in a supervised fashion from a large database of professional games, refining it with reinforcement learning in “greedy” self-play mode and then using both the refined heuristic and the supervised heuristic in a MCTS engine.
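Schematically, the pipeline looks roughly like this (each argument below is a hypothetical stand-in for a stage described in the paper, not DeepMind’s actual API):

```python
def build_alphago_style_player(expert_games, train_supervised_policy,
                               refine_by_self_play, fit_value_function, mcts_search):
    policy_sl = train_supervised_policy(expert_games)   # imitate professional moves
    policy_rl = refine_by_self_play(policy_sl)          # "greedy" self-play reinforcement learning
    value_net = fit_value_function(policy_rl)           # learn to predict game outcomes
    # The search combines the supervised policy (move priors) with the value network.
    return lambda position: mcts_search(position, prior=policy_sl, value=value_net)
```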
Their approach essentially relies on big data and big hardware. From an engineering point of view, it is a major advancement of neural network technology because of the sheer scale and in particular the speed of the thing, which required significant non-trivial parallelization, but the core techniques aren’t particularly new and I doubt that they can scale well to more general domains with non-determinism and partial observability. However, neural networks may be more robust to noise and certain kinds of disturbances than hand-coded heuristics, so take this with a grain of salt.
So, to the extent that AGI will rely on large and fast neural networks, this work is a significant step towards practical AGI engineering, but to the extent that AGI will rely on some “master algorithm” this work is probably not a very big step towards the discovery of such algorithm, at least compared to previously known techniques.
I think it is a bigger deal than chess because it doesn’t use brute force but mostly unsupervised learning. It is not the breakthrough in AGI, but it is telling that this approach thoroughly beats all the other Go algorithms (1 out of 500 games lost, even with a handicap of 4), and they say that it still improves with training.
I wouldn’t say that it’s “mostly unsupervised” since a crucial part of their training is done in a traditional supervised fashion on a database of games by professional players.
But it’s certainly much more automated than having a hand-coded heuristic.
Humans also learn extensively by studying the games of experts. In Japan/China, even fans follow games from newspapers.
A game might take an hour on average. So a pro with 10 years of experience may have played/watched upwards of 10,000 games. However, it takes much less time to read a game that has already been played—so a 10 year pro may be familiar with say 100,000 games. Considering that each game has 200+ moves, that roughly is a training set of order 2 to 20 million positions.
AlphaGo’s training set consisted of 160,000 games with 29 million positions, so the upper end estimate for humans is similar. More importantly, the human training set is far more carefully curated and thus of higher quality.
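The arithmetic, using the rough estimates above:

```python
games_read = 100_000                 # upper-end estimate for a 10-year pro
moves_per_game = 200
print(games_read * moves_per_game)   # 20,000,000 positions -- same order as AlphaGo's ~29 million
```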
so a 10 year pro may be familiar with say 100,000 games.
That’s 27.4 games a day, on average. I think this is an overestimate.
It was my upper bound estimate, and if anything it was too low.
A pro will grow up in a dedicated go school where there are hundreds of other players just playing go and studying go all day. Some students will be playing speed games, and some will be flipping through summaries of historical games in books/magazines and or on the web.
When not playing, people will tend to walk around and spectate the other games (nowadays this is also trivial to do online). An experienced player can reconstruct some of the move history by just glancing at the board.
So if anything, 27.4 games watched/skimmed/experienced per day is too low for the upper estimate.
An East Asian Go pro will often have been an insei and been studying Go full-time at a school, and a dedicated amateur before that, so you can imagine how many hours a day they will be studying… (The intensiveness is part of why they dominate Go to the degree they do and North American & Europeans are so much weaker: start a lot of kids, start them young, school them 10 hours a day for years studying games and playing against each other and pros, and keep relentlessly filtering to winnow out anyone who is not brilliant.)
I would say 100k is an overestimate since they will tend to be more closely studying the games and commentaries and also working out life-and-death problems, memorizing the standard openings, and whatnot, but they are definitely reading through and studying tens of thousands of games—similar to how one of the reasons chess players are so much better these days than even just decades ago is that computers have given access to enormous databases of games which can be studied with the help of chess AIs (Carlsen has benefited a lot from this, I understand). Also, while I’m nitpicking, AlphaGo trained on both the KGS and then self-play; I don’t know how many games the self-play amounted to, but the appendix broke down the wallclock times by phase, and of the 4 weeks of wallclock time, IIRC most of it was spent on the self-play finetuning the value function.
But if AlphaGo is learning from games ‘only’ more efficiently than 99%+ of the humans who play Go (Fan Hui was ranked in the 600s, there’s maybe 1000-2000 people who earn a living as Go professionals, selected from the hundreds of thousands/millions of people who play), that doesn’t strike me as much of a slur.
For the SL phase, they trained 340 million updates with a batch size of 16, so 5.4 billion position-updates. However the database had only 29 million unique positions. That’s about 200 gradient iterations per unique position.
The self-play RL phase for AlphaGo consisted of 10,000 minibatches of 128 games each, so about 1 million games total. They only trained that part for a day.
They spent more time training the value network: 50 million minibatches of 32 board positions, so about 1.6 billion positions. That’s still much smaller than the SL training phase.
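Checking the arithmetic on those figures:

```python
sl_updates, sl_batch, unique_positions = 340_000_000, 16, 29_000_000
print(sl_updates * sl_batch)                      # 5.44e9 position-updates in the SL phase
print(sl_updates * sl_batch / unique_positions)   # ~190 gradient passes per unique position

rl_minibatches, rl_games_per_batch = 10_000, 128
print(rl_minibatches * rl_games_per_batch)        # ~1.3 million self-play games

value_minibatches, value_batch = 50_000_000, 32
print(value_minibatches * value_batch)            # 1.6 billion positions for the value network
```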
I’m referring to figure 1a on page 4 and the explanation below. I can’t be sure but the self-play should be contributing a large part to the training and can go on and improve the algorithm even if the expert database stays fixed.
They spent three weeks to train the supervised policy and one day to train the reinforcement learning policy starting from the supervised policy, plus an additional week to extract the value function from the reinforcement learning policy (pages 25-26).
In the final system the only part that depends on RL is the value function. According to figure 4, if the value function is taken out the system still plays better than any other Go program, though worse than the human champion.
Therefore I would say that the system heavily depends on supervised training on a human-generated dataset. RL was needed to achieve the final performance, but it was not the most important ingredient.
How big a deal is this? What, if anything, does it signal about when we get smarter than human AI?
It shows that Monte-Carlo tree search meshes remarkably well with neural-network-driven evaluation (“value networks”) and decision pruning/policy selection (“policy networks”). This means that if you have a planning task to which MCTS can be usefully applied, and sufficient data to train networks for state-evaluation and policy selection, and substantial computation power (a distributed cluster, in AlphaGo’s case), you can significantly improve performance on your task (from “strong amateur” to “human champion” level). It’s not an AGI-complete result however, any more than Deep-Blue or TD-gammon were AGI-complete.
The “training data” factor is a biggie; we lack this kind of data entirely for things like automated theorem proving, which would otherwise be quite amenable to this ‘planning search + complex learned heuristics’ approach. In particular, writing provably-correct computer code is a minor variation on automated theorem proving. (Neural networks can already write incorrect code, but this is not good enough if you want a provably Friendly AGI.)
The interesting thing about that RNN that you linked that writes code, is that it shouldn’t work at all. It was just given text files of code and told to predict the next character. It wasn’t taught how to program, it never got to see an interpreter, it doesn’t know any English yet has to work with English variable names, and it only has a few hundred neurons to represent its entire knowledge state.
The fact that it is even able to produce legible code is amazing, and suggests that we might not be that far off from NNs that can write actually usable code. Still some ways away, but not multiple decades.
The fact that it is even able to produce legible code is amazing
Somewhat. Look at what happens when you generate code from a simple character-level Markov language model (that’s just a look up table that gives the probability of the next character conditioned on the last n characters, estimated by frequency counts on the training corpus).
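For concreteness, such a model is only a few lines of Python (a minimal sketch of the general idea, not the exact model used in that comparison):

```python
from collections import defaultdict, Counter
import random

def train_markov(text, n=20):
    # Lookup table: last n characters -> counts of the next character.
    table = defaultdict(Counter)
    for i in range(len(text) - n):
        table[text[i:i + n]][text[i + n]] += 1
    return table

def generate(table, seed, length=500, n=20):
    out = seed
    for _ in range(length):
        counts = table.get(out[-n:])
        if not counts:            # unseen context: the model has no prediction at all
            break
        chars, weights = zip(*counts.items())
        out += random.choices(chars, weights=weights)[0]
    return out
```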
An order-20 language model generates fairly legible code, with sensible use of keywords, identifier names and even comments. The main difference with the RNN language model is that the RNN learns to do proper indentation and bracket matching, while the Markov model can’t do it except at short range.
While, as remarked by Yoav Goldberg, it is impressive that the RNN could learn to do this, learning to match brackets and indent blocks seems very far from learning to write correct and purposeful code.
Anyway, this code generation example is pretty much of a stunt, not a very interesting task. If you gave the Linux kernel source code to a human who has never programmed and doesn’t speak English and asked them to write something that looks like it, I doubt that they would be able to do much better.
Better examples of code generation using NNs (actually, log-bilinear models) or Bayesian models exist (ref, ref). In these works syntactic correctness is already guaranteed and the ML model only focuses on semantics.
The difference with Markov models is they tend to overfit at that level. At 20 characters deep, you are just copy and pasting large sections of existing code and language. Not generating entirely unseen samples. You can do a similar thing with RNNs, by training them only on one document. They will be able to reproduce that document exactly, but nothing else.
To properly compare with a Markov model, you’d need to first tune it so it doesn’t overfit. That is, when it’s looking at an entirely unseen document, its guess of what the next character should be is most likely to be correct. The best setting for that is probably only 3-5 characters, not 20. And when you generate from that, the output will be much less legible. (And even that’s kind of cheating, since Markov models can’t give any prediction for sequences they’ve never seen before.)
Generating samples is just a way to see what patterns the RNN has learned. And while it’s far from perfect, it’s still pretty impressive. It’s learned a lot about syntax, a lot about variable names, a lot about common programming idioms, and it’s even learned some English from just code comments.
The best setting for that is probably only 3-5 characters, not 20.
In NLP applications where Markov language models are used, such as speech recognition and machine translation, the typical setting is 3 to 5 words. 20 characters correspond to about 4 English words, which is in this range.
Anyway, I agree that in this case the order-20 Markov model seems to overfit (Googling some lines from the snippets in the post often locates them in an original source file, which doesn’t happen as often with the RNN snippets). This may be due to the lack of regularization (“smoothing”) in the probability estimation and the relatively small size of the training corpus: 474 MB versus the >10 GB corpora which are typically used in NLP applications. Neural networks need lots of data, but still less than plain look-up tables.
This is a big deal, and it is another sign that AGI is near.
Intelligence boils down to inference. Go is an interesting case because good play for both humans and bots like AlphaGo requires two specialized types of inference operating over very different timescales:
rapid combinatoric inference over move sequences during a game (planning). AlphaGo uses MCTS for this, whereas the human brain uses a complex network of modules involving the basal ganglia, hippocampus, and PFC.
slow deep inference over a huge amount of experience to develop strong pattern recognition and intuitions (deep learning). AlphaGo uses deep supervised and reinforcement learning via SGD over a CNN for this. The human brain uses the cortex.
Machines have been strong in planning/search style inference for a while. It is only recently that the slower learning component (2nd order inference over circuit/program structure) is starting to approach and surpass human level.
Critics like to point out that DL requires tons of data, but so does the human brain. A more accurate comparison requires quantifying the dataset human pro go players train on.
A 30-year-old Asian pro will have perhaps 40,000 hours of playing experience (20 years × 50 weeks/year × 40 hrs/week). The average game duration is perhaps an hour and consists of 200 moves. In addition, pros (and even fans) study published games. Reading a game takes less time, perhaps as little as 5 minutes or so.
So we can estimate very roughly that a top pro will have absorbed between 100,000 and 1 million games, and between 20 and 200 million individual positions (around 200 moves per game).
AlphaGo was trained on the KGS dataset: 160,000 games and 29 million positions. So it did not train on significantly more data than a human pro. The data quantities are actually very similar.
Furthermore, the human’s dataset is perhaps of better quality for a pro, as they will be familiar with mainly pro level games, whereas the AlphaGo dataset is mostly amateur level.
The main difference is speed. The human brain’s ‘clockrate’ or equivalent is about 100 hz, whereas AlphaGo’s various CNNs can run at roughly 1000hz during training on a single machine, and perhaps 10,000 hz equivalent distributed across hundreds of machines. 40,000 hours—a lifetime of experience—can be compressed 100x or more into just a couple of weeks for a machine. This is the key lesson here.
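The back-of-the-envelope numbers behind that claim:

```python
hours_experience = 20 * 50 * 40                  # 20 years x 50 weeks/year x 40 hrs/week = 40,000 hours
serial_speedup = 100                             # rough machine-over-brain speedup assumed above
machine_hours = hours_experience / serial_speedup
print(machine_hours, machine_hours / (24 * 7))   # 400 hours, i.e. about 2.4 weeks
```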
The classification CNN trained on KGS was run for 340 million steps, which is about 10 iterations per unique position in the database.
The ANNs that AlphaGo uses are much much smaller than a human brain, but the brain has to do a huge number of other tasks, and also has to solve complex vision and motor problems just to play the game. AlphaGo’s ANNs get to focus purely on Go.
A few hundred TitanX’s can muster up perhaps a petaflop of compute. The high-end estimate of the brain is 10 petaflops (100 trillion synapses × 100 Hz max firing rate). The more realistic estimate is 100 teraflops (100 trillion synapses × 1 Hz avg firing rate), and the lower end is 1/10 that or less.
So why is this a big deal? Because it suggests that training a DL AI to master more economically key tasks, such as becoming an expert level programmer, could be much closer than people think.
The techniques used here are nowhere near their optimal form yet in terms of efficiency. When Deep Blue beat Kasparov in 1997, it required a specialized supercomputer and a huge team. 10 years later chess bots written by individual programmers running on modest PCs soared past Deep Blue—thanks to more efficient algorithms and implementations.
A 30-year-old Asian pro will have perhaps 40,000 hours of playing experience (20 years × 50 weeks/year × 40 hrs/week). The average game duration is perhaps an hour and consists of 200 moves. In addition, pros (and even fans) study published games. Reading a game takes less time, perhaps as little as 5 minutes or so.
So we can estimate very roughly that a top pro will have absorbed between 100,000 and 1 million games, and between 20 and 200 million individual positions (around 200 moves per game).
At least the order of magnitude should be more or less right. Hours of playing weekly is probably somewhat lower on average (say 20-30 hours), and I’d also use 10-15 minutes to read a game instead of five. Just 300 seconds to place 200 stones sounds pretty tough. Still, I’d imagine that a 30-year-old professional has seen at least 50 000 games, and possibly many more.
Critics like to point out that DL requires tons of data, but so does the human brain.
Both deep networks and the human brain require lots of data, but the kind of data they require is not the same. Humans engage mostly in semi-supervised learning, where supervised data comprises a small fraction of the total. They also manage feats of “one-shot learning” (making critically-important generalizations from single datapoints) that are simply not feasible for neural networks or indeed other ‘machine learning’ methods.
A few hundred TitanX’s can muster up perhaps a petaflop of compute.
Could you elaborate? I think this number is too high by roughly one order of magnitude.
The high end estimate of the brain is 10 petaflops (100 trillion synapses * 100 hz max firing rate).
Estimating the computational capability of the human brain is very difficult. Among other things, we don’t know what the neuroglia cells may be up to, and these are just as numerous as neurons.
Both deep networks and the human brain require lots of data, but the kind of data they require is not the same. Humans engage mostly in semi-supervised learning, where supervised data comprises a small fraction of the total.
This is probably a misconception for several reasons. Firstly, given that we don’t fully understand the learning mechanisms in the brain yet, it’s unlikely that it’s mostly one thing. Secondly, we have some pretty good evidence for reinforcement learning in the cortex, hippocampus, and basal ganglia. We have evidence for internally supervised learning in the cerebellum, and unsupervised learning in the cortex.
The point being: these labels aren’t all that useful. Efficient learning is multi-objective and doesn’t cleanly divide into these narrow categories.
The best current guess for questions like this is almost always to guess that the brain’s solution is highly efficient, given its constraints.
In the situation where a go player experiences/watches a game between two other players far above one’s own current skill, the optimal learning update is probably going to be a SL style update. Even if you can’t understand the reasons behind the moves yet, it’s best to compress them into the cortex for later. If you can do a local search to understand why the move is good, then that is even better and it becomes more like RL, but again, these hard divisions are arbitrary and limiting.
A few hundred TitanX’s can muster up perhaps a petaflop of compute.
Could you elaborate? I think this number is too high by roughly one order of magnitude.
The GTX TitanX has a peak perf of 6.1 teraflops, so you’d need only a few hundred to get a petaflop supercomputer (more specifically, around 175).
The high end estimate of the brain is 10 petaflops (100 trillion synapses * 100 hz max firing rate).
Estimating the computational capability of the human brain is very difficult. Among other things, we don’t know what the neuroglia cells may be up to, and these are just as numerous as neurons.
It’s just a circuit, and it obeys the same physical laws. We have this urge to mystify it for various reasons. Neuroglia can not possibly contribute more to the total compute power than the neurons, based on simple physics/energy arguments. It’s another stupid red herring like quantum woo.
These estimates are only validated when you can use them to make predictions. And if you have the right estimates (brain equivalent to 100 teraflops-ish, give or take an order of magnitude), you can roughly predict the outcome of many comparisons between brain circuits vs equivalent ANN circuits (more accurately than using the wrong estimates).
This is probably a misconception for several reasons. Firstly, given that we don’t fully understand the learning mechanisms in the brain yet, it’s unlikely that it’s mostly one thing …
We don’t understand the learning mechanisms yet, but we’re quite familiar with the data they use as input. “Internally” supervised learning is just another term for semi-supervised learning anyway. Semi-supervised learning is plenty flexible enough to encompass the “multi-objective” features of what occurs in the brain.
The GTX TitanX has a peak perf of 6.1 teraflops, so you’d need only a few hundred to get a petaflop supercomputer (more specifically, around 175).
Raw and “peak performance” FLOPS numbers should be taken with a grain of salt. Anyway, given that a TitanX apparently draws as much as 240W of power at full load, your “petaflop-scale supercomputer” will cost you a few hundred-thousand dollars and draw 42kW to do what the brain does within 20W or so. Not a very sensible use for that amount of computing power—except for the odd publicity stunt, I suppose. Like playing Go.
It’s just a circuit, and it obeys the same physical laws.
Of course. Neuroglia are not magic or “woo”. They’re physical things, much like silicon chips and neurons.
Raw and “peak performance” FLOPS numbers should be taken with a grain of salt.
Yeah, but in this case the best convolution and gemm codes can reach like 98% efficiency for the simple standard algorithms and dense input—which is what most ANNs use for about everything.
given that a TitanX apparently draws as much as 240W of power at full load, your “petaflop-scale supercomputer” will cost you a few hundred-thousand dollars and draw 42kW to do what the brain does within 20W or so
Well, in this case of Go and for an increasing number of domains, it can do far more than any brain—learns far faster. Also, the current implementations are very very far from optimal form. There is at least another 100x to 1000x easy perf improvement in the years ahead. So what 100 gpus can do now will be accomplished by a single GPU in just a year or two.
It’s just a circuit, and it obeys the same physical laws.
Of course. Neuroglia are not magic or “woo”. They’re physical things, much like silicon chips and neurons.
Right, and they use a small fraction of the energy budget, and thus can’t contribute much to the computational power.
Well, in this case of Go and for an increasing number of domains, it can do far more than any brain—learns far faster.
This might actually be the most interesting thing about AlphaGo. Domain experts who have looked at its games have marveled most at how truly “book-smart” it is. Even though it has not shown a lot of creativity or surprising moves (indeed, it was comparatively weak at the start of Game 1), it has fully internalized its training and can always come up with the “standard” play.
Right, and they use a small fraction of the energy budget, and thus can’t contribute much to the computational power.
Not necessarily—there might be a speed vs. energy-per-op tradeoff, where neurons specialize in quick but energy-intensive computation, while neuroglia just chug along in the background. We definitely see such a tradeoff in silicon devices.
Domain experts who have looked at its games have marveled most at how truly “book-smart” it is. Even though it has not shown a lot of creativity or surprising moves (indeed, it was comparatively weak at the start of Game 1), it has fully internalized its training and can always come up with the “standard” play.
Do you have links to such analyses? I’d be interested in reading them.
Doesn’t a similar criticism apply to ML researchers who claim not to fear AI? (i.e. it would be inconvenient for them if it became widely thought that ML research was dangerous).
Not really. I mean, yes, in principle. But in practice, EY relies on people having UFAI as a “live issue” to keep MIRI going. ML researchers are not worried about funding cuts due to UFAI fears. They are worried about Congress being dysfunctional, etc. My personal funding situation will be affected by any way this argument plays out not at all.
If, despite lots of effort, we couldn’t create a program that could beat any human in go, wouldn’t this be evidence that we were far away from creating smarter-than-human AI?
No, I just remember my AI history (TD gammon, etc.) The question you should be asking is: “is there any evidence that will result in EY ceasing to urgently ask for your money?”
I actually think self-driving cars are more interesting than strong go playing programs (but they don’t worry me much either).
I guess I am not sure why I should pay attention to EY’s opinion on this. I do ML-type stuff for a living. Does EY have an unusual track record for predicting anything? All I see is a long tail of vaguely silly things he says online that he later renounces (e.g. “ignore stuff EY_2004 said”). To be clear: moving away from bad opinions is great! That is not what the issue is.
edit: In general I think LW really really doesn’t listen to experts enough (I don’t even mean myself, I just mean the sensible Bayesian thing to do is to just go with the expert-opinion prior on almost everything.) EY et al. take great pains to try to move people away from that behavior, talking about how the world is mad, about civilizational inadequacy, etc. In other words, don’t trust experts, they are crazy anyways.
I’m not going to argue that you should pay attention to EY. His arguments convince me, but if they don’t convince you, I’m not gonna do any better.
What I’m trying to get at is, when you ask “is there any evidence that will result in EY ceasing to urgently ask for your money?”… I mean, I’m sure there is such evidence, but I don’t wish to speak for him. But it feels to me that by asking that question, you possibly also think of EY as the sort of person who says: “this is evidence that AI risk is near! And this is evidence that AI risk is near! Everything is evidence that AI risk is near!” And I’m pointing out that no, that’s not how he acts.
While we’re at it, this exchange between us seems relevant. (“Eliezer has said that security mindset is similar, but not identical, to the mindset needed for AI design.” “Well, what a relief!”) You seem surprised, and I’m not sure what about it was surprising to you, but I don’t think you should have been surprised.
Basically, even if you’re right that he’s wrong, I feel like you’re wrong about how he’s wrong. You seem to have a model of him which is very different from my model of him.
(Btw, his opinion seems to be that AlphaGo’s methods are what makes it more of a leap than a self-driving car or than Deep Blue, not the results. Not sure that affects your position.)
“this is evidence that AI risk is near! And this is evidence that AI risk is near! Everything is evidence that AI risk is near!” And I’m pointing out that no, that’s not how he acts.
In particular he apparently mentioned Go play as an indicator before (and assumed, as many other people did, that it was somewhat more distant) and now follows up on this threshold. What else would you expect? That he not name a limited number of relevant events (I assume that the number is limited; I didn’t know of this specific one before)?
I think you misunderstood me (but that’s my fault for being opaque, cadence is hard to convey in text). I was being sarcastic. In other words, I don’t need EY’s opinion, I can just look at the problem myself (as you guys say “argument screens authority.”)
I feel like you’re wrong about how he’s wrong.
Look, I met EY and chatted with him. I don’t think EY is “evil,” exactly, in a way that L. Ron Hubbard was. I think he mostly believes his line (but humans are great at self-deception). I think he’s a flawed person, like everyone else. It’s just that he has an enormous influence on the rationalist community that immensely magnify the damage his normal human flaws and biases can do.
I always said that the way to repair human frailty issues is to treat rationality as a job (rather than a social club), and fellow rationalists as coworkers (rather than tribe members). I also think MIRI should stop hitting people up for money and get a normal funding stream going. You know, let their ideas of how to avoid UFAI compete in the normal marketplace of ideas.
I also think MIRI should stop hitting people up for money and get a normal funding stream going. You know, let their ideas of how to avoid UFAI compete in the normal marketplace of ideas.
Currently MIRI gets their funding by 1) donations 2) grants. Isn’t that exactly what the normal funding stream for non-profits is?
Sure. Scientology probably has non-profits, too. I am not saying MIRI is anything like Scientology, merely that it isn’t enough to just determine legal status and call it a day, we have to look at the type of thing the non-profit is.
MIRI is a research group. They call themselves an institute, but they aren’t, really. Institutes are large. They are working on some neat theory stuff (from what Benja/EY explained to me) somewhat outside the mainstream. Which is great! They have some grant funding, actually, last I checked. Which is also great!
They are probably not yet financially secure to stop asking for money, which is also ok.
I think all I am saying is, in my view the success condition is they “achieve orbit” and stop asking, because basically what they are working on is considered sufficiently useful research that they can operate like a regular research group. If they never stop asking I think that’s a bit weird, because either their direction isn’t perceived as good and they can’t get enough funding bandwidth without donations, or they do have enough bandwidth but want more revenue anyway, which I personally would find super weird and unsavory.
They are probably not yet financially secure to stop asking for money, which is also ok.
Who is? Last I checked, Harvard was still asking alums for donations, which suggests to me that asking is driven by getting money more than it’s driven by needing money.
I think comparing Harvard to a research group is a type error, though. Research groups don’t typically do this. I am not going to defend Unis shaking alums down for money, especially given what they do with it.
I think comparing Harvard to a research group is a type error, though.
I know several research groups where the PI’s sole role is fundraising, despite them having much more funding than the average research group.
My point was more generic—it’s not obvious to me why you would expect groups to think “okay, we have enough resources, let’s stop trying to acquire more” instead of “okay, we have enough resources to take our ambitions to the next stage.” The American Cancer Society has about a billion dollar budget, and yet they aren’t saying “yeah, this is enough to deal with cancer, we don’t need your money.”
(It may be the case that a particular professor stops writing grant applications, because they’re limited by attention they can give to their graduate students. But it’s not like any of those professors will say “yeah, my field is big enough, we don’t need any more professor slots for my students to take.”)
In my experience, research groups exist inside universities or a few corporations like Google. The senior members are employed and paid for by the institution, and only the postgrads, postdocs, and equipment beyond basic infrastructure are funded by research grants. None of them fly “in orbit” by themselves but only as part of a larger entity. Where should an independent research group like MIRI seek permanent funding?
By “in orbit” I mean “funded by grants rather than charity.” If a group has a steady grant research stream, that means they are doing good enough work that funding agencies continue to give them money. This is the standard way to be self-sustaining for a research group.
This is a good question. I think it would take lots of funding incentives to build integrated systems (like self-driving cars, but for other domains) and enough of a talent pipeline to start making that stuff happen and create incremental improvements. People in general underestimate the systems engineering aspect of getting artificially intelligent agents to work in practice even in fairly limited settings like car driving.
Go is a hard game, but it is a toy problem in a way that dealing with the real world isn’t. I am worried about economic incentives making it worth people’s while to keep throwing money and people and iterating on real actual systems that do intelligent things in the world. Even fairly limited things at first.
Go is a hard game, but it is a toy problem in a way that dealing with the real world isn’t.
What do you mean by this exactly? That the real world has combinatorics problems that are much wider, or that dealing with the real world does not reduce well to search in a tree of possible actions?
I think getting this working took a lot of effort and insight, and I don’t mean to discount this effort or insight at all. I couldn’t do what these guys did. But what I mean by “toy problem” is it avoids a lot of stuff about the physical world, hardware, laws, economics, etc. that happen when you try to build real things like cars, robots, or helicopters.
In other words, I think it’s great people figured out the ideal rocket equation. But somehow it will take a lot of elbow grease (that Elon Musk et al are trying to provide) to make this stuff practical for people who are not enormous space agencies.
I don’t think that’s a fair criticism on that point. As far as I understand, MIRI did conduct the biggest survey of AI experts that asked when those experts predict AGI to arrive:
A recent set of surveys of AI researchers produced the following median dates:
for human-level AI with 10% probability: 2022
for human-level AI with 50% probability: 2040
for human-level AI with 90% probability: 2075
When EY says that this news shows that we should put a significant amount of our probability mass before 2050 that doesn’t contradict expert opinions.
Sure, but it’s not just about what experts say on a survey about human level AI. It’s also about what info a good Go program has for this question, and whether MIRI’s program makes any sense (and whether it should take people’s money). People here didn’t say “oh experts said X, I am updating,” they said “EY said X on facebook, time for me to change my opinion.”
I don’t know your mind, you tell me? What exactly is it that you find worrying?
My possibly-incorrect guess is that you’re worried about something like “the community turning into an echo chamber that only promotes Eliezer’s views and makes its members totally ignore expert opinion when forming their views”. But if that was your worry, the presence of highly upvoted criticisms of Eliezer’s views should do a lot to help, since it shows that the community does still take into account (and even actively reward!) well-reasoned opinions that show dissent from the tribal leaders.
So since you still seem to be worried despite the presence of those comments, I’m assuming that your worry is something slightly different, but I’m not entirely sure of what.
One problem is that the community has few people actually engaged enough with cutting edge AI / machine learning / whatever-the-respectable-people-call-it-this-decade research to have opinions that are grounded in where the actual research is right now. So a lot of the discussion is going to consist of people either staying quiet or giving uninformed opinions to keep the conversation going. And what incentive structures there are here mostly work for a social club, so there aren’t really that many checks and balances that keep things from drifting further away from being grounded in actual reality instead of the local social reality.
Ilya actually is working with cutting edge machine learning, so I pay attention to his expressions of frustration and appreciate that he persists in hanging out here.
“EY said X on facebook, time for me to change my opinion.”
Who do you think said that in this case?
Just to be clear about your position, what do you think are reasonable values for human-level AI with 10% probability, human-level AI with 50% probability, and human-level AI with 90% probability?
I think the question in this thread is about how much the deep learning Go program should move my beliefs about this, whatever they may be. My answer is “very little in a sooner direction” (just because it is a successful example of getting a complex thing working). The question wasn’t “what are your belief about how far human level AI is” (mine are centered fairly far out).
I think this debate is quite hard with vague terms like “very little” and “far out”. I really do think it would be helpful for other people trying to understand your position if you put down your numbers for those predictions.
When EY says that this news shows that we should put a significant amount of our probability mass before 2050 that doesn’t contradict expert opinions.
The point is how much we should update our AI future timeline beliefs (and associated beliefs about whether it is appropriate to donate to MIRI and how much) based on the current news of DeepMind’s AlphaGo success.
There is a difference between “Gib moni plz because the experts say that there is a 10% probability of human-level AI within 2022” and “Gib moni plz because of AlphaGo”.
I understand IlyaShpitser to claim that there are people who update their AI future timeline beliefs in a way that isn’t appropriate because of EY statements. I don’t think that’s true.
I don’t have a source on this, but I remember an anecdote from Kurzweil that scientists who worked on early transistors were extremely skeptical about the future of the technology. They were so focused on solving specific technical problems that they didn’t see the big picture. Whereas an outsider could have just looked at the general trend and predicted a doubling every 18 months, and they would have been accurate for at least 50 years.
So that’s why I wouldn’t trust various ML experts like Ng that have said not to worry about AGI. No, the specific algorithms they work on are not anywhere near human level. But the general trend, and the proof that humans aren’t really that special, is concerning.
I’m not saying that you should just trust Yudkowsky or me instead. And expert opinion still has value. But maybe pick an expert that is more “big picture” focused? Perhaps Jürgen Schmidhuber, who has done a lot of notable work on deep learning and ML, but also has an interest in general intelligence and self improving AIs.
And I don’t have any specific prediction from him on when we will reach AGI. But he did say last year that he believes we will reach monkey level intelligence in 10 years. Which is quite a huge milestone.
Another candidate might be the group being discussed in this thread, Deepmind. They are focused on reaching general AI instead of just typical machine vision work. That’s why they have such a strong interest in game playing. I don’t have any specific predictions from them either, but I do get the impression they are very optimistic.
Whereas an outsider could have just looked at the general trend and predicted a doubling every 18 months, and they would have been accurate for at least 50 years.
I’m not buying this.
There are tons of cases where people look at the current trend and predict it will continue unabated into the future. Occasionally they turn out to be right, mostly they turn out to be wrong. In retrospect it’s easy to pick “winners”, but do you have any reason to believe it was more than a random stab in the dark which got lucky?
If you were trying to predict the future of flight in 1900, you’d do pretty terrible by surveying experts. You would do far better by taking a Kurzweil style approach where you put combustion engine performance on a chart and compared it to estimates of the power/weight ratios required for flight.
The point of that comment wasn’t to praise predicting with trends. It was to show an example where experts are sometimes overly pessimistic and not looking at the big picture.
When people say that current AI sucks, and progress is really hard, and they can’t imagine how it will scale to human level intelligence, I think it’s a similar thing. They are overly focused on current methods and their shortcomings and difficulties. They aren’t looking at the general trend that AI is rapidly making a lot of progress. Who knows what could be achieved in decades.
I’m not talking about specific extrapolations like Moore’s law, or even imagenet benchmarks—just the general sense of progress every year.
This claim doesn’t make much sense from the outset. Look at your specific example of transistors. In 1965, an electronics magazine wanted to figure out what would happen over time with electronics/transistors, so they called up an expert: the director of research of Fairchild Semiconductor. Gordon Moore (that director of research) proceeded to coin Moore’s law and tell them the doubling would continue for at least a decade, probably more. Moore wasn’t an outsider, he was an expert.
I never said that every engineer at every point in time was pessimistic. Just that many of them were at one time. And I said it was a second hand anecdote, so take that for what it’s worth.
You have to be more specific with the timeline. Transistors were invented in 1925 but received little interest due to many technical problems. It took three decades of research before the first commercial transistors were produced by Texas Instruments in 1954.
Gordon Moore formulated his eponymous law in 1965, while he was director of R&D at Fairchild Semiconductor, a company whose entire business consisted in the manufacture of transistors and integrated circuits. By that time, tens of thousands of transistor-based computers were in active commercial use.
It wouldn’t have made a lot of sense to predict any doublings for transistors in an integrated circuit before 1960, because I think that is when they were invented.
As I said, the ideal is to use expert opinion as a prior unless you have a lot of good info, or you think something is uniquely dysfunctional about an area (it’s rationalist folklore that a lot of areas are dysfunctional—“the world is mad”—but I think people are being silly about this). Experts really do know a lot.
You also need to figure out who are actual experts and what do they actually say. That’s a non-trivial task—reading reports on science in mainstream media will just stuff your head with nonsense.
It’s actually much worse than that, because huge breakthroughs themselves are what create new experts. So on the eve of huge breakthroughs, currently recognized experts invariably predict that the breakthrough is far away, simply because they can’t see the novel path towards the solution.
In this sense everyone who is currently an AI expert is, trivially, someone who has failed to create AGI. The only experts who have any sort of clear understanding of how far AGI is are either not currently recognized or do not yet exist.
Btw, I don’t consider myself an AI expert. I am not sure what “AI expertise” entails, I guess knowing a lot about lots of things that include stuff like stats/ML but also other things, including a ton of engineering. I think an “AI expert” is sort of like “an airplane expert.” Airplanes are too big for one person—you might be an expert on modeling fluids or an expert on jet engines, but not an expert on airplanes.
And the many-worlds interpretation of quantum mechanics. That is, all EY’s hobby horses. Though I don’t know how common these positions are among the unquiet spirits that haunt LessWrong.
As researchers we fight to make the machine slightly more intelligent, but they are still so stupid. I used to think we shouldn’t call the field artificial intelligence but artificial stupidity. Really, our machines are dumb, and we’re just trying to make them less dumb.
Now, because of these advances that people can see with demos, now we can say, “Oh, gosh, it can actually say things in English, it can understand the contents of an image.” Well, now we connect these things with all the science fiction we’ve seen and it’s like, “Oh, I’m afraid!”
How big a deal is this? What, if anything, does it signal about when we get smarter than human AI?
Eliezer thinks it’s a big deal.
Thanks. Key quote:
His argument proves too much.
You could easily transpose it for the time when Checkers or Chess programs beat professional players: back then the “keystone, foundational aspect” of intelligence was thought to be the ability to do combinatorial search in large solution spaces, and scaling up to AGI was “just” a matter of engineering better heuristics. Sure, it didn’t work on Go yet, but Go players were not using a different cortical algorithm than Chess players, were they?
Or you could transpose it for the time when MCTS Go programs reached “dan” (advanced amateur) level. They still couldn’t beat professional players, but professional players were not using a different cortical algorithm than advanced amateur players, were they?
AlphaGo succeded at the current achievement by using artificial neural networks in a regime where they are know to do well. But this regime, and the type of games like Go, Chess, Checkers, Othello, etc. represent a small part of the range of human cognitive tasks. In fact, we probably find this kind of board games fascinating precisely because they are very different than the usual cognitive stimuli we deal with in everyday life.
It’s tempting to assume that the “keystone, foundational aspect” of intelligence is learning essentially the same way that artificial neural networks learn. But humans can do things like “one-shot” learning, learning from weak supervision, learning in non-stationary environments, etc. which no current neural network can do, and not just because a matter of scale or architectural “details”. Researchers generally don’t know how to make neural networks, or really any other kind of machine learning algorithm, do these things, except with massive task-specific engineering. Thus I think it’s fair to say that we still don’t know what the foundational aspects of intelligence are.
In the brain, the same circuitry that is used to solve vision is used to solve most of the rest of cognition—vision is 10% of the cortex. Going from superhuman vision to superhuman Go suggests superhuman anything/everything is getting near.
The reason being that strong Go requires both deep slow inference over huge data/time (which DL excels in, similar to what the cortex/cerebellum specialize in), combined with fast/low data inference (the MCTS part here). There is still much room for improvement in generalizing beyond current MCTS techniques, and better integration into larger scale ANNs, but that is increasingly looking straightforward.
Yes, but only because “ANN” is enormously broad (tensor/linear algebra program space), and basically includes all possible routes to AGI (all possible approximations of bayesian inference).
Bayesian methods excel at one shot learning, and are steadily integrating themselves into ANN techniques (providing the foundation needed to derive new learning and inference rules). Progress in transfer and semi-supervised learning is also progressing rapidly and the theory is all there. I don’t know about non-stationary as much, but I’d be pretty surprised if there wasn’t progress there as well.
LOL. Generalized DL + MCTS is—rather obviously—a practical approximation of universal intelligence like AIXI. I doubt MCTS scales to all domains well enough, but the obvious next step is for DL to eat MCTS techniques (so that super new complex heuristic search techniques can be learned automatically).
And in a laptop the same circuitry that is used to run a spreadsheet is used to play a video game.
Systems that are Turing-complete (in the limit of infinite resources) tend to have an independence between hardware and possibly many layers of software (program running on VM running on VM running on VM and so on). Things that look similar at some levels may differ a lot at other levels, and thus things that look simple at some levels can have lots of hidden complexity at other levels.
Human-level (perhaps weakly superhuman) vision is achieved only in very specific tasks where large supervised datasets are available. This is not very surprising, since even traditional “hand-coded” computer vision could achieve superhuman performance in some narrow and clearly specified tasks.
Again, ANNs are Turing-complete, therefore in principle they include literally everything, but so does the brute-force search of C programs.
In practice if you try to generate C programs by brute-force search you will get stuck pretty fast, while ANNs with gradient descent training empirically work well on various kinds of practical problems, but not on all kinds of practical problems that humans are good at, and how to make them work on these problems, if it is even efficiently possible, is a whole open research field.
With lots of task-specific engineering.
So are things like AIXI-tl, Hutter-search, Gödel machine, and so on. Yet I would not consider any of them as the “foundational aspect” of intelligence.
Exactly, and this is a good analogy to illustrate my point. Discovering that the cortical circuitry is universal vs task-specific (like an ASIC) was a key discovery.
Note I didn’t say that we have solved vision to a superhuman level. But the claim that human-level vision is achieved only in tasks with large supervised datasets is simply not true. Current SOTA nets can achieve human-level performance in at least some domains using modest amounts of unsupervised data combined with small amounts of supervised data.
Human vision builds on enormous amounts of unsupervised data—much larger than ImageNet. Learning in the brain is complex and multi-objective, but perhaps best described as self-supervised (unsupervised meta-learning of sub-objective functions which then can be used for supervised learning).
A five year old will have experienced perhaps 50 million seconds worth of video data. Imagenet consists of 1 million images, which is vaguely equivalent to 1 million seconds of video if we include 30x amplification for small translations/rotations.
The brain’s vision system is about 100x larger than current ‘large’ vision ANNs. But if DeepMind decided to spend the cash on that and make it a huge one-off research priority, do you really doubt that they could build a superhuman general vision system that learns with a similar dataset and training duration?
The foundation of intelligence is just inference—simply because universal inference is sufficient to solve any other problem. AIXI is already simple, but you can make it even simpler by replacing the planning component with inference over high EV actions, or even just inference over program space to learn approx planning.
So it all boils down to efficient inference. The new exciting progress in DL—for me at least—is in understanding how successful empirical optimization techniques can be derived as approx inference update schemes with various types of priors. This is what I referred to as new and upcoming “Bayesian methods”—bayesian grounded DL.
“Enormously broad” is just another way of saying “not very useful”. We don’t even know in which sense (if any) the “deep networks” that are used in practice may be said to approximate Bayesian inference; the best we can do, AIUI, is make up a hand-wavy story about how they must be some “hierarchical” variation of single-layer networks, i.e. generalized linear models.
Specifically I meant approx bayesian inference over the tensor program space to learn the ANN, not that the ANN itself needs to implement bayesian inference (although they will naturally tend to learn that, as we see in all the evidence for various bayesian ops in the brain).
I agree. I don’t find this result to be any more or less indicative of near-term AI than Google’s success on ImageNet in 2012. The algorithm learns to map positions to moves and values using CNNs, just as CNNs can be used to learn mappings from images to 350 classes of dog breeds and more. It turns out that Go really is a game about pattern recognition and that with a lot of data you can replicate the pattern detection for good moves in very supervised ways (one could call their reinforcement learning actually supervised because the nature of the problem gives you credit assignment for free).
I think what this result says is thus: “Any tasks humans can do, an AI can now learn to do better, given a sufficient source of training data.”
Games lend themselves to auto-generation of training data, in the sense that the AI can at the very least play against itself. No matter how complex the game, a deep neural net will find the structure in it, and find a deeper structure than human players can find.
We have now answered the question of, “Are deep neural nets going to be sufficient to match or exceed task-specific human performance at any well-specified task?” with “Yes, they can, and they can do it better and faster than we suspected.” The next hurdle—which all the major companies are working on—is to create architectures that can find structure in smaller datasets, less well-tailored training data, and less well-specified tasks.
I don’t think it says anything like that.
I included the word “sufficient” as an ass-covering move, because one facet of the problem is we don’t really know what will serve as a “sufficient” amount of training data in what context.
But, what specific types of tasks do you think machines still can’t do, given sufficient training data? If your answer is something like “physics research,” my rejoinder would be that if you could generate training data for that job, a machine could do it.
Grand pronouncements with an ass-covering move look silly :-)
One obvious problem is that you are assuming stability. Consider modeling something that changes (in complex ways) with time—like the economy of the United States. Is “training data” from the 1950s relevant to the current situation?
Generally speaking, the speed at which your “training data” gets stale puts an upper limit on the relevant data that you can possibly have and that, in turn, puts an upper limit on the complexity of the model (NNs included) that you can build on its basis.
I don’t see how we know, or anything like know, that deep NNs with ‘sufficient training data’ would be sufficient for all problems. We’ve seen them be sufficient for many different problems and can expect them to be sufficient for many more, but all?
Yes, but that would likely require an extremely large amount of training data, because to prepare actions for many kinds of situations you’d have an exponential blow-up to cover many combinations of many possibilities, and hence the model would need to be huge as well. It also would require high-quality data sets with simple correction signals in order to work, which are expensive to produce.
I think, above all for building a real-time AI you need reuse of concepts so that abstractions can be recombined and adapted to new situations; and for concept-based predictions (reasoning) you need one-shot learning so that trains of thought can be memorized and built upon. In addition, the entire network needs to learn somehow to determine which parts of the network in the past were responsible for current reward signals which are delayed and noisy. If there is a simple and fast solution to this, then AGI could be right around the corner. If not, it could take several decades of research.
This is a well-known problem, called reinforcement learning. It is a significant component in the reported results. (What happens in practice is that a network’s ability to assign “credit” or “blame” for reward signals falls off exponentially with increasing delay. This is a significant limitation, but reinforcement learning is nevertheless very helpful given tight feedback loops.)
Yes, but as I wrote above, the problems of credit assignment, reward delay and noise are non-existent in this setting, and hence their work does not contribute at all to solving AI.
Credit assignment and reward delay are nonexistent? What do you think happens when one diffs the board strength of two potential boards?
Reward delay is not very significant in this task, since the task is episodic and fully observable, and there is no time preference, thus you can just play a game to completion without updating and then assign the final reward to all the positions.
In more general reinforcement learning settings, where you want to update your policy during the execution, you have to use some kind of temporal difference learning method, which is further complicated if the world states are not fully observable.
Credit assignment is taken care of by backpropagation, as usual in neural networks. I don’t know why RaelwayScot brought it up, unless they meant something else.
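To make the episodic-versus-temporal-difference distinction above concrete, here is a minimal sketch assuming a simple tabular value function; the function names and the `values` table are illustrative stand-ins, not anything from the paper:

```python
# Toy illustration of the two update styles discussed above (not DeepMind's code).
# `values` maps a hashable board position to its current value estimate.

def monte_carlo_update(values, episode_positions, final_reward, lr=0.1):
    """Episodic update: play the game to completion, then nudge every visited
    position toward the final outcome, as described for the Go setting."""
    for pos in episode_positions:
        v = values.get(pos, 0.0)
        values[pos] = v + lr * (final_reward - v)

def td0_update(values, pos, reward, next_pos, gamma=1.0, lr=0.1):
    """TD(0)-style bootstrap: update a position toward the immediate reward plus
    the current estimate of its successor, which works mid-episode."""
    v = values.get(pos, 0.0)
    target = reward + gamma * values.get(next_pos, 0.0)
    values[pos] = v + lr * (target - v)
```

With the episodic update there is nothing to bootstrap from, which is why the fully observable, short-episode Go setting sidesteps most of the temporal-difference machinery.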
I meant that for AI we will possibly require high-level credit assignment, e.g. experiences of regret like “I should be more careful in these kinds of situations”, or the realization that one particular strategy out of the entire sequence of moves worked out really nicely. Instead it penalizes/reinforces all moves of one game equally, which is potentially a much slower learning process. It turns out playing Go can be solved without much structure for the credit assignment processes, hence I said the problem is non-existent, i.e. there wasn’t even a need to consider it and further our understanding of RL techniques.
Agreed, with the caveat that this is a stochastic object, and thus not a fully simple problem. (Even if I knew all possible branches of the game tree that originated in a particular state, I would need to know how likely any of those branches are to be realized in order to determine the current value of that state.)
Well, the value of a state is defined assuming that the optimal policy is used for all the following actions. For tabular RL you can actually prove that the updates converge to the optimal value function/policy function (under some conditions). If NN are used you don’t have any convergence guarantees, but in practice the people at DeepMind are able to make it work, and this particular scenario (perfect observability, determinism and short episodes) is simpler than, for instance that of the Atari DQN agent.
“Nonexistent problems” was meant as a hyperbole to say that they weren’t solved in interesting ways and are extremely simple in this setting because the states and rewards are noise-free. I am not sure what you mean by the second question. They just apply gradient descent on the entire history of moves of the current game such that expected reward is maximized.
It seems to me that the problem of value assignment to boards—”What’s the edge for W or B if the game state looks like this?” is basically a solution to that problem, since it gives you the counterfactual information you need (how much would placing a stone here improve my edge?) to answer those questions.
I agree that it’s a much simpler problem here than it is in a more complicated world, but I don’t think it’s trivial.
Man, I wouldn’t bother. EY has spoken, we are done here.
There are other big deals. The MS ImageNet win also contained frightening progress on the training meta level.
-- extracted from the very readable summary at Wired: http://www.wired.com/2016/01/microsoft-neural-net-shows-deep-learning-can-get-way-deeper/
Going by that description, it is much much less important than residual learning, because hyperparameter optimization is not new. There are a lot of approaches: grid search, random search, Gaussian processes. Some hyperparameter optimizations baked into MSR’s deep learning framework would save some researcher time and effort, certainly, but I don’t know that it would’ve made any big difference unless they have something quite unusual going on.
(I liked one paper which took a Bayesian multi-armed bandit approach and treated error curves as partial information about final performance; it would switch between different networks being trained based on performance, regularly ‘freezing’ and ‘thawing’ networks as the probability of each network becoming the best performer changed with information from additional mini-batches/epochs.) Probably the single coolest one: last year some researchers showed that it is possible to somewhat efficiently backpropagate on hyperparameters! So hyperparameters just become more parameters to learn, and you can load up on all sorts of stuff without worrying about making your hyperparameter optimization futile or having to train a billion times. That would both save people a lot of time (for vanilla networks) and allow exploring extremely complicated, heavily parameterized families of architectures, which would be a big deal. Unfortunately, it’s still not efficient enough for the giant networks we want to train. :(
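For reference, the baseline that grid search, Gaussian-process methods, and the bandit approach above all compete against is plain random search, which fits in a few lines; the search space and the `train_and_evaluate` callback here are hypothetical placeholders, not anything from MSR’s framework:

```python
import random

def random_search(train_and_evaluate, n_trials=20, seed=0):
    """Sample hyperparameter settings at random and keep the best one.
    `train_and_evaluate(params) -> validation_score` is assumed to be supplied
    by the caller; higher scores are taken to be better."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform sampling
            "dropout": rng.uniform(0.0, 0.7),
            "depth": rng.randint(10, 150),
        }
        score = train_and_evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```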
The key point is that machine learning starts to happen at the hyper-parameter level. Which is one more step toward systems that optimize themselves.
A step which was taken a long time ago and does not seem to have played much of a role in recent developments; for the most part, people don’t bother with extensive hyperparameter tuning. Better initialization, better algorithms like dropout or residual learning, better architectures, but not hyperparameters.
It’s a big deal for Go, but I don’t think it’s a very big deal for AGI.
Conceptually, Go is like Chess or Checkers: they are all fully deterministic, perfect-information two-player games.
Go is more challenging for computers because the search space (and in particular the average branching factor) is larger and known position evaluation heuristics are not as good, so traditional alpha-beta minimax search becomes infeasible.
The first big innovation, already put into use by most Go programs for a decade (although the idea is older) was Monte Carlo tree search, which addresses the high branching factor issue: while traditional search either does not expand a node or expands it and recursively evaluates all its children, MCTS stochastically evaluates nodes with a probability that depends on how promising they look, according to some heuristic.
DeepMind’s innovation consists in using a NN to learn a good position evaluation heuristic in a supervised fashion from a large database of professional games, refining it with reinforcement learning in “greedy” self-play mode and then using both the refined heuristic and the supervised heuristic in an MCTS engine.
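As a rough sketch of how a learned policy and value estimate plug into tree search, here is a PUCT-style selection score of the general kind used for this purpose; the constant and the bookkeeping are illustrative assumptions rather than the published AlphaGo details:

```python
import math

def puct_score(q_value, prior, child_visits, parent_visits, c_puct=1.5):
    """Combine exploitation (the backed-up value estimate q_value) with an
    exploration bonus proportional to the policy network's prior, shrinking
    as the child accumulates visits."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration

def select_move(children):
    """`children`: list of dicts with keys 'q', 'prior' and 'visits'.
    Return the child node maximizing the PUCT-style score."""
    parent_visits = sum(c["visits"] for c in children) + 1
    return max(children, key=lambda c: puct_score(c["q"], c["prior"],
                                                  c["visits"], parent_visits))
```

Roughly, the value network’s output enters through the backed-up `q` statistics, while the policy network supplies the `prior` used to focus the search.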
Their approach essentially relies on big data and big hardware. From an engineering point of view, it is a major advancement of neural network technology because of the sheer scale and in particular the speed of the thing, which required significant non-trivial parallelization, but the core techniques aren’t particularly new and I doubt that they can scale well to more general domains with non-determinism and partial observability. However, neural networks may be more robust to noise and certain kinds of disturbances than hand-coded heuristics, so take this with a grain of salt.
So, to the extent that AGI will rely on large and fast neural networks, this work is a significant step towards practical AGI engineering, but to the extent that AGI will rely on some “master algorithm” this work is probably not a very big step towards the discovery of such algorithm, at least compared to previously known techniques.
I think it is a bigger deal than chess because it doesn’t use brute force but mostly unsupervised learning. It is not the breakthrough in AGI, but it is telling that this approach thoroughly beats all the other Go algorithms (1 out of 500 games lost, even with a 4-stone handicap), and they say that it still improves by training.
I wouldn’t say that it’s “mostly unsupervised” since a crucial part of their training is done in a traditional supervised fashion on a database of games by professional players.
But it’s certainly much more automated than having a hand-coded heuristic.
Humans also learn extensively by studying the games of experts. In Japan/China, even fans follow games from newspapers.
A game might take an hour on average. So a pro with 10 years of experience may have played/watched upwards of 10,000 games. However, it takes much less time to read a game that has already been played—so a 10 year pro may be familiar with say 100,000 games. Considering that each game has 200+ moves, that roughly is a training set of order 2 to 20 million positions.
AlphaGo’s training set consisted of 160,000 games with 29 million positions, so the upper end estimate for humans is similar. More importantly, the human training set is far more carefully curated and thus of higher quality.
That’s 27.4 games a day, on average. I think this is an overestimate.
It was my upper bound estimate, and if anything it was too low.
A pro will grow up in a dedicated go school where there are hundreds of other players just playing go and studying go all day. Some students will be playing speed games, and some will be flipping through summaries of historical games in books/magazines and or on the web.
When not playing, people will tend to walk around and spectate the other games (nowadays this is also trivial to do online). An experienced player can reconstruct some of the move history by just glancing at the board.
So if anything, 27.4 games watched/skimmed/experienced per day is too low for the upper estimate.
An East Asian Go pro will often have been an insei and been studying Go full-time at a school, and a dedicated amateur before that, so you can imagine how many hours a day they will be studying… (The intensiveness is part of why they dominate Go to the degree they do and North American & Europeans are so much weaker: start a lot of kids, start them young, school them 10 hours a day for years studying games and playing against each other and pros, and keep relentlessly filtering to winnow out anyone who is not brilliant.)
I would say 100k is an overestimate since they will tend to be more closely studying the games and commentaries and also working out life-and-death problems, memorizing the standard openings, and whatnot, but they are definitely reading through and studying tens of thousands of games—similar to how one of the reasons chess players are so much better these days than even just decades ago is that computers have given access to enormous databases of games which can be studied with the help of chess AIs (Carlsen has benefited a lot from this, I understand). Also, while I’m nitpicking, AlphaGo trained on both the KGS and then self-play; I don’t know how many games the self-play amounted to, but the appendix broke down the wallclock times by phase, and of the 4 weeks of wallclock time, IIRC most of it was spent on the self-play finetuning the value function.
But if AlphaGo is learning from games ‘only’ more efficiently than 99%+ of the humans who play Go (Fan Hui was ranked in the 600s, there’s maybe 1000-2000 people who earn a living as Go professionals, selected from the hundreds of thousands/millions of people who play), that doesn’t strike me as much of a slur.
For the SL phase, they trained 340 million updates with a batch size of 16, so 5.4 billion position-updates. However the database had only 29 million unique positions. That’s about 200 gradient iterations per unique position.
The self-play RL phase for AlphaGo consisted of 10,000 minibatches of 128 games each, so about 1 million games total. They only trained that part for a day.
They spent more time training the value network: 50 million minibatches of 32 board positions, so about 1.6 billion positions. That’s still much smaller than the SL training phase.
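A quick back-of-envelope check of those figures, using only the numbers quoted above (treat the outputs as rough orders of magnitude):

```python
# Supervised (SL) phase: position-updates and passes per unique position.
sl_updates, sl_batch, unique_positions = 340e6, 16, 29e6
print(sl_updates * sl_batch)                     # ~5.4e9 position-updates
print(sl_updates * sl_batch / unique_positions)  # ~190 updates per unique position

# Self-play RL phase: total games.
print(10_000 * 128)                              # ~1.3e6 self-play games

# Value-network phase: total board positions seen.
print(50e6 * 32)                                 # 1.6e9 positions
```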
The supervised part is only in the bootstrapping. The main learning happens in the self-play part.
Cite? They use the supervised network for policy selection (i.e. tree pruning) which is a critical part of the system.
I’m referring to figure 1a on page 4 and the explanation below. I can’t be sure but the self-play should be contributing a large part to the training and can go on and improve the algorithm even if the expert database stays fixed.
They spent three weeks to train the supervised policy and one day to train the reinforcement learning policy starting from the supervised policy, plus an additional week to extract the value function from the reinforcement learning policy (pages 25-26).
In the final system the only part that depends on RL is the value function. According to figure 4, if the value function is taken out the system still plays better than any other Go program, though worse than the human champion.
Therefore I would say that the system heavily depends on supervised training on a human-generated dataset. RL was needed to achieve the final performance, but it was not the most important ingredient.
It shows that Monte-Carlo tree search meshes remarkably well with neural-network-driven evaluation (“value networks”) and decision pruning/policy selection (“policy networks”). This means that if you have a planning task to which MCTS can be usefully applied, and sufficient data to train networks for state-evaluation and policy selection, and substantial computation power (a distributed cluster, in AlphaGo’s case), you can significantly improve performance on your task (from “strong amateur” to “human champion” level). It’s not an AGI-complete result however, any more than Deep-Blue or TD-gammon were AGI-complete.
The “training data” factor is a biggie; we lack this kind of data entirely for things like automated theorem proving, which would otherwise be quite amenable to this ‘planning search + complex learned heuristics’ approach. In particular, writing provably-correct computer code is a minor variation on automated theorem proving. (Neural networks can already write incorrect code, but this is not good enough if you want a provably Friendly AGI.)
Humans need extensive training to become competent, as will AGI, and this should have been obvious for anyone with a good understanding of ML.
The interesting thing about that code-writing RNN you linked is that it shouldn’t work at all. It was just given text files of code and told to predict the next character. It wasn’t taught how to program, it never got to see an interpreter, it doesn’t know any English yet has to work with English variable names, and it only has a few hundred neurons to represent its entire knowledge state.
The fact that it is even able to produce legible code is amazing, and suggests that we might not be that far off from NNs that can write actually usable code. Still some ways away, but not multiple decades.
Somewhat. Look at what happens when you generate code from a simple character-level Markov language model (that’s just a look up table that gives the probability of the next character conditioned on the last n characters, estimated by frequency counts on the training corpus).
An order-20 language model generates fairly legible code, with sensible use of keywords, identifier names and even comments. The main difference with the RNN language model is that the RNN learns to do proper indentation and bracket matching, while the Markov model can’t do it except at short range.
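For concreteness, an order-n character model of the kind described above can be built from nothing but frequency counts; this is a generic sketch, not the code behind the comparison being discussed:

```python
import random
from collections import Counter, defaultdict

def train_char_markov(text, order=5):
    """For every length-`order` context in the corpus, count which characters follow it."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts

def generate(counts, seed_context, length=200, rng=None):
    """Sample forward from the count table, falling back to a space for unseen contexts."""
    rng = rng or random.Random(0)
    order = len(seed_context)
    out = list(seed_context)
    for _ in range(length):
        options = counts.get("".join(out[-order:]))
        if not options:
            out.append(" ")
            continue
        chars, weights = zip(*options.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)
```

Raising `order` toward 20 makes the table mostly memorize long stretches of the training corpus, which is the overfitting point raised a few comments below.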
While, as remarked by Yoav Goldberg, it is impressive that the RNN could learn to do this, learning to match brackets and indent blocks seems very far from learning to write correct and purposeful code.
Anyway, this code generation example is pretty much of a stunt, not a very interesting task. If you gave the Linux kernel source code to a human who has never programmed and doesn’t speak English and asked them to write something that looks like it, I doubt that they would be able to do much better.
Better examples of code generation using NNs (actually, log-bilinear models) or Bayesian models exist (ref, ref). In these works syntactic correctness is already guaranteed and the ML model only focuses on semantics.
The difference with Markov models is they tend to overfit at that level. At 20 characters deep, you are just copy and pasting large sections of existing code and language. Not generating entirely unseen samples. You can do a similar thing with RNNs, by training them only on one document. They will be able to reproduce that document exactly, but nothing else.
To properly compare with a Markov model, you’d need to first tune it so it doesn’t overfit. That is, when it’s looking at an entirely unseen document, its guess of what the next character should be is most likely to be correct. The best setting for that is probably only 3-5 characters, not 20. And when you generate from that, the output will be much less legible. (And even that’s kind of cheating, since Markov models can’t give any prediction for sequences they’ve never seen before.)
Generating samples is just a way to see what patterns the RNN has learned. And while it’s far from perfect, it’s still pretty impressive. It’s learned a lot about syntax, a lot about variable names, a lot about common programming idioms, and it’s even learned some English from just code comments.
In NLP applications where Markov language models are used, such as speech recognition and machine translation, the typical setting is 3 to 5 words. 20 characters correspond to about 4 English words, which is in this range.
Anyway, I agree that in this case the order-20 Markov model seems to overfit (Googling some lines from the snippets in the post often locates them in an original source file, which doesn’t happen as often with the RNN snippets). This may be due to the lack of regularization (“smoothing”) in the probability estimation and the relatively small size of the training corpus: 474 MB versus the >10 GB corpora which are typically used in NLP applications. Neural networks need lots of data, but still less than plain look-up tables.
This is a big deal, and it is another sign that AGI is near.
Intelligence boils down to inference. Go is an interesting case because good play for both humans and bots like AlphaGo requires two specialized types of inference operating over very different timescales:
rapid combinatoric inference over move sequences during a game (planning). AlphaGo uses MCTS for this, whereas the human brain uses a complex network of modules involving the basal ganglia, hippocampus, and PFC.
slow deep inference over a huge amount of experience to develop strong pattern recognition and intuitions (deep learning). AlphaGo uses deep supervised and reinforcement learning via SGD over a CNN for this. The human brain uses the cortex.
Machines have been strong in planning/search style inference for a while. It is only recently that the slower learning component (2nd order inference over circuit/program structure) is starting to approach and surpass human level.
Critics like to point out that DL requires tons of data, but so does the human brain. A more accurate comparison requires quantifying the dataset human pro go players train on.
A 30 year old Asian pro will have perhaps 40,000 hours of playing experience (20 years × 50 weeks/year × 40 hrs/week). The average game duration is perhaps an hour and consists of 200 moves. In addition, pros (and even fans) study published games. Reading a game takes less time, perhaps as little as 5 minutes or so.
So we can estimate very roughly that a top pro will have absorbed between 100,000 and 1 million games, and between 20 and 200 million individual positions (around 200 moves per game).
AlphaGo was trained on the KGS dataset: 160,000 games and 29 million positions. So it did not train on significantly more data than a human pro. The data quantities are actually very similar.
Furthermore, the human pro’s dataset is perhaps of better quality, as they will be familiar with mainly pro-level games, whereas the AlphaGo dataset is mostly amateur level.
The main difference is speed. The human brain’s ‘clockrate’ or equivalent is about 100 hz, whereas AlphaGo’s various CNNs can run at roughly 1000hz during training on a single machine, and perhaps 10,000 hz equivalent distributed across hundreds of machines. 40,000 hours—a lifetime of experience—can be compressed 100x or more into just a couple of weeks for a machine. This is the key lesson here.
The classification CNN trained on KGS was run for 340 million steps, which is about 10 iterations per unique position in the database.
The ANNs that AlphaGo uses are much much smaller than a human brain, but the brain has to do a huge number of other tasks, and also has to solve complex vision and motor problems just to play the game. AlphaGo’s ANNs get to focus purely on Go.
A few hundred TitanX’s can muster up perhaps a petaflop of compute. The high end estimate of the brain is 10 petaflops (100 trillion synapses × 100 hz max firing rate). The more realistic estimate is 100 teraflops (100 trillion synapses × 1 hz avg firing rate), and the lower end is 1/10 of that or less.
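Writing out the back-of-envelope estimates from the last few paragraphs (all inputs are the rough figures asserted above, so the outputs inherit their wide error bars):

```python
# Human pro: experience and rough training-set size.
hours_played = 20 * 50 * 40            # 20 years x 50 weeks x 40 hrs/week = 40,000 hours
games_low, games_high = 100_000, 1_000_000
positions_low, positions_high = games_low * 200, games_high * 200   # ~20M to ~200M positions

# Brain compute, from the synapse-count x firing-rate figures above.
synapses = 100e12
peak_flops = synapses * 100            # ~1e16  (10 petaflops, 100 hz max firing)
realistic_flops = synapses * 1         # ~1e14  (100 teraflops, 1 hz average firing)

print(hours_played, positions_low, positions_high, peak_flops, realistic_flops)
```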
So why is this a big deal? Because it suggests that training a DL AI to master more economically key tasks, such as becoming an expert level programmer, could be much closer than people think.
The techniques used here are nowhere near their optimal form yet in terms of efficiency. When Deep Blue beat Kasparov in 1997, it required a specialized supercomputer and a huge team. 10 years later chess bots written by individual programmers running on modest PCs soared past Deep Blue—thanks to more efficient algorithms and implementations.
I asked a pro player I know whether these numbers sounded reasonable. He replied:
Both deep networks and the human brain require lots of data, but the kind of data they require is not the same. Humans engage mostly in semi-supervised learning, where supervised data comprises a small fraction of the total. They also manage feats of “one-shot learning” (making critically-important generalizations from single datapoints) that are simply not feasible for neural networks or indeed other ‘machine learning’ methods.
Could you elaborate? I think this number is too high by roughly one order of magnitude.
Estimating the computational capability of the human brain is very difficult. Among other things, we don’t know what the neuroglia cells may be up to, and these are just as numerous as neurons.
This is probably a misconception for several reasons. Firstly, given that we don’t fully understand the learning mechanisms in the brain yet, it’s unlikely that it’s mostly one thing. Secondly, we have some pretty good evidence for reinforcement learning in the cortex, hippocampus, and basal ganglia. We have evidence for internally supervised learning in the cerebellum, and unsupervised learning in the cortex.
The point being: these labels aren’t all that useful. Efficient learning is multi-objective and doesn’t cleanly divide into these narrow categories.
The best current guess for questions like this is almost always to guess that the brain’s solution is highly efficient, given its constraints.
In the situation where a go player experiences/watches a game between two other players far above one’s own current skill, the optimal learning update is probably going to be a SL style update. Even if you can’t understand the reasons behind the moves yet, it’s best to compress them into the cortex for later. If you can do a local search to understand why the move is good, then that is even better and it becomes more like RL, but again, these hard divisions are arbitrary and limiting.
The GTX TitanX has a peak perf of 6.1 teraflops, so you’d need only a few hundred to get a petaflop supercomputer (more specifically, around 175).
It’s just a circuit, and it obeys the same physical laws. We have this urge to mystify it for various reasons. Neuroglia can not possibly contribute more to the total compute power than the neurons, based on simple physics/energy arguments. It’s another stupid red herring like quantum woo.
These estimates are only validated when you can use them to make predictions. And if you have the right estimates (brain equivalent to 100 terraflops ish, give or take an order of magnitude), you can roughly predict the outcome of many comparisons between brain circuits vs equivalent ANN circuits (more accurately than using the wrong estimates).
We don’t understand the learning mechanisms yet, but we’re quite familiar with the data they use as input. “Internally” supervised learning is just another term for semi-supervised learning anyway. Semi-supervised learning is plenty flexible enough to encompass the “multi-objective” features of what occurs in the brain.
Raw and “peak performance” FLOPS numbers should be taken with a grain of salt. Anyway, given that a TitanX apparently draws as much as 240W of power at full load, your “petaflop-scale supercomputer” will cost you a few hundred-thousand dollars and draw 42kW to do what the brain does within 20W or so. Not a very sensible use for that amount of computing power—except for the odd publicity stunt, I suppose. Like playing Go.
Of course. Neuroglia are not magic or “woo”. They’re physical things, much like silicon chips and neurons.
Yeah, but in this case the best convolution and gemm codes can reach like 98% efficiency for the simple standard algorithms and dense input—which is what most ANNs use for about everything.
Well, in this case of Go and for an increasing number of domains, it can do far more than any brain—learns far faster. Also, the current implementations are very very far from optimal form. There is at least another 100x to 1000x easy perf improvement in the years ahead. So what 100 gpus can do now will be accomplished by a single GPU in just a year or two.
Right, and they use a small fraction of the energy budget, and thus can’t contribute much to the computational power.
This might actually be the most interesting thing about AlphaGo. Domain experts who have looked at its games have marveled most at how truly “book-smart” it is. Even though it has not shown a lot of creativity or surprising moves (indeed, it was comparatively weak at the start of Game 1), it has fully internalized its training and can always come up with the “standard” play.
Not necessarily—there might be a speed vs. energy-per-op tradeoff, where neurons specialize in quick but energy-intensive computation, while neuroglia just chug along in the background. We definitely see such a tradeoff in silicon devices.
Do you have links to such analyses? I’d be interested in reading them.
EDIT: Ah, I guess you were referring to this: https://www.reddit.com/r/MachineLearning/comments/43fl90/synopsis_of_top_go_professionals_analysis_of/
Yudkowsky seems to think it is significant …
https://news.ycombinator.com/item?id=10983539
“It is difficult to get a man to understand something, when his salary depends on his not understanding it.”
Doesn’t a similar criticism apply to ML researchers who claim not to fear AI? (i.e. it would be inconvenient for them if it became widely thought that ML research was dangerous).
Not really. I mean, yes, in principle. But in practice, EY relies on people having UFAI as a “live issue” to keep MIRI going. ML researchers are not worried about funding cuts due to UFAI fears. They are worried about Congress being dysfunctional, etc. My personal funding situation will be affected by any way this argument plays out not at all.
I wouldn’t worry.
If, despite lots of effort, we couldn’t create a program that could beat any human in go, wouldn’t this be evidence that we were far away from creating smarter-than-human AI?
Are you asking me if I know what the law of iterated expectations is? I do.
Do you never worry?
No, I just remember my AI history (TD gammon, etc.) The question you should be asking is: “is there any evidence that will result in EY ceasing to urgently ask for your money?”
Does it sway you at all that EY points at self-driving cars and says “these could be taken as a sign as well, but they’re not”?
I actually think self-driving cars are more interesting than strong go playing programs (but they don’t worry me much either).
I guess I am not sure why I should pay attention to EY’s opinion on this. I do ML-type stuff for a living. Does EY have an unusual track record for predicting anything? All I see is a long tail of vaguely silly things he says online that he later renounces (e.g. “ignore stuff EY_2004 said”). To be clear: moving away from bad opinions is great! That is not what the issue is.
edit: In general I think LW really really doesn’t listen to experts enough (I don’t even mean myself, I just mean the sensible Bayesian thing to do is to just go with expert opinion prior on almost everything.) EY et al. take great pains to try to move people away from that behavior, talking about how the world is mad, about civilizational inadequacy, etc. In other words, don’t trust experts, they are crazy anyways.
I’m not going to argue that you should pay attention to EY. His arguments convince me, but if they don’t convince you, I’m not gonna do any better.
What I’m trying to get at is, when you ask “is there any evidence that will result in EY ceasing to urgently ask for your money?”… I mean, I’m sure there is such evidence, but I don’t wish to speak for him. But it feels to me that by asking that question, you possibly also think of EY as the sort of person who says: “this is evidence that AI risk is near! And this is evidence that AI risk is near! Everything is evidence that AI risk is near!” And I’m pointing out that no, that’s not how he acts.
While we’re at it, this exchange between us seems relevant. (“Eliezer has said that security mindset is similar, but not identical, to the mindset needed for AI design.” “Well, what a relief!”) You seem surprised, and I’m not sure what about it was surprising to you, but I don’t think you should have been surprised.
Basically, even if you’re right that he’s wrong, I feel like you’re wrong about how he’s wrong. You seem to have a model of him which is very different from my model of him.
(Btw, his opinion seems to be that AlphaGo’s methods are what makes it more of a leap than a self-driving car or than Deep Blue, not the results. Not sure that affects your position.)
In particular, he apparently mentioned Go play as an indicator before (and, like many other people, assumed it was somewhat more distant) and is now following up on this threshold. What else would you expect? That he not name a limited number of relevant events? (I assume that the number is limited; I didn’t know of this specific one before.)
I think you misunderstood me (but that’s my fault for being opaque, cadence is hard to convey in text). I was being sarcastic. In other words, I don’t need EY’s opinion, I can just look at the problem myself (as you guys say “argument screens authority.”)
Look, I met EY and chatted with him. I don’t think EY is “evil,” exactly, in a way that L. Ron Hubbard was. I think he mostly believes his line (but humans are great at self-deception). I think he’s a flawed person, like everyone else. It’s just that he has an enormous influence on the rationalist community that immensely magnify the damage his normal human flaws and biases can do.
I always said that the way to repair human frailty issues is to treat rationality as a job (rather than a social club), and fellow rationalists as coworkers (rather than tribe members). I also think MIRI should stop hitting people up for money and get a normal funding stream going. You know, let their ideas of how to avoid UFAI compete in the normal marketplace of ideas.
Currently MIRI gets their funding by 1) donations 2) grants. Isn’t that exactly what the normal funding stream for non-profits is?
Sure. Scientology probably has non-profits, too. I am not saying MIRI is anything like Scientology, merely that it isn’t enough to just determine legal status and call it a day, we have to look at the type of thing the non-profit is.
MIRI is a research group. They call themselves an institute, but they aren’t, really. Institutes are large. They are working on some neat theory stuff (from what Benja/EY explained to me) somewhat outside the mainstream. Which is great! They have some grant funding, actually, last I checked. Which is also great!
They are probably not yet financially secure to stop asking for money, which is also ok.
I think all I am saying is, in my view the success condition is they “achieve orbit” and stop asking, because basically what they are working on is considered sufficiently useful research that they can operate like a regular research group. If they never stop asking I think that’s a bit weird, because either their direction isn’t perceived as good and they can’t get enough funding bandwidth without donations, or they do have enough bandwidth but want more revenue anyways, which I personally would find super weird and unsavory.
Who is? Last I checked, Harvard was still asking alums for donations, which suggests to me that asking is driven by getting money more than it’s driven by needing money.
I think comparing Harvard to a research group is a type error, though. Research groups don’t typically do this. I am not going to defend Unis shaking alums down for money, especially given what they do with it.
I know several research groups where the PI’s sole role is fundraising, despite them having much more funding than the average research group.
My point was more generic—it’s not obvious to me why you would expect groups to think “okay, we have enough resources, let’s stop trying to acquire more” instead of “okay, we have enough resources to take our ambitions to the next stage.” The American Cancer Society has about a billion dollar budget, and yet they aren’t saying “yeah, this is enough to deal with cancer, we don’t need your money.”
(It may be the case that a particular professor stops writing grant applications, because they’re limited by attention they can give to their graduate students. But it’s not like any of those professors will say “yeah, my field is big enough, we don’t need any more professor slots for my students to take.”)
In my experience, research groups exist inside universities or a few corporations like Google. The senior members are employed and paid for by the institution, and only the postgrads, postdocs, and equipment beyond basic infrastructure are funded by research grants. None of them fly “in orbit” by themselves but only as part of a larger entity. Where should an independent research group like MIRI seek permanent funding?
By “in orbit” I mean “funded by grants rather than charity.” If a group has a steady grant research stream, that means they are doing good enough work that funding agencies continue to give them money. This is the standard way to be self-sustaining for a research group.
What would worry you that strong AI is near?
This is a good question. I think what would worry me is lots of funding incentive to build integrated systems (like self-driving cars, but for other domains) and enough of a talent pipeline to start making that stuff happen and create incremental improvements. People in general underestimate the systems engineering aspect of getting artificially intelligent agents to work in practice even in fairly limited settings like car driving.
Go is a hard game, but it is a toy problem in a way that dealing with the real world isn’t. I am worried about economic incentives making it worth people’s while to keep throwing money and people and iterating on real actual systems that do intelligent things in the world. Even fairly limited things at first.
What do you mean by this exactly? That the real world has combinatorics problems that are much wider, or that dealing with the real world does not reduce well to search in a tree of possible actions?
I think getting this working took a lot of effort and insight, and I don’t mean to discount this effort or insight at all. I couldn’t do what these guys did. But what I mean by “toy problem” is it avoids a lot of stuff about the physical world, hardware, laws, economics, etc. that happen when you try to build real things like cars, robots, or helicopters.
In other words, I think it’s great people figured out the ideal rocket equation. But somehow it will take a lot of elbow grease (that Elon Musk et al are trying to provide) to make this stuff practical for people who are not enormous space agencies.
I don’t think that fair criticism on that point. As far as I understand MIRI did make the biggest survey of AI experts that asked when those experts predict AGI to arrive:
When EY says that this news shows that we should put a significant amount of our probability mass before 2050 that doesn’t contradict expert opinions.
Sure, but it’s not just about what experts say on a survey about human level AI. It’s also about what info a good Go program has for this question, and whether MIRI’s program makes any sense (and whether it should take people’s money). People here didn’t say “oh experts said X, I am updating,” they said “EY said X on facebook, time for me to change my opinion.”
My reaction was more “oh, EY made a good argument about why this is a big deal, so I’ll take that argument into account”.
Presumably a lot of others felt the same way; attributing the change in opinion to just a deference for tribal authority seems uncharitable.
Say I am worried about this tribal thing happening a lot—what would put my mind more at ease?
I don’t know your mind, you tell me? What exactly is it that you find worrying?
My possibly-incorrect guess is that you’re worried about something like “the community turning into an echo chamber that only promotes Eliezer’s views and makes its members totally ignore expert opinion when forming their views”. But if that was your worry, the presence of highly upvoted criticisms of Eliezer’s views should do a lot to help, since it shows that the community does still take into account (and even actively reward!) well-reasoned opinions that show dissent from the tribal leaders.
So since you still seem to be worried despite the presence of those comments, I’m assuming that your worry is something slightly different, but I’m not entirely sure of what.
One problem is that the community has few people actually engaged enough with cutting edge AI / machine learning / whatever-the-respectable-people-call-it-this-decade research to have opinions that are grounded in where the actual research is right now. So a lot of the discussion is going to consist of people either staying quiet or giving uninformed opinions to keep the conversation going. And what incentive structures there are here mostly work for a social club, so there aren’t really that many checks and balances that keep things from drifting further away from being grounded in actual reality instead of the local social reality.
Ilya actually is working with cutting edge machine learning, so I pay attention to his expressions of frustration and appreciate that he persists in hanging out here.
Agreed both with this being a real risk, and it being good that Ilya hangs out here.
Who do you think said that in this case?
Just to be clear about your position, what do you think are reasonable values for human-level AI with 10% probability / human-level AI with 50% probability / human-level AI with 90% probability?
I think the question in this thread is about how much the deep learning Go program should move my beliefs about this, whatever they may be. My answer is “very little in a sooner direction” (just because it is a successful example of getting a complex thing working). The question wasn’t “what are your beliefs about how far away human-level AI is” (mine are centered fairly far out).
I think this debate is quite hard with vague terms like “very little” and “far out”. I really do think it would be helpful for other people trying to understand your position if you put down your numbers for those predictions.
The point is how much we should update our AI future timeline beliefs (and associated beliefs about whether it is appropriate to donate to MIRI and how much) based on the current news of DeepMind’s AlphaGo success.
There is a difference between “Gib moni plz because the experts say that there is a 10% probability of human-level AI within 2022” and “Gib moni plz because of AlphaGo”.
I understand IlyaShpitser to claim that there are people who update their AI future timeline beliefs in a way that isn’t appropriate because of EY statements. I don’t think that’s true.
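Since this sub-thread keeps circling around “how much should this news move your timeline beliefs,” here is a minimal Bayes-rule sketch of what such an update would even look like in numbers. The prior and both likelihoods below are made-up placeholders for illustration, not anyone’s actual estimates:

```python
# Toy Bayes-rule update on "human-level AI before 2050" given the AlphaGo news.
# All numbers here are made-up placeholders, purely for illustration.

prior_early = 0.10            # hypothetical P(human-level AI before 2050)
p_news_given_early = 0.9      # hypothetical P(AlphaGo-level result now | early AI)
p_news_given_late = 0.6       # hypothetical P(AlphaGo-level result now | late AI)

evidence = (p_news_given_early * prior_early
            + p_news_given_late * (1 - prior_early))
posterior_early = p_news_given_early * prior_early / evidence

print(f"Prior {prior_early:.2f} -> posterior {posterior_early:.2f}")
# With these placeholder likelihoods the shift is modest (~0.10 -> ~0.14),
# which is one way of cashing out "how much should this news move you?"
```

The interesting disagreement is then about the likelihood ratio, i.e. how much more expected an AlphaGo-level result is in worlds where human-level AI is near.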
I don’t have a source on this, but I remember an anecdote from Kurzweil that scientists who worked on early transistors were extremely skeptical about the future of the technology. They were so focused on solving specific technical problems that they didn’t see the big picture. Whereas an outsider could have just looked at the general trend and predicted a doubling every 18 months, and that prediction would have been accurate for at least 50 years.
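For a sense of scale, here is a minimal sketch of how far that naive extrapolation compounds; the 18-month period and 50-year horizon are just the figures quoted in the anecdote above, not measured data:

```python
# Rough illustration: compound an 18-month doubling over 50 years.
# Both figures are taken from the anecdote above, not from actual transistor data.

doubling_period_years = 1.5   # one doubling every 18 months
horizon_years = 50

doublings = horizon_years / doubling_period_years
growth_factor = 2 ** doublings

print(f"{doublings:.1f} doublings -> roughly {growth_factor:.2e}x improvement")
# ~33.3 doublings, i.e. an improvement factor on the order of 10^10
```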
So that’s why I wouldn’t trust various ML experts like Ng who have said not to worry about AGI. No, the specific algorithms they work on are not anywhere near human level. But the general trend, and the proof that humans aren’t really that special, is concerning.
I’m not saying that you should just trust Yudkowsky or me instead. And expert opinion still has value. But maybe pick an expert who is more “big picture” focused? Perhaps Jürgen Schmidhuber, who has done a lot of notable work on deep learning and ML, but also has an interest in general intelligence and self-improving AIs.
And I don’t have any specific prediction from him on when we will reach AGI. But he did say last year that he believes we will reach monkey level intelligence in 10 years. Which is quite a huge milestone.
Another candidate might be the group being discussed in this thread, DeepMind. They are focused on reaching general AI instead of just typical machine-vision work. That’s why they have such a strong interest in game playing. I don’t have any specific predictions from them either, but I do get the impression they are very optimistic.
I’m not buying this.
There are tons of cases where people look at the current trend and predict it will continue unabated into the future. Occasionally they turn out to be right, mostly they turn out to be wrong. In retrospect it’s easy to pick “winners”, but do you have any reason to believe it was more than a random stab in the dark which got lucky?
If you were trying to predict the future of flight in 1900, you’d have done pretty terribly by surveying experts. You would have done far better by taking a Kurzweil-style approach where you put combustion-engine performance on a chart and compare it to estimates of the power-to-weight ratio required for flight.
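As a toy illustration of that chart-the-trend-against-the-requirement approach, here is a minimal sketch. The power-to-weight figures and the flight threshold below are purely hypothetical placeholders, not historical measurements:

```python
import numpy as np

# Sketch of "put the trend on a chart and compare it to the requirement".
# The data points and the flight threshold are hypothetical placeholders.

years = np.array([1860, 1870, 1880, 1890, 1900])
power_to_weight = np.array([0.01, 0.02, 0.05, 0.12, 0.30])  # hypothetical hp/lb
flight_threshold = 0.5  # hypothetical hp/lb needed for powered flight

# Fit a straight line to log(power/weight) vs. year, i.e. an exponential trend.
slope, intercept = np.polyfit(years, np.log(power_to_weight), 1)

# Solve for the year at which the fitted trend reaches the threshold.
crossing_year = (np.log(flight_threshold) - intercept) / slope
print(f"Trend crosses the flight threshold around {crossing_year:.0f}")
```

The point of the sketch is only the method: extrapolate the enabling quantity and see when it crosses the known requirement, rather than asking practitioners how they feel about the problem.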
The point of that comment wasn’t to praise predicting with trends. It was to show an example where experts are sometimes overly pessimistic and not looking at the big picture.
When people say that current AI sucks, and progress is really hard, and they can’t imagine how it will scale to human level intelligence, I think it’s a similar thing. They are overly focused on current methods and their shortcomings and difficulties. They aren’t looking at the general trend that AI is rapidly making a lot of progress. Who knows what could be achieved in decades.
I’m not talking about specific extrapolations like Moore’s law, or even imagenet benchmarks—just the general sense of progress every year.
This claim doesn’t make much sense from the outset. Look at your specific example of transistors. In 1965, an electronics magazine wanted to figure out what would happen over time with electronics/transistors, so they called up an expert: Gordon Moore, the director of research at Fairchild Semiconductor. Moore proceeded to coin Moore’s law and tell them the doubling would continue for at least a decade, probably more. Moore wasn’t an outsider, he was an expert.
You then generalize from an incorrect anecdote.
I never said that every engineer at every point in time was pessimistic. Just that many of them were at one time. And I said it was a second hand anecdote, so take that for what it’s worth.
You have to be more specific with the timeline. The transistor was first patented in 1925 but received little interest due to many technical problems. It took three decades of research before the first commercial transistors were produced by Texas Instruments in 1954.
Gordon Moore formulated his eponymous law in 1965, while he was director of R&D at Fairchild Semiconductor, a company whose entire business consisted in the manufacture of transistors and integrated circuits. By that time, tens of thousands of transistor-based computers were in active commercial use.
It wouldn’t have made a lot of sense to predict any doublings for transistors in an integrated circuit before 1960, because I think that is when they were invented.
In what specific areas do you think LWers are making serious mistakes by ignoring or not accepting strong enough priors from experts?
As I said, the ideal is to use expert opinion as a prior unless you have a lot of good info, or you think something is uniquely dysfunctional about an area (it’s rationalist folklore that a lot of areas are dysfunctional—“the world is mad”—but I think people are being silly about this). Experts really do know a lot.
You also need to figure out who are actual experts and what do they actually say. That’s a non-trivial task—reading reports on science in mainstream media will just stuff your head with nonsense.
It’s true, reading/scholarship is hard (even for scientists).
It’s actually much worse than that, because huge breakthroughs themselves are what create new experts. So on the eve of a huge breakthrough, currently recognized experts invariably predict that it is far off, simply because they can’t see the novel path to the solution.
In this sense everyone who is currently an AI expert is, trivially, someone who has failed to create AGI. The only experts who have any sort of clear understanding of how far AGI is are either not currently recognized or do not yet exist.
Btw, I don’t consider myself an AI expert. I am not sure what “AI expertise” entails, I guess knowing a lot about lots of things that include stuff like stats/ML but also other things, including a ton of engineering. I think an “AI expert” is sort of like “an airplane expert.” Airplanes are too big for one person—you might be an expert on modeling fluids or an expert on jet engines, but not an expert on airplanes.
AI, general singulatarianism, cryonics, life extension?
And the many-worlds interpretation of quantum mechanics. That is, all EY’s hobby horses. Though I don’t know how common these positions are among the unquiet spirits that haunt LessWrong.
My thoughts exactly.
Not if you like paper clips.
Were you genuinely asking, or...
I was asking, but Eliezer’s commentary convinced me to be worried.
I think it’s a giant leap for go and one small step for mankind.
(Well, I don’t know how giant a leap it is. But it’s a hell of an achievement.)
AI researcher Yoshua Bengio says machines won’t become dangerously smart anytime soon. Choice quote: