I agree. I don’t find this result to be any more or less indicative of near-term AI than Google’s success on ImageNet in 2012. The algorithm learns to map positions to moves and values using CNNs, just as CNNs can be used to learn mappings from images to 350 classes of dog breeds and more. It turns out that Go really is a game about pattern recognition and that with a lot of data you can replicate the pattern detection for good moves in very supervised ways (one could call their reinforcement learning actually supervised because the nature of the problem gives you credit assignment for free).
I think what this result says is this: “Any task humans can do, an AI can now learn to do better, given a sufficient source of training data.”
Games lend themselves to auto-generation of training data, in the sense that the AI can at the very least play against itself. No matter how complex the game, a deep neural net will find the structure in it, and find a deeper structure than human players can find.
We have now answered the question of, “Are deep neural nets going to be sufficient to match or exceed task-specific human performance at any well-specified task?” with “Yes, they can, and they can do it better and faster than we suspected.” The next hurdle—which all the major companies are working on—is to create architectures that can find structure in smaller datasets, less well-tailored training data, and less well-specified tasks.
I don’t think it says anything like that.
I included the word “sufficient” as an ass-covering move, because one facet of the problem is we don’t really know what will serve as a “sufficient” amount of training data in what context.
But what specific types of tasks do you think machines still can’t do, given sufficient training data? If your answer is something like “physics research,” my rejoinder would be that if you could generate training data for that job, a machine could do it.
Grand pronouncements with an ass-covering move look silly :-)
One obvious problem is that you are assuming stability. Consider modeling something that changes (in complex ways) with time, like the economy of the United States. Is “training data” from the 1950s relevant to the current situation?
Generally speaking, the speed at which your “training data” gets stale puts an upper limit on the relevant data that you can possibly have and that, in turn, puts an upper limit on the complexity of the model (NNs included) that you can build on its basis.
I don’t see how we know anything like that, namely that deep NNs with ‘sufficient training data’ would be sufficient for all problems. We’ve seen them be sufficient for many different problems and can expect them to be sufficient for many more, but all?
Yes, but that would likely require an extremely large amount of training data, because preparing actions for many kinds of situations means an exponential blow-up to cover the many combinations of possibilities, and hence the model would need to be huge as well. It would also require high-quality data sets with simple correction signals in order to work, and those are expensive to produce.
I think that, above all, building a real-time AI requires reuse of concepts, so that abstractions can be recombined and adapted to new situations; and concept-based prediction (reasoning) requires one-shot learning, so that trains of thought can be memorized and built upon. In addition, the entire network needs to learn somehow to determine which of its parts were responsible in the past for current reward signals, which are delayed and noisy. If there is a simple and fast solution to this, then AGI could be right around the corner. If not, it could take several decades of research.
This is a well-known problem, called reinforcement learning. It is a significant component in the reported results. (What happens in practice is that a network’s ability to assign “credit” or “blame” for reward signals falls off exponentially with increasing delay. This is a significant limitation, but reinforcement learning is nevertheless very helpful given tight feedback loops.)
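To put rough numbers on the exponential fall-off mentioned above, here is a minimal sketch. It assumes that credit for a reward is discounted by a constant factor per step of delay; the factor 0.9 and the function name are illustrative assumptions, not details from the reported results.

```python
def credit_weight(delay_steps: int, discount: float = 0.9) -> float:
    """Weight given to an action taken `delay_steps` before the reward.

    Assumes plain geometric discounting (discount ** delay_steps). This is
    an illustrative model, not the scheme used in the reported results.
    """
    if delay_steps < 0:
        raise ValueError("delay_steps must be non-negative")
    return discount ** delay_steps


if __name__ == "__main__":
    # With tight feedback loops most of the credit survives;
    # with long delays almost none of it does.
    for delay in (0, 1, 10, 50):
        print(delay, credit_weight(delay))
```

Under that assumption, an action one step before the reward keeps 90% of the credit, while an action fifty steps before it keeps well under 1%.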
Yes, but as I wrote above, the problems of credit assignment, reward delay and noise are non-existent in this setting, and hence their work does not contribute at all to solving AI.
Credit assignment and reward delay are nonexistent? What do you think happens when one diffs the board strength of two potential boards?
Reward delay is not very significant in this task, since the task is episodic and fully observable, and there is no time preference, thus you can just play a game to completion without updating and then assign the final reward to all the positions.
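A minimal sketch of "play to completion, then assign the final reward to all the positions". The position encoding, the alternating-turn convention, and the function name are assumptions made for illustration; this is not the authors' actual pipeline.

```python
from typing import List, Tuple


def label_positions(positions: List[str], winner: str) -> List[Tuple[str, str, int]]:
    """Pair every position of a finished game with the final outcome.

    positions[i] is the board before move i. This sketch assumes Black
    ('B') moves on even i and White ('W') on odd i. The target is +1 if
    the player to move went on to win the game, and -1 otherwise.
    """
    labeled = []
    for i, position in enumerate(positions):
        to_move = "B" if i % 2 == 0 else "W"
        target = 1 if to_move == winner else -1
        labeled.append((position, to_move, target))
    return labeled


if __name__ == "__main__":
    # Hypothetical three-move game won by White.
    for row in label_positions(["empty", "after B1", "after W1"], winner="W"):
        print(row)
```

Every position gets the same end-of-game signal, so no update has to happen during the game itself.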
In more general reinforcement learning settings, where you want to update your policy during the execution, you have to use some kind of temporal difference learning method, which is further complicated if the world states are not fully observable.
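For contrast, a minimal sketch of the simplest temporal difference method, tabular TD(0), which updates value estimates during the episode instead of waiting for the end. The three-state chain, the step size, and all names are assumptions chosen for illustration, and the states here are fully observable.

```python
from typing import Dict


def td0_update(values: Dict[str, float], state: str, reward: float,
               next_state: str, alpha: float = 0.1, gamma: float = 1.0) -> float:
    """One TD(0) step: move V(state) toward reward + gamma * V(next_state)."""
    target = reward + gamma * values.get(next_state, 0.0)
    td_error = target - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error


def run_chain(episodes: int = 500) -> Dict[str, float]:
    """Deterministic chain a -> b -> c -> end, with reward 1 on the last step."""
    transitions = [("a", 0.0, "b"), ("b", 0.0, "c"), ("c", 1.0, "end")]
    values: Dict[str, float] = {}
    for _ in range(episodes):
        for state, reward, next_state in transitions:
            # The update happens mid-episode, before the final reward is seen.
            td0_update(values, state, reward, next_state)
    return values


if __name__ == "__main__":
    print(run_chain())
```

The reward only reaches state a by propagating backwards through b and c over many episodes, which is the bootstrapping that the episodic shortcut above avoids.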
Credit assignment is taken care of by backpropagation, as usual in neural networks. I don’t know why RaelwayScot brought it up, unless they meant something else.
I meant that for AI we will possibly require high-level credit assignment, e.g. experiences of regret like “I should be more careful in these kinds of situations”, or the realization that one particular strategy out of the entire sequence of moves worked out really nicely. Instead, the training procedure penalizes or reinforces all moves of one game equally, which is potentially a much slower learning process. It turns out that playing Go can be solved without much structure in the credit assignment process. That is why I said the problem is non-existent: there wasn’t even a need to consider it and thereby further our understanding of RL techniques.
Agreed, with the caveat that this is a stochastic object, and thus not a fully simple problem. (Even if I knew all possible branches of the game tree that originated in a particular state, I would need to know how likely any of those branches are to be realized in order to determine the current value of that state.)
Well, the value of a state is defined assuming that the optimal policy is used for all the following actions. For tabular RL you can actually prove that the updates converge to the optimal value function/policy function (under some conditions). If NNs are used you don’t have any convergence guarantees, but in practice the people at DeepMind are able to make it work, and this particular scenario (perfect observability, determinism and short episodes) is simpler than, for instance, that of the Atari DQN agent.
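The two notions of state value in this exchange can be contrasted on a toy game tree. This is a sketch under stated assumptions: "optimal policy" is read as minimax play in a two-player zero-sum game, and "how likely each branch is" is stood in for by uniformly random moves. The tree and the function names are made up for illustration.

```python
from typing import List, Union

# A leaf is the outcome for the maximizing player; a list is a choice point.
Tree = Union[float, List["Tree"]]


def minimax_value(node: Tree, maximizing: bool = True) -> float:
    """Value of a node when both sides play optimally from here on."""
    if not isinstance(node, list):
        return float(node)
    child_values = [minimax_value(child, not maximizing) for child in node]
    return max(child_values) if maximizing else min(child_values)


def expected_value(node: Tree) -> float:
    """Value of a node when every move is picked uniformly at random."""
    if not isinstance(node, list):
        return float(node)
    return sum(expected_value(child) for child in node) / len(node)


if __name__ == "__main__":
    # Root: our move. Left branch lets the opponent pick a loss for us;
    # right branch wins whatever the opponent does.
    tree: Tree = [[1.0, -1.0], [1.0, 1.0]]
    print(minimax_value(tree), expected_value(tree))
```

On this tree the optimal-play value is 1.0 while the random-play value is 0.5, so the answer depends on which policy is assumed for the remaining moves.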
“Nonexistent problems” was meant as hyperbole, to say that they weren’t solved in interesting ways and are extremely simple in this setting because the states and rewards are noise-free. I am not sure what you mean by the second question. They just apply gradient descent on the entire history of moves of the current game such that expected reward is maximized.
It seems to me that the problem of value assignment to boards (“What’s the edge for W or B if the game state looks like this?”) is basically a solution to that problem, since it gives you the counterfactual information you need (how much would placing a stone here improve my edge?) to answer those questions.
I agree that it’s a much simpler problem here than it is in a more complicated world, but I don’t think it’s trivial.
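The board-value comparison described above can be sketched as follows. Everything here is a hypothetical stand-in: `value` plays the role of a learned board evaluator, `apply_move` the role of the game rules, and the numbers are invented.

```python
from typing import Callable, Dict, Hashable, Iterable


def move_gains(board: Hashable,
               legal_moves: Iterable[Hashable],
               apply_move: Callable[[Hashable, Hashable], Hashable],
               value: Callable[[Hashable], float]) -> Dict[Hashable, float]:
    """Estimated gain of each move: value(board after move) - value(board)."""
    baseline = value(board)
    return {move: value(apply_move(board, move)) - baseline
            for move in legal_moves}


if __name__ == "__main__":
    # Invented edge estimates for a toy "game" with two candidate moves.
    toy_values = {"start": 0.50, "start+a": 0.62, "start+b": 0.45}
    gains = move_gains("start", ["a", "b"],
                       lambda board, move: f"{board}+{move}",
                       toy_values.__getitem__)
    print(max(gains, key=gains.get), gains)
```

The difference between two board values is the counterfactual signal: it says how much one specific move changed the estimated edge, without waiting for the end of the game.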