Finally, RL practitioners have known that genuine causal reasoning could never be achieved via known RL architectures: you’d only ever get something that could execute the same policy as an agent that had reasoned that way, via a very expensive process of evolving away from dominated strategies at each step down the tree of move and countermove. It’s the biggest known unknown on the way to AGI.
What’s the argument here? Do you think that the AGZ policy (which is extremely good at Go or Chess even without any tree search) doesn’t do any causal reasoning? That it only ever learns to play parts of the game tree it’s seen during training? What does “genuine causal reasoning” even mean?
It looks to me like causal reasoning is just another type of computation, and that you could eventually find that computation by local search. If you need to use RL to guide that search then it’s going to take a long time—AlphaStar was very expensive, and still only trained a policy with ~80M parameters.
From my perspective it seems like the big questions are just how large a policy you would need to train using existing methods in order to be competitive with a human (my best guess would be a ~trillion to a ~quadrillion parameters), and whether you can train it by copying rather than needing to use RL.
To copy myself in another thread, AlphaZero did some (pruned) game tree exploration in a hardcoded way that allowed the NN to focus on the evaluation of how good a given position was; this allowed it to kind of be a “best of both worlds” between previous algorithms like Stockfish and a pure deep reinforcement learner.
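To make that “best of both worlds” split concrete, here is a toy sketch in the AlphaZero shape: a hand-coded, pruned tree search whose leaf evaluation is delegated to a stubbed-out value network. Everything here, including the mini-game (players alternately add 1–3 to a total; whoever reaches 21 wins) and the stub functions, is invented for illustration; it is not AlphaZero’s actual algorithm or code.

```python
# Toy sketch of AlphaZero-style search: hardcoded lookahead, with all
# position judgment delegated to a "value network" stub.

TARGET = 21

def value_stub(total):
    """Stand-in for a value network: estimated win chance for the player
    to move. Positions where (TARGET - total) % 4 == 0 are theoretically
    losing in this game, so the stub scores them low."""
    return 0.1 if (TARGET - total) % 4 == 0 else 0.9

def policy_stub(total):
    """Stand-in for a policy network: a prior over legal moves 1..3,
    which the search uses to decide what to expand (pruning)."""
    moves = [m for m in (1, 2, 3) if total + m <= TARGET]
    return {m: 1.0 / len(moves) for m in moves}

def search(total, depth):
    """Depth-limited negamax-style search; the value stub scores the
    frontier instead of searching to the end of the game."""
    if total == TARGET:
        return 0.0  # the player to move has already lost
    if depth == 0:
        return value_stub(total)
    priors = policy_stub(total)  # only expand moves with nonzero prior
    return max(1.0 - search(total + m, depth - 1) for m in priors)

def best_move(total, depth=6):
    """Pick the move whose subtree looks best to the searcher."""
    return max(policy_stub(total),
               key=lambda m: 1.0 - search(total + m, depth - 1))
```

The point of the structure: `search` supplies the lookahead, while all evaluation of “how good is this position” lives in `value_stub` and `policy_stub`, which are the parts a neural network would replace.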
Re: your middle paragraph, I agree that you’re right about an RL agent doing metalearning, though we also agree that with current architectures it would take a prohibitive amount of computation to get anything like a competent general causal reasoner that way.
I’m not going to go up against your intuitions on imitation learning etc; I’m just surprised if you don’t expect there’s a necessary architectural advance needed to make anything like general causal reasoning emerge in practice from some combination of imitation learning and RL.
I meant to ask about the policy network in AlphaZero directly. It plays at the professional level (the Nature paper puts it at a comparable Elo to Fan Hui) with no tree search, using a standard neural network architecture trained by supervised learning. It performs fine on parts of the search tree that never appeared during training. What distinguishes this kind of reasoning from “if I see X, I do Y”?
(ETA clarification, because I think this was probably the misunderstanding: the policy network plays Go with no tree search, tree search is only used to generate training data. That suggests the AlphaStar algorithm would produce similar behavior without using tree search ever, probably using at most 100x the compute of AlphaZero and I’d be willing to bet on <10x.)
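A minimal, made-up illustration of that split (a toy mini-game and invented names, nothing like the real AGZ pipeline): search runs only to manufacture supervised targets, and the distilled policy then plays without ever searching.

```python
# Toy game: players alternately add 1 or 2 to a running total; whoever
# reaches exactly 10 wins. Search is used once, at "training" time, to
# produce move targets; play-time uses only the distilled lookup policy.

TARGET = 10

def search_move(total):
    """Exhaustive search used only to generate training targets:
    returns a winning move if one exists, else a fallback move."""
    def wins(t):
        # True if the player to move at total t can force a win.
        return any(t + m == TARGET or not wins(t + m)
                   for m in (1, 2) if t + m <= TARGET)
    for m in (1, 2):
        if total + m <= TARGET and (total + m == TARGET or not wins(total + m)):
            return m
    return 1

# "Training": distill the search's outputs into a plain lookup table.
policy = {t: search_move(t) for t in range(TARGET)}

def play_move(total):
    """Inference: no tree search anywhere, just the distilled policy."""
    return policy[total]
```

The distilled `policy` inherits the search’s competence without doing any search at play time, which is the sense in which tree search is “only used to generate training data.”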
From the outside, it looks like human-level play at Starcraft is more complicated (in a sense) than human-level play at Go, and so it’s going to take bigger models in order to reach a similar level of performance. I don’t see a plausible-looking distinction-in-principle that separates the strategy in Starcraft from strategy in Go.
IIUC the distinction being made is about the training data, granted the assumption that you may be able to distill tree-search-like abilities into a standard NN with supervised learning if you have samples from tree search available as supervision targets in the first place.
AGZ was hooked up to a tree search in its training procedure, so its training signal allowed it to learn not just from the game trees it “really experienced” during self-play episodes but also (in a less direct way) from the much larger pool of game trees it “imagined” while searching for its next move during those same episodes. The former is always (definitionally) available in self-play, but the latter is only available if tree search is feasible.
But to be clear, (i) it would then also be learned by imitating a large enough dataset from human players who did something like tree search internally while playing, (ii) I think the tree search makes a quantitative not qualitative change, and it’s not that big (mostly improves stability, and *maybe* a 10x speedup, over self-play).
I don’t see how (i) follows? The advantage of (internal) tree search during training is precisely that it constrains you to respond sensibly to situations that are normally very rare (but are easily analyzable once they come up), e.g. “cheap win” strategies that are easily defeated by serious players and hence never come up in serious play.
AGZ is only trained on the situations that actually arise in games it plays.
I agree with the point that “imitation learning from human games” will only make you play well on kinds of situations that arise in human games, and that self-play can do better by making you play well on a broader set of situations. You could also train on all the situations that arise in a bigger tree search (though AGZ did not) or against somewhat-random moves (which AGZ probably did).
(Though I don’t see this as affecting the basic point.)
Just a layman here, and not sure if this is what this particular disagreement is about, but one impression I’ve gotten from AlphaGo Zero and GPT-2 is that while there are definitely more architectural advances to be made, they may be more of the sort “make better use of computation, generally” than anything that feels particularly specific to strategy/decision-making problems. (And I get the impression that at least some people saying further breakthroughs are needed are thinking of something more specific to general intelligence.)
From the abstract of the MuZero paper: “Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games—the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled—our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.”
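Schematically, the abstract describes three learned functions that replace the simulator. Here is a sketch of that interface with trivial stubs standing in for the real networks; the stubs and names are pure assumptions for illustration, not MuZero’s implementation.

```python
# MuZero's three learned functions, per the abstract, with toy stubs.

def h(observation):
    """Representation function: raw observation -> abstract state."""
    return sum(observation)  # stub: collapse the observation to a number

def g(state, action):
    """Dynamics function: (state, action) -> (next state, predicted reward)."""
    return state + action, float(action)  # stub

def f(state):
    """Prediction function: state -> (policy over actions, value)."""
    return {0: 0.5, 1: 0.5}, float(state)  # stub

def unroll(observation, actions):
    """Plan entirely inside the learned model: no environment simulator
    and no game rules are consulted, only h, g, and f."""
    state = h(observation)
    total_reward = 0.0
    for a in actions:
        state, reward = g(state, a)
        total_reward += reward
    _, value = f(state)
    return total_reward, value
```

Planning happens entirely inside `unroll`: the environment’s true dynamics are never queried, which is the difference from AlphaZero’s search over a perfect simulator.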
“the big questions are just how large a policy you would need to train using existing methods in order to be competitive with a human (my best guess would be a ~trillion to a ~quadrillion)”
Why just a 10x speedup over model-free RL? I would’ve expected much more.
New paper relevant to this discussion: https://arxiv.org/abs/1911.08265
Curious where this estimate comes from?