David Ormerod of GoGameGuru stated that although an analysis of AlphaGo’s play around 79–87 was not yet available, he believed it was a result of a known weakness in play algorithms which use Monte Carlo tree search. In essence, the search attempts to prune sequences which are less relevant. In some cases a play can lead to a very specific line of play which is significant, but which is overlooked when the tree is pruned, and this outcome is therefore “off the search radar”.[56]
Go champion Lee Se-dol strikes back to beat Google’s DeepMind AI for the first time, in the fourth game (3:1) http://www.theverge.com/2016/3/13/11184328/alphago-deepmind-go-match-4-result
AlphaGo seems to play really bad moves when it is losing. This makes some sense, as humans also make overplays out of desperation, but it suggests that AlphaGo would be bad at handicap games unless they change the algorithm to maximise score instead of win probability.
There’s nothing “bad” about desperate overplays while losing, from AlphaGo’s perspective. In the same way that it doesn’t care about winning by more than half a point, it doesn’t mind making its loss more crushing. Invade every territory. If it doesn’t work, you lose by a bit more. Boo hoo. If it works, you might win.
I’m very interested in the fact that they coded a “resign” function into it. I wouldn’t have expected that.
That’s not what happened. T9 wasn’t a desperate overplay. It was just bad. J10 might have made more sense as a desperate overplay.
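To make the win-probability versus score distinction in this sub-thread concrete, here is a minimal sketch. The move names and all numbers are invented for illustration; nothing here comes from AlphaGo's actual evaluations. It only shows that the two objectives can disagree about the same candidate moves when a player is behind.

```python
def pick_move(candidates, objective):
    """Pick a move name from (name, win_prob, expected_margin) tuples.

    objective is "win_prob" (maximise the chance of winning) or
    "margin" (maximise the expected score difference).
    """
    index = {"win_prob": 1, "margin": 2}[objective]
    return max(candidates, key=lambda c: c[index])[0]


# Hypothetical evaluations for a player who is already behind.
candidates = [
    ("solid_endgame", 0.02, -3.5),   # almost certainly loses, but narrowly
    ("deep_invasion", 0.08, -20.0),  # usually fails badly, occasionally wins
]

by_win_prob = pick_move(candidates, "win_prob")  # "deep_invasion"
by_margin = pick_move(candidates, "margin")      # "solid_endgame"
print(by_win_prob, by_margin)
```

Under the win-probability objective the risky invasion is the "right" move even though it makes the expected loss larger, which is the behaviour described above. It says nothing about whether a specific move such as T9 was actually a sensible gamble.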
Has anyone from Google commented much on AlphaGo’s mistakes here? Why it made the mistake at 79, why it didn’t notice until later that it was suddenly losing, and why it started playing so badly when it did notice.
(I’ve seen commentary from people who’ve played other Monte Carlo-based bots, but I’m curious whether Google has confirmed it.)
I don’t think I’ve seen anyone say this explicitly: I would guess that part of the problem was that AG hasn’t had much training in “mistakes humans are likely to make”. With good play, it could have recovered against Lee, but not against itself, and it didn’t know it was playing Lee; somehow, the moves it actually played were ones that would have increased its chances of winning if it were playing itself.
I think the DeepMind folks said that they have to get back to London to analyse the case in detail.
I don’t think that’s a good explanation. There’s no way that removing its own ko threats with moves like P14 and O11 would have increased its chances if it had been playing against itself.
It looks a bit like the belief propagation needed to update after missing an important move doesn’t really work.
I think its policy net was only trained on amateurs, not professionals or self-play, making it a little weak. Normally, I suppose that reading large numbers of game trees compensates, but the odds of Lee making his brilliant move 78 (and one other move, but I can’t remember which) were 1/10000, so I think that AG never even analysed the first move of that sequence.
In other words:
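A toy sketch of the effect: a one-ply search using a PUCT-style selection rule (value estimate plus an exploration bonus proportional to the policy prior), which is the general shape of rule described for AlphaGo. The move names, the fixed values, the constant `c_puct` and the simulation count are all assumptions for illustration, and each simulation here just returns a constant instead of running a value net or rollout. The 1/10000 prior is the figure quoted in this thread.

```python
import math


def run_puct(priors, values, simulations, c_puct=5.0):
    """Count how often each root move is selected by a PUCT-style rule.

    priors: move -> policy-net prior probability (hypothetical numbers)
    values: move -> fixed value returned by every simulation of that move
    """
    visits = {m: 0 for m in priors}
    q = {m: 0.0 for m in priors}
    for _ in range(simulations):
        total = sum(visits.values())

        def score(m):
            # Exploration bonus shrinks with visits and scales with the prior.
            bonus = c_puct * priors[m] * math.sqrt(total) / (1 + visits[m])
            return q[m] + bonus

        move = max(priors, key=score)
        visits[move] += 1
        # Running mean of the values seen for this move.
        q[move] += (values[move] - q[move]) / visits[move]
    return visits


priors = {"solid_A": 0.6, "solid_B": 0.3999, "rare_tesuji": 0.0001}
values = {"solid_A": 0.5, "solid_B": 0.45, "rare_tesuji": 0.9}

visits = run_puct(priors, values, simulations=10_000)
print(visits)  # rare_tesuji is never selected, despite having the best value
```

With these numbers the bonus for the rare move never exceeds about 0.05, so it cannot outscore moves whose value estimate is already around 0.5, and the search never finds out that it is the best move. Whether this is what actually happened around moves 78 and 79 is not something this sketch can show.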
I wonder if Google could publish an SGF showing the most probable lines of play as calculated at each move, as well as the estimated probability of each of Lee’s moves?
I wonder if the best thing to do would be to train nets on: strong amateur games (lots of games, but perhaps lower quality moves?); pro games (fewer games but higher quality?); and self-play (high quality, but perhaps not entirely human-like?) and then take the average of the three nets?
Of course, this triples the GPU cycles needed, but it could perhaps be implemented just for the first few moves in the game tree?
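A rough sketch of what that averaging idea might look like, assuming each net outputs a move-probability dictionary. The three distributions, the function names and the `ensemble_depth` cut-off are all hypothetical, and this is not how AlphaGo is known to work.

```python
def average_policies(policies):
    """Average several move -> probability dicts and renormalise."""
    moves = set().union(*policies)
    avg = {m: sum(p.get(m, 0.0) for p in policies) / len(policies) for m in moves}
    total = sum(avg.values())
    return {m: v / total for m, v in avg.items()}


def prior_for_node(depth, net_outputs, ensemble_depth=2):
    """Use the averaged prior near the root, a single net deeper in the tree."""
    if depth < ensemble_depth:
        return average_policies(net_outputs)
    return dict(net_outputs[0])


# Hypothetical outputs of nets trained on amateur, pro, and self-play games.
nets = [
    {"A": 0.7, "B": 0.3},
    {"A": 0.5, "B": 0.4, "C": 0.1},
    {"A": 0.2, "B": 0.2, "C": 0.6},
]

mixed = prior_for_node(0, nets)
deep = prior_for_node(5, nets)
print(mixed, deep)
```

Restricting the ensemble to shallow nodes keeps the extra evaluations to the part of the tree where a missed candidate is most costly, which is the trade-off suggested above.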
I don’t think the issue is that 78 was a human-like move. It’s just a move that’s hard to see for both humans and non-humans.
Naively, pruning seems like it would cause a mistake at 77 (allowing the brilliant follow-up 78), not at 79 (when you can’t accidentally prune 78 because it’s already on the board). But people have been saying that it made a mistake at 79.
I don’t recall much detail about AG, but I thought the training it did was to improve the policy net? If the policy net was only trained on amateurs, what was it learning from self-play?
Of course, but I can’t remember which was the other very low-probability move, so perhaps it was one of the later moves in that sequence?
I thought the self-play only trained the value net (because they want it to predict human moves, not its own moves), but I might be remembering incorrectly. Pity that the paper is behind a paywall.