It is also interesting to know the size of AlphaGo.
Wiki says: “The distributed version in October 2015 was using 1,202 CPUs and 176 GPUs” (and it was developed by a team of 100 scientists). Assuming these were the best GPUs on the market in 2015, at around 1 teraflop each, the total computing power of AlphaGo was around 200 teraflops or more. (I would estimate 100 teraflops to 1 petaflop, with 75% probability.) I also think the size of the program is on the order of terabytes, but I conclude that only from the number of computers in use.
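Spelling that estimate out (same assumption of roughly 1 teraflop per GPU; the per-CPU figure is just my guess):

```python
# Back-of-envelope estimate of the distributed AlphaGo's raw compute.
# The ~1 TFLOPS per GPU and ~10 GFLOPS per CPU figures are assumptions, not measurements.
gpus, tflops_per_gpu = 176, 1.0
cpus, tflops_per_cpu = 1202, 0.01

total_tflops = gpus * tflops_per_gpu + cpus * tflops_per_cpu
print(total_tflops)  # ~188 TFLOPS, i.e. on the order of 200 teraflops
```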
This could give us the minimal size of an AI at the current level of technology. Fooming would not be easy for such an AI, as it would require sizeable new resources and rewriting of its complicated inner structure.
And it is also not computer-virus-sized yet, so it can’t run away. A private researcher probably doesn’t have such computational resources, but a hacker could use a botnet.
But if such an AI is used to create more effective master algorithms, it may foom.
I also think the size of the program is on the order of terabytes, but I conclude that only from the number of computers in use.
I don’t think that’s true. The distributed system for playing is using multiple copies of the CNN value network so each one can do board evaluation during the MCTS on its own without the performance disaster of sending it over the network to a GPU or something crazy like that, not a single one sharded over two hundred servers (CPU!=computer). Similarly for training: each was training the same network in parallel, not 1/200th of the full NN. (You could train something like AlphaGo on your laptop’s GPU, it’d just take like 2 years by their wallclock numbers.)
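To make the distinction concrete, here is a toy sketch (my own illustration, not DeepMind’s code) of the replicated setup: every worker holds a complete copy of the weights and can evaluate a position locally, with nothing sharded across machines.

```python
# Toy illustration of data parallelism: each worker keeps a full replica of the
# network, so board evaluation during MCTS needs no cross-machine communication.
# (Not DeepMind's code; just the general idea.)
import random

NUM_WORKERS = 4       # stand-in for the ~200 machines
NUM_PARAMS = 1000     # stand-in for the CNN's parameters

master_weights = [random.random() for _ in range(NUM_PARAMS)]
replicas = [list(master_weights) for _ in range(NUM_WORKERS)]  # full copy each

def evaluate(weights, position):
    """Dummy stand-in for a value-network forward pass."""
    return sum(w * x for w, x in zip(weights, position))

position = [random.random() for _ in range(NUM_PARAMS)]
values = [evaluate(replica, position) for replica in replicas]
assert max(values) - min(values) < 1e-9  # every worker computes the same value locally
```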
The actual CNN is going to be something like 10MB-1GB, because more than that and you can’t fit it on 1 GPU to do training. Reading the paper, it seems to be fairly comparable in size to ImageNet competitors:
Neural network architecture. The input to the policy network is a 19×19×48 image stack consisting of 48 feature planes. The first hidden layer zero pads the input into a 23×23 image, then convolves k filters of kernel size 5×5 with stride 1 with the input image and applies a rectifier nonlinearity. Each of the subsequent hidden layers 2 to 12 zero pads the respective previous hidden layer into a 21×21 image, then convolves k filters of kernel size 3×3 with stride 1, again followed by a rectifier nonlinearity. The final layer convolves 1 filter of kernel size 1×1 with stride 1, with a different bias for each position, and applies a softmax function. The match version of AlphaGo used k=192 filters; Fig. 2b and Extended Data Table 3 additionally show the results of training with k=128, 256 and 384 filters.

The input to the value network is also a 19×19×48 image stack, with an additional binary feature plane describing the current colour to play. Hidden layers 2 to 11 are identical to the policy network, hidden layer 12 is an additional convolution layer, hidden layer 13 convolves 1 filter of kernel size 1×1 with stride 1, and hidden layer 14 is a fully connected linear layer with 256 rectifier units. The output layer is a fully connected linear layer with a single tanh unit.
So 500MB would be a reasonable guess if you don’t want to work out how many parameters that 13-layer network translates to. Not large at all, and model compression would at least halve that.
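If you do want to work it out, a rough count from that description (my arithmetic, covering only the policy network’s convolution weights at k=192) looks like this:

```python
# Rough parameter count for the match policy network (k = 192), taken straight
# from the quoted architecture; biases and the value network's extra layers are
# mostly ignored, so treat this as an order-of-magnitude sketch.
k = 192
layer_1     = 5 * 5 * 48 * k          # 5x5 filters over 48 input planes
layers_2_12 = 11 * (3 * 3 * k * k)    # eleven 3x3 conv layers, k -> k channels
final_layer = 1 * 1 * k + 19 * 19     # one 1x1 filter plus a bias per board position

params = layer_1 + layers_2_12 + final_layer
print(params)                  # ~3.9 million parameters
print(params * 4 / 1e6, "MB")  # ~15.5 MB at 32-bit floats
```

If that count is right, the raw weights sit at the low end of the 10MB-1GB range.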
This could give us the minimal size of an AI at the current level of technology. Fooming would not be easy for such an AI, as it would require sizeable new resources and rewriting of its complicated inner structure.
200 GPUs is not that expensive. Amazon will rent you 1 GPU at spot for ~$0.2/hour, so <$1k/day.
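Spelled out (using that spot price; actual prices vary by instance type and region):

```python
# Back-of-envelope daily rental cost at the spot price quoted above.
gpus = 200
usd_per_gpu_hour = 0.2          # approximate spot price; varies in practice
daily_cost = gpus * usd_per_gpu_hour * 24
print(daily_cost)               # 960.0 -> just under $1k/day
```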
Demis said that AlphaGo also works on a single computer. The distributed version has a 75% winning chance against the single-computer version. The hardware they used seems to be the point where there are diminishing returns from adding more hardware.
Thanks for the clarification. If the size is really 500 MB, it could easily be stolen or could run away, and $1k a day seems affordable for a dedicated hacker.