I also think that the size of the program is on the order of terabytes, but I only conclude that from the number of computers in use.
I don’t think that’s true. The distributed system for playing uses multiple copies of the CNN value network, so each machine can do board evaluation during the MCTS on its own without the performance disaster of sending positions over the network to a remote GPU or something crazy like that; it is not a single network sharded over two hundred servers (CPU != computer). Similarly for training: each machine was training the same network in parallel, not 1/200th of the full NN. (You could train something like AlphaGo on your laptop’s GPU, it’d just take something like 2 years by their wallclock numbers.)
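To make the “copies, not shards” point concrete, here is a minimal hypothetical sketch (Python/PyTorch, not DeepMind’s actual code) of a search worker that keeps its own local copy of the value net and evaluates leaf positions with no network round-trip; the class and method names are made up for illustration:

```python
import torch

class SearchWorker:
    """Sketch (not DeepMind's code) of how each search machine can hold its own
    full copy of the value network, so leaf evaluation during MCTS stays local
    instead of being shipped over the network to some central GPU server."""
    def __init__(self, value_net: torch.nn.Module):
        # Every worker gets the *same* complete network; nothing is sharded across machines.
        self.value_net = value_net.eval()

    @torch.no_grad()
    def evaluate_leaf(self, planes: torch.Tensor) -> float:
        # planes: (1, C, 19, 19) feature-plane stack for the leaf position
        return self.value_net(planes).item()   # scalar value from the tanh head
```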
The actual CNN is going to be something like 10MB-1GB, because more than that and you can’t fit it on 1 GPU to do training. Reading the paper, it seems to be fairly comparable in size to ImageNet competitors:
Neural network architecture. The input to the policy network is a 19×19×48 image stack consisting of 48 feature planes. The first hidden layer zero pads the input into a 23×23 image, then convolves k filters of kernel size 5×5 with stride 1 with the input image and applies a rectifier nonlinearity. Each of the subsequent hidden layers 2 to 12 zero pads the respective previous hidden layer into a 21×21 image, then convolves k filters of kernel size 3×3 with stride 1, again followed by a rectifier nonlinearity. The final layer convolves 1 filter of kernel size 1×1 with stride 1, with a different bias for each position, and applies a softmax function. The match version of AlphaGo used k=192 filters; Fig. 2b and Extended Data Table 3 additionally show the results of training with k=128, 256 and 384 filters.

The input to the value network is also a 19×19×48 image stack, with an additional binary feature plane describing the current colour to play. Hidden layers 2 to 11 are identical to the policy network, hidden layer 12 is an additional convolution layer, hidden layer 13 convolves 1 filter of kernel size 1×1 with stride 1, and hidden layer 14 is a fully connected linear layer with 256 rectifier units. The output layer is a fully connected linear layer with a single tanh unit.
So ~500MB would be a reasonable guess if you don’t want to work out how many parameters that 13-layer network translates to. Not large at all, and model compression would at least halve that.
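If you do want to work it out, here is a rough hypothetical sketch (PyTorch, not DeepMind’s code) of just the described 13-layer policy network at k=192; the final layer’s per-position bias is modelled as a separate learned 19×19 parameter:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Rough sketch of the described 13-layer policy network (match version, k=192)."""
    def __init__(self, k=192, planes=48):
        super().__init__()
        # Layer 1: zero-pad 19x19 -> 23x23 (padding=2), 5x5 conv, stride 1, ReLU
        self.conv1 = nn.Conv2d(planes, k, kernel_size=5, stride=1, padding=2)
        # Layers 2-12: zero-pad 19x19 -> 21x21 (padding=1), 3x3 conv, stride 1, ReLU
        self.hidden = nn.ModuleList(
            nn.Conv2d(k, k, kernel_size=3, stride=1, padding=1) for _ in range(11)
        )
        # Layer 13: 1x1 conv with a single filter; the paper's per-position bias is
        # modelled here as a separate learned 19x19 parameter.
        self.conv_out = nn.Conv2d(k, 1, kernel_size=1, stride=1, bias=False)
        self.position_bias = nn.Parameter(torch.zeros(19 * 19))

    def forward(self, x):                      # x: (batch, 48, 19, 19) feature planes
        x = F.relu(self.conv1(x))
        for conv in self.hidden:
            x = F.relu(conv(x))
        x = self.conv_out(x).flatten(1) + self.position_bias
        return F.softmax(x, dim=1)             # distribution over the 361 board points

if __name__ == "__main__":
    n = sum(p.numel() for p in PolicyNet().parameters())
    print(f"{n:,} parameters, ~{n * 4 / 1e6:.0f} MB at float32")  # ~3.9M params, ~16 MB
```

That comes out to roughly 3.9M parameters, about 16 MB at float32 for the policy network alone; the deployed system also carries the value network, the fast rollout policy, and whatever serialization overhead the stored model has, so the on-disk size would be larger, but still nowhere near terabytes.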
This could provide us with minimal size of AI on current level of technologies. Fooming for such AI will be not easy as it would require sizeable new resources and rewriting of it complicated inner structure.
200 GPUs is not that expensive. Amazon will rent you 1 GPU at spot for ~$0.2/hour, so <$1k/day.
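(Back-of-envelope, using those numbers: 200 GPUs × $0.20/hour × 24 hours ≈ $960/day.)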
Thanks for the clarification. If the size is really ~500 MB, it could easily be stolen or could run away on its own, and ~$1k a day seems affordable to a dedicated hacker.