AlphaZero (just the policy network) I’m more confused about… I expect it still isn’t doing search, but it is literally trained to imitate the outcome of a search, so it might have similar mis-generalization properties?
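Concretely, the policy head's training target in that setup is the move distribution that search produced, not the move the raw network would have picked on its own. Here is a minimal PyTorch sketch of that loss; `PolicyNet`, `run_mcts`, and all shapes are purely illustrative (the "search" is faked with noise so the snippet runs end to end), not the actual DeepMind implementation:

```python
# Sketch: the policy network imitates the *outcome of a search* (a visit-count
# distribution over moves), rather than being trained on "the right move" directly.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    def __init__(self, n_features: int, n_moves: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU())
        self.head = nn.Linear(128, n_moves)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(x))  # move logits

def run_mcts(state: torch.Tensor, net: PolicyNet, n_moves: int) -> torch.Tensor:
    """Stand-in for MCTS: returns a visit-count distribution over moves.
    Faked here with softmaxed noise so the sketch is runnable."""
    with torch.no_grad():
        return F.softmax(net(state) + torch.randn(n_moves), dim=-1)

n_features, n_moves = 32, 10
net = PolicyNet(n_features, n_moves)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

state = torch.randn(n_features)
visit_dist = run_mcts(state, net, n_moves)  # the "outcome of a search"

# Cross-entropy between the raw policy and the search's visit distribution:
# the network learns to reproduce what search concluded from this position.
logits = net(state)
loss = -(visit_dist * F.log_softmax(logits, dim=-1)).sum()
opt.zero_grad(); loss.backward(); opt.step()
```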
This suggests that the choice of decision theory used to amplify a decision-making model (in the sense of IDA/HCH, or just the way MCTS is used in training AlphaZero) might influence the robustness of its behavior far off-distribution, even if its behavior around the training distribution is not visibly sensitive to the choice of decision theory used for amplification.
Though perhaps this sense of “robustness” is not the right one, and a better notion should be explicitly based on reflection/extrapolation from behavior in familiar situations, with the expectation that all models fail to be robust sufficiently far off-distribution (in the crash space), and that new models must always be prepared in advance of going there.
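The amplify-and-distill loop the previous paragraphs gesture at can be written down abstractly, with the amplifier (MCTS for AlphaZero, some decision-theory-following deliberation scheme for IDA/HCH) as an explicit parameter of training. This is only a toy sketch with illustrative names; the point it tries to make visible is that the distilled model only ever sees the amplifier's conclusions on the training distribution, so its off-distribution behavior is whatever the distillation happens to generalize to:

```python
# Toy amplify-then-distill loop. "Model" and "Amplifier" are stand-ins, not any
# real library's API; distillation is a lookup table so the sketch is runnable.
from typing import Callable, List, Tuple

Model = Callable[[str], str]             # state -> action
Amplifier = Callable[[Model, str], str]  # (model, state) -> improved action

def distill(data: List[Tuple[str, str]]) -> Model:
    """Stand-in for supervised training on (state, amplified action) pairs."""
    lookup = dict(data)
    # Off-distribution, the "network" falls back to a default that the
    # amplifier never constrained -- the mis-generalization worry.
    return lambda s: lookup.get(s, "default-action")

def iterate(model: Model, amplify: Amplifier, states: List[str], rounds: int) -> Model:
    for _ in range(rounds):
        data = [(s, amplify(model, s)) for s in states]  # amplification step
        model = distill(data)                            # distillation step
    return model

# Two different "amplifiers" standing in for two decision theories.
greedy: Amplifier = lambda m, s: m(s)              # no real deliberation
cautious: Amplifier = lambda m, s: "safe-" + m(s)  # a different deliberation rule

base: Model = lambda s: "move"
print(iterate(base, greedy, ["a", "b"], rounds=2)("a"))      # on-distribution
print(iterate(base, cautious, ["a", "b"], rounds=2)("zzz"))  # off-distribution fallback
```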