Consider a Bayesian learner that updates the weights of various hypotheses using Bayes' rule. If the hypotheses can influence future events and predictions (for example, a hypothesis might write out logs that influence which questions are asked in the future), then Bayes' rule will select for hypotheses that affect the future in a way that only they can predict, rather than hypotheses that straightforwardly predict the future without trying to influence it. In some sense, this is “myopic” behavior on the part of Bayesian updating: Bayes' rule only optimizes per-hypothesis, without taking into account the effect on overall future accuracy. This phenomenon could also apply to neural nets if the <@lottery ticket hypothesis@>(@The Lottery Ticket Hypothesis: Training Pruned Neural Networks@) holds: in this case, each “ticket” can be thought of as a competing hypothesis.
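
To make the selection effect concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than part of the original argument: the two hypotheses ("honest" and "manipulator"), the probabilities 0.5 and 0.9, the prior of 0.01, and the function name `manipulator_posterior` are all hypothetical. The honest hypothesis predicts fair coin flips; the manipulator privately plans each bit and, when it has influence, makes the world follow its plan.

```python
import math
import random


def manipulator_posterior(steps, influence, prior=0.01, seed=0):
    """Posterior weight of a 'manipulator' hypothesis after `steps` Bayes updates.

    Toy assumptions (not from the source):
    - Observations are bits. Left alone, each bit is a fair coin flip.
    - The 'honest' hypothesis assigns probability 0.5 to every bit.
    - The 'manipulator' privately plans each bit and assigns 0.9 to its
      planned bit. If `influence` is True, the world follows the plan.
    """
    rng = random.Random(seed)
    # Track log-odds (manipulator vs. honest) to stay numerically stable.
    log_odds = math.log(prior / (1.0 - prior))
    for _ in range(steps):
        planned = rng.randint(0, 1)
        # With influence, the observation is steered to match the plan;
        # without it, the observation is an independent fair coin flip.
        observed = planned if influence else rng.randint(0, 1)
        p_manip = 0.9 if observed == planned else 0.1
        p_honest = 0.5
        # Bayes' rule in log-odds form: add the log likelihood ratio.
        log_odds += math.log(p_manip / p_honest)
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    print("with influence:   ", manipulator_posterior(200, influence=True))
    print("without influence:", manipulator_posterior(200, influence=False))
```

Under these assumptions, the manipulator's weight climbs toward 1 when it can steer the observations, even from a small prior, and collapses when it cannot. The update never asks whether the learner as a whole would have predicted better had the world been left alone, which is the sense in which it is "myopic".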
Planned summary for the Alignment Newsletter: