I propose a counterexample. Suppose we are playing a series of games against another agent. To play effectively, we train a circuit to predict the opponent's moves. At this point the circuit already contains an adversarial agent. One could object that this is unfair: we asked for an adversarial agent, so we got one (though for AI alignment it is still a problem). To remove the objection, let's make some further assumptions. The training is done on some set of games, but distributional shift happens and later games are different. The opponent knows this, so on the training games it simulates a different agent. Specifically, it simulates an agent that searches for a strategy such that the best response to this strategy has the strongest counter-response. The minimal circuit predicting the opponent hence contains the same agent. On the training data we win, but after the shift the opponent switches to that counter-response: the daemon has deceived us, and we lose.
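To make the nested search concrete, here is a minimal toy sketch, with rock–paper–scissors standing in for the game and a zero-sum payoff matrix; the function names (`best_response`, `counter_response`, `deceptive_strategy`) are my own illustrative choices, not anything from the original setup:

```python
import numpy as np

# Toy zero-sum matrix game: payoff[a, b] = our payoff when we play a
# and the opponent plays b. 0 = rock, 1 = paper, 2 = scissors.
NAMES = ["rock", "paper", "scissors"]
payoff = np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
])

def best_response(opp_move):
    """Our best reply to a (predicted) opponent move."""
    return int(np.argmax(payoff[:, opp_move]))

def counter_response(our_move):
    """Opponent's strongest reply to our move (minimizes our payoff)."""
    return int(np.argmin(payoff[our_move, :]))

def deceptive_strategy():
    """The opponent's training-time strategy: the one whose best response
    has the strongest counter-response, i.e. the one that maximizes the
    opponent's payoff once it later switches to that counter."""
    return max(
        range(len(NAMES)),
        key=lambda s: -payoff[best_response(s), counter_response(best_response(s))],
    )

s = deceptive_strategy()      # what the opponent shows during training
ours = best_response(s)       # what our predictor-based circuit learns to play
c = counter_response(ours)    # what the opponent actually plays after the shift

print(f"training:   we play {NAMES[ours]} vs {NAMES[s]}, payoff {int(payoff[ours, s]):+d}")
print(f"deployment: we play {NAMES[ours]} vs {NAMES[c]}, payoff {int(payoff[ours, c]):+d}")
# training:   we play paper vs rock, payoff +1
# deployment: we play paper vs scissors, payoff -1
```

In rock–paper–scissors every candidate strategy ties in the outer search, so the argmax there is degenerate, but the two printed payoffs show the pattern: the learned best response wins on the training distribution and loses once the opponent switches to the pre-computed counter-response.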