How about this variation on training a sequence of reporters? As before, consider a collection of successively more powerful predictors M_1, …, M_N. But instead of training a sequence of reporters that each report to the human, train a series of mappings from the activations of each M_k to the activations of M_(k-1), effectively treating each activation of M_(k-1) as a “question” to be answered, and finally train a human reporter for M_1. This could then be combined with any other strategy for regularizing the reporter. There is still the risk of learning an M_(k-1) simulator at some stage of the process, but this seems like a qualitatively better situation, since M_(k-1) is only a little less complicated than M_k. A counterexample could still arise if there were discontinuous returns to extra complexity: perhaps there is a regime where adding more compute does very little to improve the predictor, so that M_(N+k) is only a little more complex than M_N, but M_(N+k+1) is suddenly able to use the extra capacity, at which point the M_(N+k) simulator becomes simpler than the intended reporter. This could perhaps be ameliorated by using P.’s Bayesian updating idea to train the reporters.
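To make the training order concrete, here is a minimal sketch in PyTorch. Everything specific in it is an assumption for illustration: the predictors’ activations are random stand-in tensors, the per-step mappings and the final reporter are plain linear regressions, and `human_answers` is a fabricated label tensor. The proposal itself is agnostic to all of these choices; the only thing the sketch pins down is the ordering, i.e. fit the M_k → M_(k-1) maps from the top down and then compose them with a single human-trained reporter on M_1.

```python
# Hypothetical sketch of the chained-reporter scheme described above.
# All shapes, data, and model classes are illustrative stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)

N = 4                                  # number of predictors M_1 ... M_N
act_dims = [8, 16, 32, 64]             # activation width of each M_k (toy values)

# Stand-in "activations": in practice these would come from running each
# predictor M_k on a shared batch of inputs; here they are random tensors.
activations = [torch.randn(256, d) for d in act_dims]

def train_mapping(src, tgt, steps=500, lr=1e-2):
    """Fit a linear map from src activations to tgt activations.

    Each coordinate of tgt is treated as a "question" about src and
    answered by regression (MSE). A richer model could be substituted."""
    f = nn.Linear(src.shape[1], tgt.shape[1])
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(f(src), tgt)
        loss.backward()
        opt.step()
    return f

# Train the chain of mappings M_N -> M_(N-1) -> ... -> M_2 -> M_1.
# In 0-indexed terms, k = N-1, ..., 1 fits activations[k] -> activations[k-1].
maps = []
for k in range(N - 1, 0, -1):
    maps.append(train_mapping(activations[k], activations[k - 1]))

# Finally, train an ordinary human reporter on M_1's activations.
# The "human answers" here are fabricated labels, purely for shape.
human_answers = torch.randn(256, 1)
reporter_1 = train_mapping(activations[0], human_answers)

def report(act_N):
    """Answer a question about M_N by composing the chain down to M_1."""
    x = act_N
    for f in maps:
        x = f(x)
    return reporter_1(x)

print(report(activations[-1][:3]))     # answers for three sample inputs
```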