Is the idea that the helper AI allows the labeler to understand everything just as well as SmartVault does, so that there’s no difference in their respective Bayes nets, and so it works for SmartVault to use the labeler’s Bayes net?
Yes, that’s the main way this could work. The question is whether an AI understands things that humans can’t understand even by doing amplification/debate/RRM; our guess is yes, and the argument is mostly “until the builder explains why, gradient descent and science may just have pretty different strengths and weaknesses” (and we can make that more concrete by fleshing out what the world may be like and what the AI learns by gradient descent). But it seemed worth raising because this does appear to make the bad reporter’s job much harder and greatly restrict the space of cases where it fails to report tampering.
Methodologically, the way I think about this kind of thing is: (i) we had a counterexample, (ii) after making this change that particular counterexample no longer works, (iii) now we want to think through whether the counterexample can be adapted.
This is also legitimately less obvious. An AI can’t simulate (human+AI helpers), since each AI helper is as smart as the AI itself and so simulating (human+AI helpers) clearly requires more compute than the AI has. The counterexample is that the AI should just try its best to do inference in the Bayes net that includes “everything the human could understand with the amount of science they have time to do.”
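As a toy picture of that counterexample (the numbers and variable names below are made up for illustration, not taken from the report), “doing inference in the human Bayes net” could look like a reporter that ignores the predictor’s latent state and just applies Bayes’ rule in a stand-in for the human’s model, conditioning only on what the camera shows:

```python
# Hypothetical human model over two binary variables (assumed numbers):
#   tampered : were the sensors tampered with?
#   looks_ok : does the camera footage look fine?
P_TAMPERED = 0.1                             # human's prior on tampering
P_LOOKS_OK_GIVEN = {True: 0.95, False: 0.9}  # P(looks_ok | tampered=key)

def human_posterior_tampered(looks_ok: bool) -> float:
    """P(tampered | looks_ok) under the human's model, via Bayes' rule."""
    def likelihood(tampered: bool) -> float:
        p = P_LOOKS_OK_GIVEN[tampered]
        return p if looks_ok else 1.0 - p
    joint_tampered = P_TAMPERED * likelihood(True)
    joint_clean = (1.0 - P_TAMPERED) * likelihood(False)
    return joint_tampered / (joint_tampered + joint_clean)

def bad_reporter(observation_looks_ok: bool) -> str:
    # The bad reporter ignores the predictor's latent state entirely and
    # reports whatever the human would conclude from the observation alone.
    return "tampering" if human_posterior_tampered(observation_looks_ok) > 0.5 else "no tampering"

# Even on an episode where the predictor's own computation "knows" the sensors
# were tampered with, this reporter answers based only on the footage:
print(bad_reporter(observation_looks_ok=True))   # -> "no tampering"
```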
But that does still leave the builder with avenues to try to strengthen the algorithm and win. One way is discussed in the section on speed regularization: if the AI is “trying its best” to do inference in the human Bayes net then there might always be returns to having more time to think (and so it might be able to benefit by transferring over its understanding of what was happening in the AI Bayes net rather than recomputing from the observations). The next step for a builder who wanted to take this approach would be to argue that they can reliably construct a complex enough dataset that this advantage is relevant.
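For concreteness, here is a minimal sketch of the speed-regularization idea being appealed to; the penalty form, weight, and step counts are assumptions for illustration rather than anything specified in the report:

```python
# Assumed setup: both a "direct translator" and a "human simulator" answer the
# training questions correctly, so the tie is broken by a compute penalty.
def regularized_loss(answer_loss: float, reporter_compute_steps: int,
                     speed_penalty: float = 1e-4) -> float:
    """Training loss = answer loss + a per-step penalty on reporter compute."""
    return answer_loss + speed_penalty * reporter_compute_steps

# A reporter that reads the answer off the predictor's state needs few steps;
# one that re-runs inference in the human Bayes net from raw observations
# needs many more, and so pays a larger penalty on complex datapoints.
print(regularized_loss(answer_loss=0.02, reporter_compute_steps=50))      # fast: 0.025
print(regularized_loss(answer_loss=0.02, reporter_compute_steps=20_000))  # slow: 2.02
```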
My guess is that this doesn’t work on its own, but if you could scalably construct more complex data then it might work when combined with imitative generalization, as discussed here.
This is an interesting tack; this step and the next (“Strategy: have humans adopt the optimal Bayes net”) feel new to me.
Question: what’s the relative amount of compute you are imagining SmartVault and the helper AI having? Both the same, or one having a lot more?
It will depend on how much high-quality data you need to train the reporter. Probably it’s a small fraction of the data you need to train the predictor, and so for generating each reporter datapoint you can afford to spend many times more compute than the predictor usually uses per datapoint. I often imagine the helpers having 10-100x more computation time.
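To make that budget concrete, here is some back-of-the-envelope arithmetic; every number below is an assumption chosen only to illustrate why a 10-100x helper multiplier can stay affordable:

```python
# Assumed quantities, measured in "predictor-forward-passes per datapoint".
predictor_datapoints = 1_000_000_000   # datapoints used to train the predictor
reporter_datapoints  = 1_000_000       # a small fraction: labeled reporter data
helper_multiplier    = 100             # helpers get ~10-100x per-example compute

# Extra compute spent by helpers, relative to ordinary predictor training:
overhead = (reporter_datapoints * helper_multiplier) / predictor_datapoints
print(overhead)   # -> 0.1, i.e. a modest addition to the overall training budget
```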