Notice that typically we use the AI for tasks which are hard for H. This means that without the AI’s help, H’s probability of success will usually be low. Quantilization-wise, this is a problem: the AI will be able to eliminate those paths for which H will report failure, but maybe most of the probability mass among apparent-success paths is still on failure (i.e. the success report is corrupt). This is why the timeline part is important.
On a typical task, H expects to fail eventually but they don’t expect to fail soon. Therefore, the AI can safely consider a policies of the form “in the short-term, do something H would do with marginal probability, in the long-term go back to H’s policy”. If by the end of the short-term maneuver H reports an improved prognosis, this can imply that the improvement is genuine (since the AI knows H is probably uncorrupted at this point). Moreover, it’s possible that in the new prognosis H still doesn’t expect to fail soon. This allows performing another maneuver of the same type. This way, the AI can iteratively steer the trajectory towards true success.
This is about right.
Notice that typically we use the AI for tasks which are hard for H. This means that without the AI’s help, H’s probability of success will usually be low. Quantilization-wise, this is a problem: the AI will be able to eliminate those paths for which H will report failure, but maybe most of the probability mass among apparent-success paths is still on failure (i.e. the success report is corrupt). This is why the timeline part is important.
On a typical task, H expects to fail eventually but they don’t expect to fail soon. Therefore, the AI can safely consider a policies of the form “in the short-term, do something H would do with marginal probability, in the long-term go back to H’s policy”. If by the end of the short-term maneuver H reports an improved prognosis, this can imply that the improvement is genuine (since the AI knows H is probably uncorrupted at this point). Moreover, it’s possible that in the new prognosis H still doesn’t expect to fail soon. This allows performing another maneuver of the same type. This way, the AI can iteratively steer the trajectory towards true success.