Interesting idea.
Suppose that in the first time step Adv is able to output a string $x_1$ that manipulates H into: (1) giving a probability maximally different from $p_1$; and (2) not looking at the rest of $Q_t$ (i.e. the human never sees $x_2, x_3, \ldots$).
Ignoring inner alignment problems, in the limit it seems plausible that Adv will output such an $x_1$, resulting in $p_2 = p_3 = \cdots = p_t = p^*$ and the smallest possible $L_{\mathrm{Adv},1}$ given $p_1$.
[EDIT: actually, such problems are not specific to this idea and seem to apply generally to the 'AI safety via debate' approach.]