I agree that open consultancy will likely work less well than debate in practice for very smart AIs, especially if they don’t benefit much from CoT, in large part because interactions between arguments are often important. But I think it’s not clear when and for which tasks this starts to matter, and that is an empirical question (no need to argue about what is “grounded”). I’m also not convinced that judge calibration is an issue, or that getting a well-calibrated probability matters as opposed to getting an accurate 0-1 answer.
Maybe the post overstates the extent to which open consultancy dominates regular consultancy, but I still think you should be able to identify the kinds of questions on which you have no signal, and avoid the weird distributions of convincingness / non-expert calibration where the noise from regular consultancy beats actual random noise on top of open consultancy.