Differentially make progress on alignment, decreasing the difficulty gap between training a model to be persuasive and training a model to give a correct explanation. Currently, it is much easier to scale the former (just ask labellers whether they were persuaded) than the latter (you need domain experts to check that the explanation was actually correct).
AFAICT, the biggest difficulty gap is (and probably will be) in philosophy: it's just as easy as in any other area to ask labellers whether they are persuaded by some philosophical argument, but we have little idea (both relative to other areas and in an absolute sense) what constitutes “philosophical truth” or what makes an explanation “correct” in philosophy. So I see solving these metaphilosophical problems as crucial to defending against AI persuasion. Do you agree, and if so, why is there no mention of metaphilosophy in this otherwise fairly comprehensive post on AI persuasion?