I think the special brain algorithms in question—e.g., the ones that make us comfortable entrusting a neurotypical human to decide who won in the set-up above—are more familiarly thought of as prosocial or moral cognition. A claim like this would predict that we would be uncomfortable entrusting humans who lacked the relevant prosocial instincts (e.g., psychopaths) to oversee a safety-via-debate-type set-up, which seems correct.
This doesn’t match my intuitions, but I haven’t thought about safety-via-debate too much.
If the debate questions are like “would this plan be good for humanity?”, then no, I don’t want psychopaths adjudicating.
If the questions are isolated and factual, like “does this plan lead to a diamond being synthesized in this particular room?”, then I don’t think I mind (except, perhaps, that some part of me objects to including psychopaths in any endeavor).
If the questions are factual but feed into moral decisions, and the psychopathic judge knows that—like “does this plan lead to lots of people dying?”—then the psychopath would again implicitly be judging based on expected outcomes (even if they’re supposed not to consider anything outside the context of the debate; people are people). And I wouldn’t want them involved.
Agreed that there are important subtleties here. In this post, I am really just using the safety-via-debate set-up as a sort of intuitive case for getting us thinking about why we generally seem to trust certain algorithms running in the human brain to adjudicate hard evaluative tasks related to AI safety. I don’t mean to be making any especially specific claims about safety-via-debate as a strategy (in part for precisely the reasons you specify in this comment).