I don’t recall seeing anything addressing this directly: has there been any progress towards dealing with concerns about Goodharting in debate, and more generally the risk of mesa-optimization in the debate approach? The typical risk scenario is something like: training debate creates AIs good at convincing humans rather than at convincing humans of the truth, and once you leave the training set of questions where the truth can be reasonably determined independently of the debate mechanism, we’ll experience what amounts to a treacherous turn, because the debate training process accidentally optimized for a different target (convince humans) than the one intended (convince humans of true statements).
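To make the gap concrete, here’s a minimal toy sketch of the two objectives (the functions and the verifiable/unverifiable split are purely my own illustrative assumptions, not how any actual debate experiment is set up): the training signal only scores whether the judge ended up convinced, and that coincides with “convinced of the truth” only on a distribution where the judge can check claims independently.

```python
# Toy sketch of the proxy/intended objective gap (all names and the
# "verifiable" split are illustrative assumptions, not a real debate setup).
import random

random.seed(0)

def sample_question(verifiable: bool) -> dict:
    """A question with a ground-truth answer the judge may or may not
    be able to check independently of the debate."""
    return {"verifiable": verifiable, "true_answer": random.choice([0, 1])}

def judge_reward(claim: int, persuasiveness: float, q: dict) -> float:
    """The actual training signal: was the judge convinced?"""
    if q["verifiable"]:
        # On the training distribution, persuasion only lands for true claims.
        return persuasiveness if claim == q["true_answer"] else 0.0
    # Off-distribution, persuasion works whether or not the claim is true.
    return persuasiveness

def intended_reward(claim: int, persuasiveness: float, q: dict) -> float:
    """The target we actually wanted: convince the judge of true statements."""
    return persuasiveness if claim == q["true_answer"] else 0.0

# On the training distribution the two rewards coincide...
q_train = sample_question(verifiable=True)
true_claim = q_train["true_answer"]
assert judge_reward(true_claim, 0.9, q_train) == intended_reward(true_claim, 0.9, q_train)

# ...but a policy that has only learned persuasion keeps collecting full
# training reward on unverifiable questions, while the intended reward
# for a false claim stays at zero.
q_deploy = sample_question(verifiable=False)
false_claim = 1 - q_deploy["true_answer"]
print(judge_reward(false_claim, 0.9, q_deploy))     # 0.9: the proxy is satisfied
print(intended_reward(false_claim, 0.9, q_deploy))  # 0.0: the intent is not
```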
For myself, this continues to be a concern that seems inadequately addressed, and it makes me nervous about the safety of debate, much less its adequacy as a safety mechanism.
I wanted to note that I’m also quite worried about this when it comes to debate (or, really, most things with human models), but I think this is an empirical question about human psychology / training mechanisms, where making progress on it (at this stage) looks a lot like just making generic progress. If we have a small set of people who can judge debates well, that’s sufficient to eventually use this approach; if we can identify the set of cognitive tools judges need in order to succeed, that’s useful; but until we have a bunch of debates and throw humans at them, it’s not obvious how we’ll get empirical answers to these empirical questions.
I would be interested in how much current work on debate is motivated by the “reason works” story where truth reliably has an edge, and how much of it is motivated by something more like computational complexity concerns (where, in the presence of optimization processes, you can apply a constraint to an execution trace and the processes generalize that constraint out to their possibility space). For the latter, you might give up on human-readable language and human evaluation of correctness without giving up on the general cognitive structure.
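For the complexity-style motivation, here’s a toy sketch of the “constrain an execution trace” idea (this is my own illustrative construction, not code from any debate implementation): two debaters dispute the result of a long computation, the dispute is recursively narrowed to a single step, and a judge who can only verify that one step ends up constraining the whole trace, because the optimizing debaters have to keep their claims consistent everywhere.

```python
def debate_sum(xs, liar_offset=7):
    """Toy debate over the value of sum(xs). The honest debater claims the
    true total; the liar claims the total plus liar_offset. Each round both
    commit to a subtotal for the left half of the remaining range, and the
    dispute recurses into whichever half they disagree on. The judge only
    ever checks a single element of the original trace."""
    lo, hi = 0, len(xs)
    honest_claim = sum(xs)
    liar_claim = honest_claim + liar_offset

    while hi - lo > 1:
        mid = (lo + hi) // 2
        honest_left = sum(xs[lo:mid])
        # One possible liar strategy: report the left half truthfully and
        # push the entire discrepancy into the right half. The protocol
        # handles either case; with this strategy the disagreement always
        # ends up on the right.
        liar_left = honest_left
        if honest_left != liar_left:
            hi = mid
            honest_claim, liar_claim = honest_left, liar_left
        else:
            lo = mid
            honest_claim -= honest_left
            liar_claim -= liar_left

    # The judge's only job: verify one step of the computation.
    return "honest wins" if xs[lo] == honest_claim else "liar wins"

print(debate_sum([3, 1, 4, 1, 5, 9, 2, 6]))  # -> "honest wins"
```

The point of the sketch is that the judge never evaluates the full argument, only one leaf of it; the pressure to win the recursive game is what propagates that single check back over the whole computation, which is the sense in which a constraint on one step of a trace gets generalized by the optimizing processes.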