Further disclaimer: Feel free to answer even if you don’t find debate promising, but note that I am primarily interested in hearing from people who actively work on it or find it promising, or at least from people who have a very good model of specific such people.
Motivation behind the question: People often mention Debate as a promising alignment technique. For example, the AI Safety Fundamentals curriculum features it quite prominently. But I think there is a lack of consensus on how, as far as the proposal is concerned, Debate is actually meant to be used. (For example, do we apply it during deployment, as a way of checking the safety of solutions proposed by other systems? Or do we use it during deployment, to generate solutions? Or do we use it to generate training data?) As far as I know, of all the existing work, only the Nov 2023 paper addresses my questions, and it only answers (Q2). But I am not sure to what extent the answer given there is canonical. So I am interested in knowing the opinions of people who currently endorse Debate.
Illustrating what I mean by the questions: If I were to answer questions 1-3 for RLHF, I could, for example, say that:
(1) RLHF is meant for turning a neural network trained for next-token prediction into, for example, an agent that acts as a chatbot and gives helpful, honest, and lawsuit-less answers.
(2) RLHF is used for generating training (or fine-tuning) data (or signal).
(3) It seems pretty good for this purpose, for AIs at roughly human level or below.