1.1.3: How do we train it to answer questions comprehensively?
Reward it for doing so, and punish it for failing to do so.
This reward function is, by hypothesis, uncomputable. If we do not understand what it is doing without an explanation, how can we judge the correctness of its explanation? A resolution might hinge on the complexity-theoretic distinction between searching for an answer to a problem and verifying one, but consider instead an analogy:
Imagine being a meticulous boss who asks his employee to put together a report. Imagine grilling him about the report, and punishing him every time he fails to answer your questions clearly enough or at a satisfactory level of detail, in addition to punishing him for low-quality reports. If your employee is capable enough, he’ll eventually learn to produce high-quality reports and answer questions satisfactorily when you grill him.
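To make that training signal concrete, here is a minimal sketch of the grilling loop. It is illustrative only: `agent`, `followups`, and `score` are hypothetical stand-ins for the trained model, the overseer's follow-up questions, and the overseer's judgment, not pieces of any actual proposal.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Exchange:
    question: str
    answer: str


def grill(agent: Callable[[str], str],
          followups: Callable[[Exchange], List[str]],
          score: Callable[[Exchange], float],
          task: str,
          max_rounds: int = 3) -> float:
    """Score the report, then keep asking follow-up questions; every answer
    judged unclear or insufficiently detailed is scored negatively and drags
    the total reward down."""
    report = agent(task)
    total = score(Exchange(task, report))
    pending = followups(Exchange(task, report))
    for _ in range(max_rounds):
        next_round: List[str] = []
        for q in pending:
            a = agent(q)
            total += score(Exchange(q, a))
            next_round.extend(followups(Exchange(q, a)))
        pending = next_round
        if not pending:
            break
    return total  # used as the training signal for the agent
```

Everything here hinges on the hypothetical `score`, which is exactly where the next objection bites.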
This proposal judges explanations by plausibility and articulateness. Truthfulness is only incidentally relevant and will be Goodharted away.
Keep in mind that the overseer (two steps forward) is always far more powerful than the agent we’re distilling (one step back), is trained to not Goodhart, is training the new agent to not Goodhart (this is largely my interpretation of what corrigibility gets you), and is explicitly searching for ways in which the new agent may want to Goodhart.
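For concreteness, a minimal sketch of that structure, showing only the capability gap: `amplify`, `distill`, and `red_team` are hypothetical placeholders for amplification, distillation, and the overseer's adversarial checks, not a real API.

```python
from typing import Callable

Agent = Callable[[str], str]  # maps a question to an answer


def train(agent: Agent,
          amplify: Callable[[Agent], Agent],
          distill: Callable[[Agent], Agent],
          red_team: Callable[[Agent, Agent], None],
          steps: int) -> Agent:
    """Each round, the overseer is the amplified previous agent (two steps
    forward) and the newly distilled agent is weaker than it (one step back),
    so the overseer supervising any given agent is always strictly stronger."""
    for _ in range(steps):
        overseer = amplify(agent)       # strictly more capable than `agent`
        candidate = distill(overseer)   # cheaper approximation of the overseer
        red_team(overseer, candidate)   # overseer searches for ways the
                                        # candidate might Goodhart
        agent = candidate
    return agent
```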
Well, but Goodhart lurks in the soul of all of us; the question here is something like “what needs to be true about the overseer such that it does not Goodhart (and can recognize it in others)?”
Corrigibility. Without corrigibility I would be just as scared of Goodhart.
The proposed solution relies on the same thing as the rest of the scheme: that you can evaluate answers to a hard question Q if you already know the answers to some set of easier subquestions Q_i. Then proceed by induction, decomposing until you reach questions simple enough to evaluate directly.
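A minimal sketch of that recursive evaluation, assuming the overseer can decompose questions and judge simple ones directly; `is_simple`, `judge_directly`, `decompose`, `answer_sub`, and `combine` are hypothetical stand-ins for those capabilities.

```python
from typing import Callable, List


def judge(question: str,
          answer: str,
          is_simple: Callable[[str], bool],
          judge_directly: Callable[[str, str], float],
          decompose: Callable[[str], List[str]],
          answer_sub: Callable[[str], str],
          combine: Callable[[str, str, List[float]], float]) -> float:
    """Judge an answer to a hard question Q via judgments of answers to
    easier subquestions Q_i, recursing until the questions are simple
    enough to judge directly."""
    if is_simple(question):
        return judge_directly(question, answer)
    subscores = [
        judge(q, answer_sub(q), is_simple, judge_directly,
              decompose, answer_sub, combine)
        for q in decompose(question)
    ]
    return combine(question, answer, subscores)
```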