I like this post a lot. I pretty strongly agree that process-level feedback (what I would probably call mechanistic incentives) is necessary for inner alignment—and I’m quite excited about understanding what sorts of learning mechanisms we should be looking for when we give process-level feedback (and recursive quantilization seems like an interesting option in that space).
Since detecting malign hypotheses is difficult, we want the learning system to help us out here. It should generalize from examples of malign hypotheses, and attempt to draw a broad boundary around malignancy. Allowing the system to judge itself in this way can of course lead to malign reinterpretations of user feedback, but hopefully allows for a basin of attraction in which benevolent generalizations can be learned.
Notably, one way to get this is to have the process-level feedback given by an overseer implemented as a human with access to a prior version of the model being overseen (and then to train the model both on the oversight signal directly and to match the amplified human’s behavior while doing oversight), as in relaxed adversarial training.
I guess you could say what I’m after is a learning theory of generalizing process-level feedback—I think your setup could do a version of the thing, but I’m thinking about Bayesian formulations because I think it’s an interesting challenge. Not because it has to be Bayesian, but because it should be in some formal framework which allows loss bounds, and if it ends up not being Bayesian I think that’s interesting/important to notice.
(Which might then translate to insights about a neural net version of the thing?)
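To make the “Bayesian formulation with loss bounds” angle concrete, here is a minimal sketch (the names and interface are hypothetical, not from the post) of a Bayesian mixture predictor where ordinary Bayes updates handle outcome-level feedback and process-level feedback zeroes out hypotheses the overseer flags as malign; the usual log-loss regret bound relative to any surviving hypothesis is preserved.

```python
import numpy as np

class MixturePredictor:
    """Minimal Bayesian mixture sketch (hypothetical interface, not from the post).

    Each hypothesis maps a history to the probability that the next bit is 1.
    `update` is ordinary outcome-level feedback; `process_feedback` is
    process-level feedback that eliminates hypotheses the overseer judges
    malign, regardless of how well they predict.
    """

    def __init__(self, hypotheses, prior=None):
        self.hypotheses = list(hypotheses)
        n = len(self.hypotheses)
        self.weights = np.full(n, 1.0 / n) if prior is None else np.asarray(prior, dtype=float)

    def predict(self, history):
        # Mixture probability that the next bit is 1.
        p = np.array([h(history) for h in self.hypotheses])
        return float(self.weights @ p)

    def update(self, history, outcome):
        # Standard Bayes update on the observed bit.
        p = np.array([h(history) for h in self.hypotheses])
        self.weights *= p if outcome == 1 else 1.0 - p
        self.weights /= self.weights.sum()

    def process_feedback(self, malign_indices):
        # Zero out flagged hypotheses and renormalize. Surviving hypotheses only
        # gain weight, so the standard regret bound still holds: the mixture's
        # cumulative log loss exceeds any surviving hypothesis's log loss by at
        # most -log(that hypothesis's prior weight).
        self.weights[list(malign_indices)] = 0.0
        self.weights /= self.weights.sum()


# Example: two mundane hypotheses plus one the overseer flags as malign.
m = MixturePredictor([lambda h: 0.9, lambda h: 0.5, lambda h: 0.1])
m.update(history=[], outcome=1)
m.process_feedback([2])
print(m.predict(history=[]))
```

The hard part this sketch ignores, of course, is getting the system to generalize from a few flagged examples to a broad boundary around malignancy, rather than needing every malign hypothesis pointed out by hand.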
I suppose another part of what’s going on here is that I want several of my criteria working together—I think they’re all individually achievable, but part of what’s interesting to me about the project is the way the criteria jointly reinforce the spirit of the thing I’m after.
Actually, come to think of it, factored-cognition-style amplification (was there a term for this type of amplification specifically? Ah, imitative amplification) gives us a sort of “feedback at all levels” capability, because the system is trained to answer all sorts of questions. So it can be trained to answer questions about human values, and meta-questions about learning human values, and doubly and triply meta questions about how to answer those questions. These are all useful to the extent that the human in the amplification makes use of those questions and answers while doing their tasks.
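As a toy illustration of the “feedback at all levels” point (everything here is hypothetical, just a sketch of the recursion): an HCH-style answerer treats object-level, meta, and meta-meta questions uniformly, since each level is just another question posed to the human.

```python
def hch_answer(question, human, depth=3):
    """Toy sketch of imitative-amplification-style question answering.

    `human` is a hypothetical function taking a question plus an `ask` callback
    for posing subquestions. Because every level is just another question,
    feedback can in principle target the object level ("what does the user
    value?"), the meta level ("how should we learn what the user values?"),
    and so on, all through the same interface.
    """
    if depth == 0:
        return human(question, ask=lambda sub: "(recursion budget exhausted)")
    return human(question, ask=lambda sub: hch_answer(sub, human, depth - 1))


def toy_human(question, ask):
    # Caricature: the human consults a meta-level subquestion before answering.
    if question == "What does the user value?":
        advice = ask("How should we learn what the user values?")
        return f"(using advice: {advice}) something object-level"
    return "break the question into smaller pieces and aggregate"


print(hch_answer("What does the user value?", toy_human))
```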
One thing this lacks (in comparison to recursive quantilizers, at least) is any notion of necessarily optimizing things through search at all. Information and meta-information about values and how to optimize them are not necessarily used in HCH, because questions aren’t necessarily answered via search. This can of course be a safety advantage. But it is operating on an entirely different intuition than recursive quantilizers. So, the “ability to receive feedback at all levels” means something different.
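For contrast, here is the search-side intuition at its simplest: a plain, non-recursive quantilizer (a sketch with hypothetical helpers). It draws candidates from a trusted base distribution, keeps the top q fraction by estimated utility, and samples uniformly from those rather than taking the argmax, so the objective is only optimized so hard and no harder. Recursive quantilizers build on this basic operation.

```python
import random

def quantilize(base_sample, utility, q=0.05, n=1000, rng=random):
    """Plain quantilizer sketch; `base_sample` and `utility` are hypothetical.

    Draw n candidates from a trusted base distribution, rank them by the
    (possibly misspecified) utility, and return a uniform sample from the
    top q fraction instead of the argmax.
    """
    candidates = sorted((base_sample() for _ in range(n)), key=utility, reverse=True)
    return rng.choice(candidates[: max(1, int(q * n))])


# Hypothetical usage: actions are numbers, the base distribution is a standard
# normal, and the utility rewards being close to 2.
action = quantilize(lambda: random.gauss(0.0, 1.0), utility=lambda a: -abs(a - 2.0))
print(action)
```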
My intuition is that if I were to articulate more desiderata with the HCH case in mind, it might have something to do with the “learning the human prior” stuff.