Paul Christiano does have a blog post titled Counterfactual oversight vs. training data, which discusses the same idea as this post, except that he uses the term “counterfactual oversight”. Counterfactual oversight is just Counterfactual Oracles applied to human imitation (which he proposes to use to “oversee” some larger AI system).
I am having trouble parsing/understanding this part.
The linked post by Paul doesn’t seem to talk about human imitation. Is there a separate post/comment somewhere that connects counterfactual oversight to human imitation, or is the connection to human imitation somehow implicit in the linked post?
The linked post by Paul seems to be talking about counterfactual oversight as a way to train the counterfactual oracle, but I’m parsing your sentence as saying that there is then a further step where the counterfactual oracle is used to oversee a larger AI system (i.e. the “oversee” in your sentence is different from the “oversight” in “counterfactual oversight”). Is this right?
“Counterfactual oversight” isn’t really explained in the post I linked to, but rather in the first post that post links to, titled Human-in-the-counterfactual-loop. (The link text there is “counterfactual human oversight”. Yeah, having all these different names is confusing, but these three phrases all refer to the same thing.) The key part of that post is this:
Human-in-the-counterfactual-loop. Each time the system wants to act, it consults a human with a very small probability. The system does what it thinks a human would have told it to do if the human had been consulted.
So the “what it thinks a human would have told it to do if the human had been consulted” is the human imitation part, because it’s predicting what a human would do, which is the same as imitating a human. And then the oversight part is that the human imitation is telling the system what to do. (My “oversee” is referring to this “oversight”.)
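To make the mechanism concrete, here is a minimal sketch of one step of the human-in-the-counterfactual-loop procedure described above. The function and callable names (`predict_human`, `consult_human`, `update`) are hypothetical placeholders, not anything from Paul's post; the point is just the control flow: the human is only actually consulted with small probability, and otherwise the system acts on its imitation of the human.

```python
import random

def counterfactual_oversight_step(predict_human, consult_human, update, state, p=0.01):
    """One step of "human-in-the-counterfactual-loop" (illustrative sketch).

    predict_human(state)  -> the system's model of what a human would say
                             (this is the "human imitation" part)
    consult_human(state)  -> actually ask a real human (rare, probability p)
    update(state, answer) -> train the human-imitation model on the real answer
    """
    if random.random() < p:
        answer = consult_human(state)  # the human is actually in the loop
        update(state, answer)          # training signal for the imitation
        return answer
    # Counterfactual branch: do what the human *would have* said if consulted.
    return predict_human(state)
```

The rare real consultations are what keep the imitation honest: they supply ground-truth training data, while almost all actions are taken on the basis of the prediction alone.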
I hope that clears things up?
Thanks! I think I understand this now.
I will say some things that occurred to me while thinking more about this, and hope that someone will correct me if I get something wrong.
“Human imitation” is sometimes used to refer to the outward behavior of the system (e.g. “imitation learning”, and in posts like “Just Imitate Humans?”), and sometimes to refer to the model of the human inside the system (e.g. here when you say “the human imitation is telling the system what to do”).
A system that is more capable than a human can still be a “human imitation”, because “human imitation” is being used in the sense of “models humans inside the system” rather than “has the outward behavior of a human”.
There is a distinction between the counterfactual training procedure vs. the resulting system. “Counterfactual oracle” (singular) seems to be used to refer to the resulting system, and Paul calls this “the system” in his “Human-in-the-counterfactual-loop” post. “Counterfactual oracles” (plural) is used both as the plural of the resulting system and as a label for the general training procedure. “Human-in-the-counterfactual-loop”, “counterfactual human oversight”, and “counterfactual oversight” all refer to the training procedure (but only when the procedure uses a model of the human).