Counterfactual Oracles = online supervised learning with random selection of training episodes
Most people here probably already understand this by now, so this post is mainly to prevent new people from getting confused about the point of Counterfactual Oracles (in the ML setting), since there's no top-level post that explains it clearly at a conceptual level. Paul Christiano does have a blog post titled Counterfactual oversight vs. training data, which talks about the same thing as this post, except that he uses the term "counterfactual oversight", which is just Counterfactual Oracles applied to human imitation (which he proposes to use to "oversee" some larger AI system). But the fact that he doesn't mention "Counterfactual Oracle" makes it hard for people to find that post or to see the connection between it and Counterfactual Oracles. And as long as I'm writing a new top-level post, I might as well try to explain the idea in my own words. The second part of this post lists some remaining problems with oracles/predictors that are not solved by Counterfactual Oracles.
Without further ado, I think that when the Counterfactual Oracle is translated into the ML setting, it has four essential characteristics:
Supervised training—This is safer than reinforcement learning because we don’t have to worry about reward hacking (i.e., reward gaming and reward tampering), and it eliminates the problem of self-confirming predictions (which can be seen as a form of reward hacking). In other words, if the only thing that ever sees the Oracle’s output during a training episode is an automated system that computes the Oracle’s reward/loss, and that system is secure because it’s just computing a simple distance metric (comparing the Oracle’s output to the training label), then reward hacking and self-confirming predictions can’t happen. (ETA: See this comment for how I’ve changed my mind about this.)
Independent labeling of data—This is usually taken for granted in supervised learning but perhaps should be explicitly emphasized here. To prevent self-confirming predictions, the labeling of data has to be done without causal influence from the Oracle. That is, the Oracle’s output should be isolated in some protected system until it’s time to compute the reward/loss, so that it can’t affect how the training label is generated. But note that it’s perfectly fine for humans or other systems to look at the question/input that is given to the Oracle, in order to generate the training label.
Online learning—The Oracle never stops learning, so it can eventually adjust to any distributional shift. (But note that an Oracle implemented using current ML techniques might perform very badly, possibly in a malign way, while it’s still adjusting.)
Random selection of training episodes—I’ll just quote Paul here: “Counterfactual oversight consists of labelling a random subset of data and using it as online training data. The key difference is that any given data point may become a training data point, with the decision made after the learning system has made a decision about it. As long as the randomization is unpredictable to the learner, this gives us a formal guarantee that there can’t be any noticeable difference between the training and test data. And therefore if our learner behaves well on training data, it really must behave well on test data.” (Note that what Stuart Armstrong calls “erasure” just means that the current episode has been selected as a training episode.)
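Putting the four characteristics together, the training setup can be sketched as a toy loop (a minimal sketch under made-up assumptions: the `Oracle` class, its running-mean predictor, and the label-generating `true_process` are all hypothetical stand-ins, not anyone's actual proposal):

```python
import random

random.seed(0)  # for reproducibility of this sketch

class Oracle:
    """Toy predictor: tracks a running mean of the labels it has seen."""
    def __init__(self):
        self.estimate = 0.0
        self.n_updates = 0

    def predict(self, question):
        return self.estimate

    def update(self, label):
        # Online learning: the Oracle never stops adjusting to new labels.
        self.n_updates += 1
        self.estimate += (label - self.estimate) / self.n_updates

def run_episode(oracle, question, true_process, p_train=0.1):
    prediction = oracle.predict(question)
    # Random selection of training episodes, unpredictable to the learner.
    if random.random() < p_train:
        # "Erasure": the prediction stays in a protected buffer and is never
        # shown to anyone, so it cannot causally influence the label.
        label = true_process(question)        # independent labeling
        loss = (prediction - label) ** 2      # simple distance metric
        oracle.update(label)                  # supervised training
        return None, loss                     # output withheld from users
    return prediction, None                   # test episode: output is used

oracle = Oracle()
for t in range(1000):
    run_episode(oracle, question=t, true_process=lambda q: 42.0)
```

The point of the sketch is that nothing downstream of the prediction ever feeds back into the label on training episodes, and because selection is random and unpredictable, good behavior on training episodes carries over to test episodes.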
Remaining Safety Problems
Counterfactual Oracles solve (or are a proposal to solve) some safety problems associated with predictors/oracles, but others remain. Besides the distributional shift problem mentioned above, here are a few more that come to mind. Note that these are not problems specific to Counterfactual Oracles, but Stuart Armstrong’s Safe Uses of AI Oracles seems wrong or misleading when it says “This paper presented two Oracle designs which are both safe and useful.”
Benign Induction—This is a concern that the Oracle might come to believe (perhaps correctly in some sense) that we probably live in a simulation or Zoo Earth, and that at some specific point the alien simulation runners or zoo keepers will intervene in the simulation or zoo to change the training signal for the Oracle, causing the Oracle (on both fake and real Earths) to output something that for example causes our world to be taken over by the aliens.
Reward Takeover on Real Earth—The Oracle may predict a future in which its training data / reward signal comes under the control of a hostile agent that will arise or already exists on the real Earth (e.g., an unaligned AI or an unfriendly government), and this prediction causes it to output something that benefits that hostile agent at the expense of its current user.
Safe Approximation—If the Oracle is not able to exactly predict something, could its attempt at approximation cause safety problems? For example if we use it to predict human behavior, could the approximate prediction be unsafe because it ends up predicting a human-like agent with non-human values?
Inner alignment—The ML training process may not produce a model that actually optimizes for what we intend for it to optimize for (namely minimizing loss for just the current episode, conditional on the current episode being selected as a training episode).
(I just made up the names for #2 and #3 so please feel free to suggest improvements for them.)
This assumption seems large, and I don’t understand why we can make it. The loss is computed against the output of a physical system in the real world which is not, in fact, causally independent of the oracle’s predictions. It is not Platonically graded on how correct it is with respect to what we had in mind. If the proposal is “just make the grading system really hard to hack”, that runs into the same problems as boxing—you’re still running a system that is looking for ways to hurt you.
Are you saying that even if we store the Oracle’s output in some memory buffer with minimal processing, and then compute the loss against the independently generated training label using a simple distance metric, the Oracle could still hack the grading system to give itself a zero loss? Hmm, maybe, but doesn’t the risk of that seem a lot smaller than if these precautions are not taken (i.e., if we’re using reinforcement learning or SL but generating training labels after looking at the Oracle’s output)? Or are these risks actually comparable in magnitude in your mind? Or are you saying that the risk is still unacceptably large and we should reduce it further if we can?
Gradient descent won’t optimize for this behavior though, it really seems like you want to study this under inner alignment. (It’s hard for me to see how you can meaningfully consider the problems separately.)
Yes, if the oracle gives itself zero loss by hacking the grading system then it will stop being updated, but the same is true if the mesa-optimizer tampers with the outer training process in any other way, or just copies itself to a different substrate, or whatever.
Here’s my understanding and elaboration of your first paragraph, to make sure I understand it correctly and to explain it to others who might not:
What we want the training process to produce is a mesa-optimizer that tries to minimize the actual distance between its output and the training label (i.e., the actual loss). We don’t want a mesa-optimizer that tries to minimize the output of the physical grading system (i.e., the computed loss). The latter kind of model will hack the grading system if given a chance, while the former won’t. However these two models would behave identically in any training episode where reward hacking doesn’t occur, so we can’t distinguish between them without using some kind of inner alignment technique (which might for example look at how the models work on the inside rather than just how they behave).
If we can solve this inner alignment problem then it fully addresses TurnTrout’s (Alex Turner’s) concern, because the system would no longer be “looking for ways to hurt you.”
Hmm, actually this still doesn’t fully address TurnTrout’s (Alex Turner’s) concern, because this mesa-optimizer could try to minimize the actual distance between its output and the training label by changing the training label (what that means depends on how the training label is defined within its utility function). To do that it would have to break out of the box that it’s in, which may not be possible, but this is still a system that is “looking for ways to hurt you.” It seems that what we really want is a mesa-optimizer that tries to minimize the actual loss while pretending that it has no causal influence on the training label (even if it actually does because there’s a way to break out of its box).
This seems like a harder inner alignment problem than I thought, because we have to make the training process converge upon a rather unnatural kind of agent. Is this still a feasible inner alignment problem to solve, and if not is there another way to get around this problem?
[EDIT (2019-11-09): I no longer think that the argument I made here—about a theoretical learning algorithm—seems to apply to common practical learning algorithms; see here (H/T Abram for showing me that my reasoning was wrong).]
If the trained model tries to minimize loss in future episodes, it definitely seems dangerous, but I’m not sure that we should consider this an inner-alignment failure. In some sense we got the behavior that our episodic learning algorithm was optimizing for.
For example, consider the following episodic learning algorithm: At the end of each episode, if the model failed to achieve the episode’s goal its network parameters are completely randomized (and if it achieves the goal, the model is unchanged). If we run this learning algorithm for an arbitrarily long time, we should expect to end up with a model that behaves in a way that results in achieving the goal in every future episode (if such a model exists).
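This randomize-on-failure learner can be sketched concretely (a toy illustration under made-up assumptions: the "model" is a single scalar parameter, and the per-episode goal alternates between two difficulty levels):

```python
import random

random.seed(0)  # deterministic toy run

def achieves_goal(params, episode):
    # Toy per-episode goal: the parameter must exceed the episode's
    # difficulty, which alternates between 0.3 and 0.8.
    difficulty = 0.8 if episode % 2 else 0.3
    return params > difficulty

params = random.random()  # random initial "model"
for episode in range(10_000):
    if not achieves_goal(params, episode):
        params = random.random()  # failure: completely re-randomize

# The only stable fixed points of this process are parameter settings that
# achieve the goal in *every* recurring episode type, so the surviving
# model effectively optimizes cross-episode (not per-episode) performance.
assert params > 0.8
```

Note that the update rule only ever looks at the current episode's outcome, yet a model persists only by doing well in all future episodes too, which is why this selection pressure effectively ignores episode boundaries.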
Interesting… it seems that this doesn’t necessarily happen if we use online gradient descent instead, because the loss gradient (computed for a single episode) ought to lead away from model parameters that would increase the loss for the current episode and reduce it for future episodes. Is that right, and if so, how can we think more generally about what kinds of learning algorithms will produce episodic optimizers vs cross-episodic optimizers?
Also, what name would you suggest for this problem, if not “inner alignment”? (“Inner alignment” actually seems fine to me, but maybe I can be persuaded that it should be called something else instead.)
I call this problem “non-myopia,” which I think interestingly has both an outer alignment component and an inner alignment component:
If you train using something like population-based training that explicitly incentivizes cross-episode performance, then the resulting non-myopia was an outer alignment failure.
Alternatively, if you train using standard RL/SL/etc. without any PBT, but still get non-myopia, then that’s an inner alignment failure. And I find this failure mode quite plausible: even if your training process isn’t explicitly incentivizing non-myopia, it might be that non-myopic agents are simpler/more natural/easier to find/etc. such that your inductive biases still incentivize them.
Oh, so even online gradient descent could generate non-myopic agents with large (or non-negligible) probability because non-myopic agents could be local optima for “current episode performance” and their basins of attraction collectively could be large (or non-negligible) compared to the basins of attraction for myopic agents. So starting with random model parameters one might well end up at a non-myopic agent through online gradient descent. Is this an example of what you mean?
Thinking about this more, this doesn’t actually seem very likely for OGD since there are likely to be model parameters controlling how farsighted the agent is (e.g., its discount rate or planning horizon) so it seems like non-myopic agents are not local optima and OGD would keep going downhill (to more and more myopic agents) until it gets to a fully myopic agent. Does this seem right to you?
I don’t think that’s quite right. At least if you look at current RL, it relies on the existence of a strict episode boundary past which the agent isn’t supposed to optimize at all. The discount factor is only per-step within an episode; there isn’t any between-episode discount factor. Thus, if you think that simple agents are likely to care about things beyond just the episode that they’re given, then you get non-myopia. In particular, if you put an agent in an environment with a messy episode boundary (e.g. it’s in the real world such that its actions in one episode have the ability to influence its actions in future episodes), I think the natural generalization for an agent in that situation is to keep using something like its discount factor past the artificial episode boundary created by the training process, which gives you non-myopia.
Hmm, I guess I was mostly thinking about non-myopia in the context of using SL to train a Counterfactual Oracle, which wouldn’t necessarily have steps or a non-zero discount factor within an episode. It seems like the easiest way for non-myopia to arise in this context is if the Oracle tries to optimize across episodes using a between-episode discount factor or just a fixed horizon. But as I argued, this doesn’t seem to be a local minimum with regard to current-episode loss, so it seems like OGD wouldn’t stop here but would keep optimizing the Oracle until it’s no longer non-myopic.
I’m pretty confused about the context that you’re talking about, but why not also have a zero per-step discount factor to try to rule out the scenario you’re describing, in order to ensure myopia?
ETA: On the other hand, unless we have a general solution to inner alignment, there are so many different ways that inner alignment could fail to be achieved (see here for another example) that we should probably just try to solve inner alignment in general and not try to prevent specific failure modes like this.
I agree, my reasoning above does not apply to gradient descent (I misunderstood this point before reading your comment).
I think it still applies to evolutionary algorithms (which might end up being relevant).
Maybe learning algorithms that have the following property are more likely to yield models with “cross-episodic behavior”:
During training, a parameter’s value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future episodes.
Maybe “non-myopia” as Evan suggested.
When I talk about an episodic learning algorithm, I usually mean one that actually optimizes performance within an episode (like most of the algorithms in common use today, e.g. empirical risk minimization treating episode initial conditions as fixed). The algorithm you described doesn’t seem like an “episodic” learning algorithm, given that it optimizes total performance (and essentially ignores episode boundaries).
(This comment has been heavily edited after posting.)
What’s an algorithm, or instructions for a human, for determining whether a learning algorithm is “episodic” or not? For example it wasn’t obvious to me that Ofer’s algorithm isn’t episodic and I had to think for a while (mentally simulate his algorithm) to see that what he said is correct. Is there a shortcut to figuring out whether a learning algorithm is episodic without having to run or simulate the algorithm? You mention “ignores episode boundaries” but I don’t see how to tell that Ofer’s algorithm ignores episode boundaries since it seems to be just looking at the current episode’s performance when making a decision.
How do you even tell that an algorithm is optimizing something?
In most cases we have some argument that an algorithm is optimizing the episodic reward, and it just comes down to the details of that argument.
If you are concerned with optimization that isn’t necessarily intended and wondering how to more effectively look out for it, it seems like you should ask “would a policy that has property P be more likely to be produced under this algorithm?” For P=”takes actions that lead to high rewards in future episodes” the answer is clearly yes, since any policy that persists for a long time necessarily has property P (though of course it’s unclear if the algorithm works at all). For normal RL algorithms there’s not any obvious mechanism by which this would happen. It’s not obvious that it doesn’t, until you prove that these algorithms converge to optimizing per-episode rewards. I don’t see any mechanical way to test that (just like I don’t see any mechanical way to test almost any property that we talk about in almost any argument about anything).
So when you wrote “When I talk about an episodic learning algorithm, I usually mean one that actually optimizes performance within an episode (like most of the algorithms in common use today, e.g. empirical risk minimization treating episode initial conditions as fixed).” earlier, you had in mind that most of the algorithms in common use today have already been proven to converge to optimizing per-episode rewards? If so, I didn’t know that background fact and misinterpreted you as a result. Can you or someone else please explicitly confirm or disconfirm this for me?
Yes, most of the algorithms in use today are known to converge or roughly converge to optimizing per-episode rewards. In most cases it’s relatively clear that there is no optimization across episode boundaries (by the outer optimizer).
There’s a similar issue with less extreme requirements:
Imagine there’s a circumstance in which the variable you want to predict can be affected by predictions. Fortunately, you were smart enough to use a counterfactual oracle. Unfortunately, you weren’t the only person who had this idea, and absent coordination to use the same RNG (in the same way), rather than learning only from episodes they don’t influence and thus never learning to make “manipulative predictions”, the oracles instead learn from each other (because even when one oracle’s output is erased, the other oracles’ outputs aren’t) and eventually make manipulative predictions.
I vaguely agree with this concern but would like a clearer understanding of it. Can you think of a specific example of how this problem can happen?
I don’t have much in the way of a model of “manipulative predictions”—they’ve been mentioned before as a motivation for counterfactual oracles.
I think the original example was, there’s this one oracle that everyone has access to and believes, and it says “company X’s stock is gonna go way down by the end of today” and because everyone believes it, it happens.
In a similar fashion, I can imagine multiple people/groups trying to independently create (their own) “oracles” for predicting the (stock) market.
I think the following is potentially another remaining safety problem:
[EDIT: actually it’s an inner alignment problem, using the definition here]
[EDIT2: i.e. using the following definitions from the above link:
“we will use base objective to refer to whatever criterion the base optimizer was using to select between different possible systems and mesa-objective to refer to whatever criterion the mesa-optimizer is using to select between different possible outputs. ”
“We will call the problem of eliminating the base-mesa objective gap the inner alignment problem”.
]
Assuming the oracle cares only about minimizing the loss in the current episode—as defined by a given loss function—it might act in a way that will cause the invocation of many “luckier” copies of itself (ones that, with very high probability, output a value that gets the minimal loss, e.g. by “magically” finding that value stored somewhere in the model, or by running on very reliable hardware). In this scenario, the oracle does not intrinsically care about the other copies of itself; it just wants to maximize the probability that the current execution is one of those “luckier” copies.
Episodic learning algorithms will still penalize this behavior if it appears on the training distribution, so it seems reasonable to call this an inner alignment problem.
Ah, I agree (edited my comment above accordingly).
I am having trouble parsing/understanding this part.
The linked post by Paul doesn’t seem to talk about human imitation. Is there a separate post/comment somewhere that connects counterfactual oversight to human imitation, or is the connection to human imitation somehow implicit in the linked post?
The linked post by Paul seems to be talking about counterfactual oversight as a way to train the counterfactual oracle, but I’m parsing your sentence as saying that there is then a further step where the counterfactual oracle is used to oversee a larger AI system (i.e. the “oversee” in your sentence is different from the “oversight” in “counterfactual oversight”). Is this right?
“Counterfactual oversight” isn’t really explained in the post I linked to, but rather in the first post that post links to, titled Human-in-the-counterfactual-loop. (The link text is “counterfactual human oversight”. Yeah having all these different names is confusing, but these three phrases are referring to the same thing.) The key part of that post is this:
So the “what it thinks a human would have told it to do if the human had been consulted” is the human imitation part, because it’s predicting what a human would do, which is the same as imitating a human. And then the oversight part is that the human imitation is telling the system what to do. (My “oversee” is referring to this “oversight”.)
I hope that clears things up?
Thanks! I think I understand this now.
I will say some things that occurred to me while thinking more about this, and hope that someone will correct me if I get something wrong.
“Human imitation” is sometimes used to refer to the outward behavior of the system (e.g. “imitation learning”, and in posts like “Just Imitate Humans?”), and sometimes to refer to the model of the human inside the system (e.g. here when you say “the human imitation is telling the system what to do”).
A system that is more capable than a human can still be a “human imitation”, because “human imitation” is being used in the sense of “modeling humans inside the system” instead of “has the outward behavior of a human”.
There is a distinction between the counterfactual training procedure vs the resulting system. “Counterfactual oracle” (singular) seems to be used to refer to the resulting system, and Paul calls this “the system” in his “Human-in-the-counterfactual-loop” post. “Counterfactual oracles” (plural) is used both as a plural version of the resulting system and also as a label for the general training procedure. “Human-in-the-counterfactual-loop”, “counterfactual human oversight”, and “counterfactual oversight” all refer to the training procedure (but only when the procedure uses a model of the human).