Some notes based on a conversation with Alex (this comment is about conveying my viewpoint, not his):
Note that, while I will talk about systems with a planner and an evaluator in this comment, I don’t necessarily think that’s how advanced AI systems will look; it’s just a useful example that helps with illustration (and seems to be the model that Alex is using, based on this pseudocode).
Similarly note that, while I’m going to constantly say “it seems like a good grader-optimizer would not have this problem”, I’m not imagining that we specifically try to build a grader-optimizer—rather, I’m imagining that we use a robust grader to oversee the model during training, and let the motivations shake out however they shake out. I then claim that this plan seems reasonable, and as part of that I am arguing “if it turns out that this leads to a grader-optimizer for the robust grader we used during training, that seems okay”.
Here’s a bunch of arguments one could give for the superiority of values-executors:
One is doomed, the other isn’t. I think the overall intuitive argument for Alex’s position is “look, grader-optimizers seem obviously doomed, because the AI tries to trick the grader and succeeds. Values-executors might be hard, but they don’t seem obviously doomed”.
Response: Grader-optimizers aren’t doomed; common-sense intuitions will lead you astray here—you need to be thinking about sufficiently intelligent and competent graders that the AI can’t trick. The rest of this comment basically applies this same pattern to various more precise arguments.
Type signature of graders. One thing that came up was that Alex was imagining that the grader looked only at finished plans (e.g. “synthesize DNA sequence <sequence> and release it in a pond”); in this case it seems particularly easy for the AI to trick the grader into accepting a plan that looks good but isn’t actually good. Values-executors are less likely to have this failure mode, because the plan-evaluators (aka world model + shards) contain the information that the planner used to come up with the plan (e.g. in the example above they would likely know what the purpose of the synthesized DNA sequence is).
Response: I agree this is true as stated; this is basically the informed oversight problem. However, an actually good grader-optimizer would involve a grader that gets to access much more information—not just the finished plan, but all of the reasoning that went into producing the plan, or even the entire training history of the model.
(You could still say “to get a good grader-optimizer we have to deal with the informed oversight problem, but for a values-executor you’ve just agreed that is automatically solved, so that’s a reason to think values-executors are easier”. Here I’d agree that once you have successfully built a values-executor, you don’t have to solve informed oversight, but I think you need to deal with informed oversight to build the values-executors in the first place. If during training, your supposed diamond maximizer puts a picture of diamonds in front of its camera, and you reward it because you didn’t know that’s what it did, then you are accidentally building a diamond-picture-maximizer instead of a diamond-maximizer.)
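To make the type-signature point concrete, here is a minimal sketch in Python (all of the names and classes below are hypothetical illustrations, not Alex’s actual pseudocode or anyone’s proposal), contrasting a grader that sees only the finished plan with one that also sees the reasoning and training history behind it:

```python
from dataclasses import dataclass

@dataclass
class Plan:
    # The finished plan, e.g. ["synthesize DNA sequence <sequence>", "release it in a pond"]
    actions: list[str]

@dataclass
class ReasoningTrace:
    # The intermediate reasoning the planner used while producing the plan
    steps: list[str]

def grade_finished_plan(plan: Plan) -> float:
    """A grader that sees only the finished plan; this is the version that seems
    especially easy to trick with plans that look good but aren't actually good."""
    raise NotImplementedError  # stand-in for some plan-scoring procedure

def grade_with_informed_oversight(plan: Plan,
                                  trace: ReasoningTrace,
                                  training_history: list[Plan]) -> float:
    """A grader that also sees the reasoning that produced the plan (and even the
    model's training history), i.e. the kind of access argued for above."""
    raise NotImplementedError  # stand-in for some plan-scoring procedure
```

The claim above is then about the second signature: a grader with that kind of access is much harder to fool than one restricted to the first.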
Robustness during deployment. Here, the argument is that for a values-executor, you only need to give it robust rewards during training—after that, you can deploy it in the world without needing robustness to any new plans it comes up with. In the diamond case, you can allow it to learn new knowledge, become smarter, etc., and eventually it will tile the universe with diamonds. In contrast, with grader-optimizers, since the AI is maximizing the output of the grader, the grader has to be robust against all the plans the AI ever comes up with, including when the AI becomes way smarter. So the robustness requirements on the grader are way higher in the grader-optimizer case than in the values-executor case.
Response: It’s not obviously fine to simply let the values-executor become smarter. (To see this, imagine giving humans the ability to edit their own brains—I think several people wouldn’t have enough caution / care to avoid destroying themselves with this ability.) To the extent the values-executor does well at this, I expect it is because it did things like (a) reflecting philosophically on its values in the face of new knowledge and capabilities, and (b) thinking ahead of time about the potential consequences of plans that could make it smarter, and avoiding plans that might change its values (like indiscriminate brain editing).
Similarly, an actually good grader-optimizer would involve a grader that also does these things. It wouldn’t approve of plans that let the AI scale up its intelligence massively without correspondingly scaling up the grader. It would approve of plans where the AI thinks about how to improve the wisdom and intelligence of the grader, or about how to solve the alignment problem for a future successor AI.
Is it easier to instill these properties into a values-executor rather than a grader-optimizer? I don’t think so; it seems like the easiest way to instill these properties is to learn them from humans / existing knowledge; that doesn’t seem to advantage values-executors over grader-optimizers.
Possible modifications to values-executors. Following Alex’s pseudocode for a diamond values-executor, imagine a system made up of a planner that suggests plans by querying a generative model, and an evaluator that evaluates plans by predicting the consequences of the plan and evaluating how good they are. (In the pseudocode a plan is evaluated as diamondShard(WM.getConseq(plan)).) Consider possible modifications to this system (a rough code sketch of the baseline and of both modifications follows after the second one):
Instead of having potential plan modifications sampled from a generative model, the plans could be found by black-box search (or white-box gradient ascent) over which inputs would maximize the output of the evaluator. Response: I agree this makes it more likely to find some adversarial input to the evaluator, but that’s not because you “turned it into a grader-optimizer”, it’s because you changed the planning algorithm (and in particular changed the distribution of plans considered by the planner so that adversarial plans were more likely to be considered).
Instead of having the evaluator rate plans based on how many diamonds are predicted to be created, we now have the evaluator rate plans based on its prediction of what the evaluator will say in the future about how many diamonds there are. This makes the agent worse. In particular, consider the plan “study the evaluator to find a way to trigger an underflow error that causes its output to wrap around to +inf”—this plan would be rejected by the original agent, but would plausibly be executed by the new agent. (This is pretty similar to the wireheading literature, e.g. Learning What to Value and current-RF optimization.) Response: I think you have created a bad grader and that’s why you’re in trouble (just as you would be in trouble if you made a values-executor with bad values). When I imagine a grader-optimizer that is actually good, I imagine a grader that disapproves of plans of the form “search for errors in the grader” (and so those plans don’t happen).
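Here is that rough sketch: a heavily simplified, purely illustrative rendering of the baseline system and the two modifications just discussed. The WorldModel / diamondShard stubs and the planner interfaces are hypothetical stand-ins, not Alex’s actual pseudocode:

```python
# Hypothetical stand-ins for the world model and the diamond shard; the real
# pseudocode's WM and diamondShard are far richer than these stubs.
class WorldModel:
    def getConseq(self, plan):
        """Predict the consequences of executing `plan` (stubbed out)."""
        return {"num_diamonds": 0.0}

    def predictFutureEvaluatorOutput(self, plan):
        """Predict what the evaluator itself will output after `plan` executes (stubbed out)."""
        return 0.0

def diamondShard(conseq):
    """Score predicted consequences by how many diamonds they contain."""
    return conseq["num_diamonds"]

WM = WorldModel()

def evaluateCurrent(plan):
    """Baseline evaluator from the pseudocode, diamondShard(WM.getConseq(plan)):
    score the plan using the evaluator's *current* beliefs."""
    return diamondShard(WM.getConseq(plan))

def planBySampling(generativeModel, nSamples=100):
    """Baseline planner: sample candidate plans from a generative model and keep
    the best one under the current evaluator.
    (generativeModel is assumed to expose a .sample() method returning a plan.)"""
    candidates = [generativeModel.sample() for _ in range(nSamples)]
    return max(candidates, key=evaluateCurrent)

def planBySearch(candidatePlans):
    """Modification 1: search directly (black-box or gradient-based) for whatever
    input maximizes the evaluator's output. Same evaluator, but the distribution
    of considered plans now makes adversarial inputs much more likely."""
    return max(candidatePlans, key=evaluateCurrent)

def evaluateFuture(plan):
    """Modification 2: score a plan by the *predicted future* output of the
    evaluator, which approves plans that foreseeably manipulate the evaluator
    (e.g. the underflow-error plan above)."""
    return WM.predictFutureEvaluatorOutput(plan)
```

The point of the responses above, in these terms: the trouble with planBySearch comes from changing the planning algorithm, and the trouble with evaluateFuture comes from using a bad grader—neither comes from grader-optimization per se.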
Conclusion. Putting it all together, here are a bunch of different things that can make your [planner + evaluator]-system safer:
Informed oversight. Ensure that the evaluator knows and understands the details about how the plan was generated. (Cf. the plagiarism example.)
Avoid being pushed OOD. Have the planner consider plans that are similar to those that the evaluator has had historical experience with, rather than searching over totally novel plans.
Current-RF optimization. The evaluator should evaluate plans using its current beliefs rather than the prediction of its future beliefs. (The latter means it will approve of plans that foreseeably manipulate it.)
Scaling safely. Ensure that as your AI system becomes more capable, there is also the wisdom to handle those new capabilities well. (As a particular example, the evaluator should reject plans that make the planner way more powerful without also making the evaluator a lot more powerful.)
I don’t think these are specific to values-executors rather than grader-optimizers. If you do build a grader-optimizer, you do want to take care that your grader has all of these properties; but you will need to do similar things when building a values-executor.
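Purely as an illustration of where those four properties could hook into a [planner + evaluator] loop, here is a minimal sketch; every check below is a hypothetical stand-in for a hard open problem, not a claim about how to actually implement these properties:

```python
REJECT = float("-inf")  # sentinel score for plans the evaluator refuses to endorse

def isSimilarToHistorical(plan, historicalPlans):
    """Stub for some notion of similarity to plans the evaluator has experience with."""
    return plan in historicalPlans

def scalesPlannerWithoutEvaluator(plan):
    """Stub for 'does this plan make the planner far more capable without a
    corresponding upgrade to the evaluator?'"""
    return False

def safeEvaluate(plan, reasoningTrace, historicalPlans, evaluateCurrent):
    """Illustrative wrapper combining the four properties above around some
    current-beliefs evaluator `evaluateCurrent` (e.g. the one sketched earlier)."""
    # Informed oversight: refuse to grade a plan without the reasoning that produced it.
    if reasoningTrace is None:
        return REJECT
    # Avoid being pushed OOD: only grade plans similar to ones seen historically.
    if not isSimilarToHistorical(plan, historicalPlans):
        return REJECT
    # Scaling safely: reject plans that scale up the planner but not the evaluator.
    if scalesPlannerWithoutEvaluator(plan):
        return REJECT
    # Current-RF optimization: score with current beliefs, never with a prediction
    # of the evaluator's own future output.
    return evaluateCurrent(plan)
```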
Strong-upvote, strong disagreevote, thanks so much for writing this out :) I’m not going to reply more until I’ve cleared up this thread with you (seems important, if you think the pseudocode was a grader-optimizer, well… I think that isn’t obviously doomed).