Rohin Shah comments on Don’t align agents to evaluations of plans

Rohin Shah 29 Nov 2022 15:13 UTC
Sounds to me like you buy it but you don’t know anything else to do?
Yes, and I think direct-goal approaches do not avoid the issue. In particular, I can make an analogous claim for them:
“From the perspective of the human-AI system overall, having an AI motivated by direct goals is building a system that works at cross purposes with itself, as the human puts in constant effort to ensure that the direct goal embedded in the AI is “hardened” to represent human values as well as possible, while the AI is constantly searching for upwards-errors in the instilled values (i.e. things that score highly according to the instilled values but lowly according to the human).”
Like, once you broaden to the human-AI system overall, I think this claim is just “A principal-agent problem / Goodhart problem involves two parts of a system working at cross purposes with each other”, which is both (1) true and (2) unavoidable (I think).
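A toy sketch of the "upwards-errors" framing may make this concrete. Everything here is an illustrative assumption rather than part of the original discussion: the plan names, the numbers, and the decomposition of the instilled score into a human-side value plus an error term are all made up. It only shows that selecting the highest-scoring plan under the instilled values tends to select the plan with the largest upward error, judged from the human's outside perspective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    human_value: float  # hypothetical score from the human's perspective
    proxy_error: float  # hypothetical gap between instilled values and the human

    @property
    def instilled_score(self) -> float:
        # What the AI's instilled values (or grader) assign to the plan.
        return self.human_value + self.proxy_error


# Made-up plans and numbers, purely for illustration.
PLANS = [
    Plan("make diamonds carefully", human_value=8.0, proxy_error=-0.5),
    Plan("make diamonds quickly", human_value=6.0, proxy_error=0.5),
    Plan("exploit a flaw in the diamond concept", human_value=1.0, proxy_error=9.0),
    Plan("do nothing", human_value=0.0, proxy_error=0.0),
]


def pick_by_instilled_score(plans):
    """Plan selection as seen from inside the AI: maximize the instilled score."""
    return max(plans, key=lambda p: p.instilled_score)


def pick_by_human_value(plans):
    """Plan selection as seen from the human's side of the system."""
    return max(plans, key=lambda p: p.human_value)


chosen = pick_by_instilled_score(PLANS)
preferred = pick_by_human_value(PLANS)

print("AI selects:   ", chosen.name)
print("Human prefers:", preferred.name)
```

Note that `proxy_error` is only defined relative to the human's scores, which is the point of the claim above: the "cross purposes" description needs the human-AI system as a whole, not the AI's cognition alone.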
It seems like you could describe this as “the AI’s plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack”. So what’s the difference between this issue and the issue with grader optimization?
1. Values-execution. Diamond-evaluation error-causing plans exist and are stumble-upon-able, but the agent wants to avoid errors.
2. Grader-optimization. The agent seeks out errors in order to maximize evaluations.
The part of my response that you quoted is arguing for the following claim:
If you are analyzing the AI system in isolation (i.e. not including the human), I don’t see an argument that says [grader-optimization would violate the non-adversarial principle] and doesn’t say [values-execution would violate the non-adversarial principle].
As I understand it you are saying “values-execution wants to avoid errors but grader-optimization does not”. But I’m not seeing it. As far as I can tell the more correct statements are “agents with metacognition about their grader / values can make errors, but want to avoid them” and “it is a type error to talk about errors in the grader / values for agents without metacognition about their grader / values”.
(It is a type error in the latter case because what exactly are you measuring the errors with respect to? Where is the ground truth for the “true” grader / values? You could point to the human, but my understanding is that you don’t want to do this and instead just talk about only the AI cognition.)
For reference, in the part that you quoted, I was telling a concrete story of a values-executor with metacognition, and saying that it too had to “harden” its values to avoid errors. I do agree that it wants to avoid errors. I’d be interested in a concrete example of a grader-optimizer with metacognition that doesn’t want to avoid errors in its grader.
Like, in what sense does Bill not want to avoid errors in his grader?
I don’t mean that Bill from Scenario 2 in the quiz is going to say “Oh, I see now that actually I’m tricking myself about whether diamonds are being created, let me go make some actual diamonds now”. I certainly agree that Bill isn’t going to try making diamonds, but who said he should? What exactly is wrong with Bill’s desire to think that he’s made a bunch of diamonds? Seems like a perfectly coherent goal to me; it seems like you have to appeal to some outside-Bill perspective that says that actually the goal was making diamonds (in which case you’re back to talking about the full human-AI system, rather than the AI cognition in isolation).
What I mean is that Bill from Scenario 2 might say “Hmm, it’s possible that if I self-modify by sticking a bunch of electrodes in my brain, then it won’t really be me who is feeling the accomplishment of having lots of diamonds. I should do a bunch of neuroscience and consciousness research first to make sure this plan doesn’t backfire on me”.