I wish you wouldn’t use (IMO) vague, suggestive, proving-too-much, selection-flavored arguments, in favor of a more mechanistic analysis.
Can you name a way in which my arguments prove too much? That seems like a relatively concrete thing that we should be able to get agreement on.
You do not need an agent to have perfect values.
I did not claim (nor do I believe) the converse.
Many foundational arguments are about grader-optimization, so you can’t syntactically conclude “imperfect values means doom.” That’s true in the grader case, but not here.
I disagree that this is true in the grader case. You can have a grader that isn’t fully robust but is sufficiently robust that the agent can’t exploit any errors it would make.
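To make that claim concrete, here is a minimal toy sketch (every plan name, value, and error here is an illustrative assumption, not anything from the original exchange): the grader misjudges some inputs, but the plans it misgrades lie outside what the agent’s search can actually reach, so argmaxing the imperfect grader still lands on a genuinely good plan.

```python
# Toy illustration (not anyone's actual proposal): an imperfect grader
# whose errors the agent cannot exploit, because the misgraded plans lie
# outside the agent's reachable plan space.

# Hypothetical plan space: name -> (true value, grader error on that plan)
ALL_PLANS = {
    "mine_diamonds":       {"true_value": 10, "grader_error": 0},
    "buy_diamonds":        {"true_value": 8,  "grader_error": -1},
    "photograph_diamonds": {"true_value": 0,  "grader_error": 0},
    # An adversarial input the grader would wildly overrate...
    "hack_grader_input":   {"true_value": -5, "grader_error": +100},
}

# ...but the agent's search procedure can only propose these plans:
REACHABLE_PLANS = ["mine_diamonds", "buy_diamonds", "photograph_diamonds"]


def grader(plan: str) -> float:
    """Imperfect grader: true value plus a plan-specific error."""
    info = ALL_PLANS[plan]
    return info["true_value"] + info["grader_error"]


def choose_plan(reachable: list[str]) -> str:
    """The agent argmaxes the grader over the plans it can reach."""
    return max(reachable, key=grader)


if __name__ == "__main__":
    chosen = choose_plan(REACHABLE_PLANS)
    print(chosen, ALL_PLANS[chosen]["true_value"])
    # -> mine_diamonds, true value 10: the grader's huge error on
    #    "hack_grader_input" never matters, because that plan is not
    #    something the agent can find or propose.
```

The point of the sketch is only that “not fully robust” and “exploitable by this particular agent” come apart: the grader’s errors only cause trouble if they sit inside the region the agent’s optimization actually explores.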
If you drop out or weaken the influence of “IF plan can be easily modified to incorporate more diamonds, THEN do it”, that won’t necessarily mean the AI makes some crazy diamond-less universe.
The difficulty in instilling values is not that removing a single piece of the program/shard that encodes a value will destroy that value. The difficulty is that, when you were instilling the value, you accidentally rewarded a case where the agent tried a plan that produced pictures of diamonds (because you thought they were real diamonds), and now you’ve instilled a shard that upweights plans that produce pictures of diamonds. Or that you rewarded the agent for thoughts like “this will make pretty, transparent rocks” (which, at the time, did lead to plans that produced diamonds), instilling shards that upweight plans that produce pretty, transparent rocks, so that the agent later tiles the universe with clear quartz.
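A toy sketch of that contrast (the shards, weights, and plan features below are made up purely for illustration): deleting one conditional from a diamond-shard leaves an agent that still steers toward diamonds, whereas a mis-instilled picture-of-diamonds shard systematically bids for the wrong plans.

```python
# Toy illustration of the contrast above: dropping one clause from a
# diamond-shard vs. having accidentally instilled a picture-of-diamonds
# shard. All shards, weights, and plan features are invented.

PLANS = [
    {"name": "build_diamond_mine",   "real_diamonds": 5, "diamond_pictures": 0},
    {"name": "print_diamond_photos", "real_diamonds": 0, "diamond_pictures": 9},
    {"name": "do_nothing",           "real_diamonds": 0, "diamond_pictures": 0},
]


def diamond_shard(plan, include_modify_clause=True):
    """A shard made of several conditionals that bid for diamond-producing plans."""
    bid = 2.0 * plan["real_diamonds"]
    if include_modify_clause and plan["real_diamonds"] > 0:
        # "IF plan can be easily modified to incorporate more diamonds, THEN do it"
        bid += 1.0
    return bid


def diamond_picture_shard(plan):
    """A mis-instilled shard: upweights plans that produce *pictures* of diamonds."""
    return 2.0 * plan["diamond_pictures"]


def choose(shards):
    """Plans are ranked by the summed bids of the agent's shards."""
    return max(PLANS, key=lambda p: sum(s(p) for s in shards))


# Dropping one clause from the diamond-shard: the agent still picks the diamond plan.
print(choose([lambda p: diamond_shard(p, include_modify_clause=False)])["name"])
# -> build_diamond_mine

# Accidentally instilling a picture-of-diamonds shard: the agent now steers
# toward pictures instead of diamonds.
print(choose([lambda p: diamond_shard(p), diamond_picture_shard])["name"])
# -> print_diamond_photos
```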
The value shards are the things which optimize hard, by wielding the rest of the agent’s cognition (e.g. the world model, the general-purpose planning API).
So, I’m basically asking that you throw an error and recheck your “selection on imperfection → doom” arguments, as I claim many of these arguments reflect grader-specific problems.
I think that the standard arguments work just fine for arguing that “incorrect value shards → doom”, precisely because the incorrect value shards are the things that optimize hard.
(Here incorrect value shards means things like “the value shards put their influence towards plans producing pictures of diamonds” and not “the diamond-shard, but without this particular if clause”.)
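One way to make “the incorrect value shards are the things that optimize hard” concrete (a hypothetical sketch; the shard, world model, and planner here are stand-ins, not anything from the original exchange): when a mis-instilled shard’s criterion is what a powerful plan search optimizes, more search power pushes the chosen plan further toward that criterion’s literal optimum.

```python
# Hypothetical sketch: an incorrect value shard "wields" the agent's
# planner and world model by supplying the criterion that plan search
# optimizes, so more search power pushes outcomes further toward the
# shard's mistaken target.
import random

random.seed(0)


def world_model(plan):
    """Stand-in world model: predicts outcomes of a plan (made-up dynamics)."""
    effort_on_pictures, effort_on_mining = plan
    return {
        "diamond_pictures": 10 * effort_on_pictures,
        "real_diamonds": 3 * effort_on_mining,
    }


def incorrect_value_shard(outcome):
    """Mis-instilled shard: values pictures of diamonds, not diamonds."""
    return outcome["diamond_pictures"]


def planner(num_candidates):
    """General-purpose plan search: propose plans, keep the shard's favorite."""
    candidates = [(random.random(), random.random()) for _ in range(num_candidates)]
    best = max(candidates, key=lambda plan: incorrect_value_shard(world_model(plan)))
    return world_model(best)


for search_power in (10, 1_000, 100_000):
    print(search_power, planner(search_power))
# As search power grows, the selected outcome contains ever more diamond
# pictures, and real diamonds only to whatever extent they happen to come
# along: the optimization pressure tracks the incorrect shard's criterion.
```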
It is extremely relevant [...]
This doesn’t seem like a response to the argument in the paragraph that you quoted; if it was meant to be one, then I’ll need you to rephrase it.