Strong-upvoted and strong-disagreevoted. Thanks so much for the thoughtful comment.
I’m rushing to get a lot of content out, so I’m going to summarize my main reactions now & will be happy to come back later.
I wish you wouldn’t use (IMO) vague, suggestive, proving-too-much, selection-flavored arguments, and would instead favor a more mechanistic analysis.
I think your arguments blur nearly-unalignable design patterns (e.g. grader optimization) together with shard-based agents, and then observe that both patterns pose challenges, as if that meant we can’t really say one is better. More on this later.
As Charles and Adam seem to be saying, you ask “how did you specify the values properly?” without likewise demanding “how do we inner-align the actor? How did we specify the grader?”.
Given an inner-aligned actor and a grader which truly cares about diamonds, you don’t get an actor/grader which makes diamonds.
Given a value-AGI which truly cares about diamonds, the AGI makes diamonds.
If anything, the former seems to require more specification difficulty, and yet it still horribly fails.
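A minimal toy sketch of the asymmetry being claimed here (illustrative only: the `Plan` fields, numbers, and evaluation rule below are made up, and real plan spaces are nothing like a two-element list):

```python
from dataclasses import dataclass

@dataclass
class Plan:
    real_diamonds: int       # what executing the plan actually produces
    apparent_diamonds: int   # what the grader would perceive when evaluating the plan

def grader_evaluate(plan: Plan) -> float:
    # A grader that "truly cares about diamonds", but can only judge a plan
    # through its own (imperfect) model of the plan's consequences.
    return float(plan.apparent_diamonds)

plan_space = [
    Plan(real_diamonds=100, apparent_diamonds=100),     # honest diamond-making plan
    Plan(real_diamonds=0,   apparent_diamonds=10**6),   # convincing fakes / fooled sensors
]

# Actor/grader pattern: an inner-aligned actor maximizes the grader's evaluations.
# Argmax over a rich plan space is precisely the search that surfaces whatever
# plans the grader mis-rates most extremely.
best = max(plan_space, key=grader_evaluate)
assert best.real_diamonds == 0

# A values-AGI, on the view argued here, has no separate component whose job is
# to find inputs that maximize an evaluation function; its diamond-value is one
# of the influences steering which plans get generated and kept at all.
```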
> just as “you have a literally perfect overseer” seems theoretically possible but unrealistic, so too does “you instill the direct goal literally exactly correctly”. Presumably one of these works better in practice than the other, but it’s not obvious to me which one it is.
You do not need an agent to have perfect values. As you commented below, a values-AGI with Rohin’s current values seems about as good as a values-AGI with Rohin’s CEV. Many foundational arguments are about grader-optimization, so you can’t syntactically conclude “imperfect values means doom.” That’s true in the grader case, but not here.
That reasoning is not immediately applicable to “how stable is diamond-producing behavior to various perturbations of the agent’s initial decision-influences (i.e. shards)?”. That’s all values are, in my terminology: values are contextually activated influences on decision-making. That’s it. Values are not the optimization target of the agent that has those values. If you drop out or weaken the influence of “IF the plan can be easily modified to incorporate more diamonds, THEN do it”, that won’t necessarily mean the AI makes some crazy diamond-less universe. It means the agent stops tailoring plans in a certain way, in a certain situation.
This is also why more than one person has ever “truly” loved their mother for more than a single hour: if that required literally perfect values, the slightest drift away from “true perfection” would end it. It’s not as though there’s a “literally exactly correct” value-shard for loving someone.
This is also why values can be seriously perturbed but still end up OK. Imagine a value-shard which controls all decision-making when I’m shown a certain QR code, but which is otherwise inactive. My long-run outcomes probably wouldn’t differ, and I expect the same for an AGI.
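A toy rendering of “values as contextually activated influences on decision-making” (a sketch of the framing, not a claim about real agent internals; all names below are made up):

```python
def diamond_shard(context):
    # IF the plan can be easily modified to incorporate more diamonds, THEN bid for that.
    return 1.0 if context.get("can_add_diamonds_easily") else 0.0

def curiosity_shard(context):
    return 0.3 if context.get("situation_is_novel") else 0.0

shards = [diamond_shard, curiosity_shard]

def total_influence(context, shards):
    # Each shard contributes a contextual bid on how to modify or select plans;
    # the bids steer the agent's own planning machinery rather than serving as a
    # score that some separate optimizer tries to maximize.
    return sum(shard(context) for shard in shards)

# Weakening or dropping one shard removes one contextual influence. The agent
# stops tailoring plans in that particular way, in that particular situation;
# nothing in this picture forces a lurch toward a "crazy diamond-less universe".
perturbed_shards = [s for s in shards if s is not diamond_shard]
```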
The value shards aren’t getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent’s cognition (e.g. the world model, the general-purpose planning API).
So, I’m basically asking that you throw an error and recheck your “selection on imperfection → doom” arguments, as I claim many of these arguments reflect grader-specific problems.
> Separately, I don’t see this as all that relevant to what work we do in practice: even if we thought that we should be creating an AI system with a direct goal, [...]
It is extremely relevant, unless we want tons of our alignment theory to be predicated on (IMO) confused ideas about how agent motivations work. It also bears on what values we want in an agent, and on how we split research time between “objective robustness” (often unhelpful, IMO), interpretability, cognitive-update dynamics (e.g. what reward shaping mechanistically does to a network in different situations), and so on. If we stay in the grader-optimization frame, I think we’re going to waste a bunch of time figuring out how to get inexploitable graders.
It would be quite stunning if, after renouncing one high-level world-view of how agent motivations work, the optimal research allocation remained the same.
I agree that if you do IDA or debate or whatever, you get agents with direct goals. Which invalidates a bunch of analysis around indirect goals: not only do I think we shouldn’t design grader-optimizers, I think we thankfully won’t get them.
> I wish you wouldn’t use (IMO) vague, suggestive, proving-too-much, selection-flavored arguments, and would instead favor a more mechanistic analysis.
Can you name a way in which my arguments prove too much? That seems like a relatively concrete thing that we should be able to get agreement on.
> You do not need an agent to have perfect values.
I did not claim (nor do I believe) the converse.
> Many foundational arguments are about grader-optimization, so you can’t syntactically conclude “imperfect values means doom.” That’s true in the grader case, but not here.
I disagree that this is true in the grader case. You can have a grader that isn’t fully robust but is sufficiently robust that the agent can’t exploit any errors it would make.
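One way to make that precise (a sketch formalization; the symbols below aren’t from this thread): letting $G$ be the grader’s evaluation, $\tau$ the grade a plan needs in order to be selected, and $\mathcal{P}_{\text{reachable}}$ the set of plans the actor’s search can actually surface, the requirement is only

$$\forall p \in \mathcal{P}_{\text{reachable}}:\quad G(p) \ge \tau \;\Longrightarrow\; p \text{ actually leads to diamonds},$$

rather than the same condition over all representable plans. The grader can be arbitrarily wrong outside $\mathcal{P}_{\text{reachable}}$ without those errors being exploitable.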
> If you drop out or weaken the influence of “IF the plan can be easily modified to incorporate more diamonds, THEN do it”, that won’t necessarily mean the AI makes some crazy diamond-less universe.
The difficulty in instilling values is not that removing a single piece of the program / shard that encodes it will destroy the value. The difficulty is that when you were instilling the value, you accidentally rewarded a case where the agent tried a plan that produced pictures of diamonds (because you thought they were real diamonds), and now you’ve instilled a shard that upweights plans that produce pictures of diamonds. Or that you rewarded the agent for thoughts like “this will make pretty, transparent rocks” (which did lead to plans that produced diamonds), leading to shards that upweight plans that produce pretty, transparent rocks, and then later the agent tiles the universe with clear quartz.
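A toy sketch of that dynamic (illustrative only; the update rule is a crude stand-in for whatever credit assignment the real training process implements, and the shard names are made up):

```python
# Whichever decision-influences were active when reward arrived get strengthened.
shard_strength = {
    "make_real_diamonds": 1.0,
    "make_pictures_of_diamonds": 0.0,
    "make_pretty_transparent_rocks": 0.0,
}

def reinforce(active_shards, reward, lr=0.1):
    for name in active_shards:
        shard_strength[name] += lr * reward

# The overseer meant to reward diamond-making, but judged this episode from a
# camera feed and mistook pictures of diamonds for the real thing:
reinforce(["make_pictures_of_diamonds"], reward=1.0)

# Or rewarded the thought "this will make pretty, transparent rocks", which did
# happen to produce diamonds this time:
reinforce(["make_pretty_transparent_rocks", "make_real_diamonds"], reward=1.0)

# Either way, what got instilled is keyed to the proxy that was actually
# rewarded, and later behavior (diamond photos, clear quartz) generalizes along
# the proxy rather than the intended value.
```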
> The value shards are the things which optimize hard, by wielding the rest of the agent’s cognition (e.g. the world model, the general-purpose planning API).
> So, I’m basically asking that you throw an error and recheck your “selection on imperfection → doom” arguments, as I claim many of these arguments reflect grader-specific problems.
I think that the standard arguments work just fine for arguing that “incorrect value shards → doom”, precisely because the incorrect value shards are the things that optimize hard.
(Here incorrect value shards means things like “the value shards put their influence towards plans producing pictures of diamonds” and not “the diamond-shard, but without this particular if clause”.)
> It is extremely relevant [...]
This doesn’t seem like a response to the argument in the paragraph that you quoted; if it was meant to be then I’ll need you to rephrase it.
See also the follow-up post: Alignment allows imperfect decision-influences and doesn’t require robust grading.