Do you have any objections/disagreements with this? Secondarily, if as a result of 1 and 2 I’m doing a fairly wide search and considering many plans, doesn’t it stop making sense at some point to say “they are not a grader-optimizer”?
I wrote in the post:
Value-child comes up with a hard-work plan as an effect of his cognition, not as a motivating cause—not because he only wants to come up with plans he himself will rate highly. He values working hard.
As a corollary, grader-optimization is not synonymous with planning. Grader-optimization is when high plan-evaluations are the motivating cause of planning, where “I found a plan which I think leads to diamond” is the terminal goal, and not just a side effect of cognition (as it is for value-child).
Sorry if I’m just repeating something you read and understood, but I do feel like this criterion answers “no, this is still not grader-optimization; the effective search over lots of plans is still a side effect of your cognition, not the terminal end.”
In particular, note that the strategy you described would not strongly want to be given the actual-highest-rated plan—or maybe it would want to know more about the plan as a curiosity, but not in order to evaluate and execute that plan. That’s one way in which saying “your strategy is not grader-optimization” constrains my anticipations in a useful-seeming way.
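To make the behavioral difference concrete, here’s a toy sketch (my own illustration, not from the post; all names are hypothetical, and the argmax-versus-satisficing contrast is only a rough proxy for the causal “motivating cause” distinction, which code can’t capture directly):

```python
# Toy sketch of the distinction above. Hypothetical names throughout;
# the grader here is a deliberately silly placeholder.

def grade(plan: str) -> float:
    """Stand-in plan-evaluation function."""
    return float(plan.count("work"))

def grader_optimizer(candidate_plans: list[str]) -> str:
    # High grader output is the terminal goal: the search is aimed at
    # whatever input maximizes the evaluation, so adversarial inputs
    # that fool `grade` win by construction.
    return max(candidate_plans, key=grade)

def value_child(candidate_plans: list[str]) -> str:
    # Evaluation is used inside cognition as a check, not as the thing
    # being maximized; this agent has no incentive to hunt for the
    # globally highest-rated plan.
    for plan in candidate_plans:
        if grade(plan) >= 1:  # "this plan involves actually working hard"
            return plan
    return "keep thinking"

plans = ["slack off", "work hard in school", "chant 'work' " + "work " * 50]
print(grader_optimizer(plans))  # picks the grader-gaming chant
print(value_child(plans))       # settles on the ordinary hard-work plan
```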
But isn’t 1 here at least as good as 2, since the CEV-universe-simulation could always compute X = [the program that would be recommended by AGI designers in an altruistic and superintelligent branch of humanity] and then return 1 iff input-plan = ‘run X then shut itself off without doing anything else’ (by doing a simple text match), 0 otherwise, so there’s no chance of adversarial inputs? Not to say this is a realistic way of getting an aligned AGI, just that your argument seems to be proving too much if it’s saying that 2 is safer/better than 1.
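For concreteness, here’s a minimal sketch of that construction, with the uncomputable part stubbed out (`simulate_cev` and the other names are hypothetical):

```python
# Minimal sketch of the construction above. `simulate_cev` stands in for
# the (wildly uncomputable) CEV-universe-simulation; names are hypothetical.

def simulate_cev() -> str:
    # Would compute X, the program recommended by AGI designers in an
    # altruistic and superintelligent branch of humanity. Stubbed here.
    return "X"

def grader(input_plan: str) -> int:
    x = simulate_cev()
    target = f"run {x} then shut itself off without doing anything else"
    # Simple text match: the grader never interprets the plan's semantics,
    # which is what's supposed to close off adversarial inputs.
    return 1 if input_plan == target else 0
```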
This is a good point. I’m wondering about the type of the (presumably Cartesian) interface between the CEV-sim and the actor. First, CEV-sim shouldn’t be affectable by the input-plan unless and until they run some computation on it; otherwise the actor could (maybe?) mount a side-channel attack via whatever computer registers the input-plan shows up in. And CEV-sim does have to infer what they’re being used for at each invocation of the grader (since they don’t retain memory across counterfactuals).
a. That aside, if CEV-sim can check the input-plan purely syntactically (the simple text match against the ‘run X then shut itself off’ string), then your argument seems good.
b. But if CEV-sim has to understand the actor’s latent state and context (is it all in the plan?) in order to make sure that the purported X-running plan isn’t just running some dangerous Y in another programming language, then it seems like they can’t do this.
I feel like we’re dealing with (b) more than (a), so I’d say “no, 2 is safer than 1” tentatively.
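To spell out the (a)/(b) gap in code terms (hypothetical names; the unimplementable stub in case (b) is the point):

```python
# The (a)/(b) gap in code terms. Hypothetical names; case (b)'s stub is the point.

TARGET = "run X then shut itself off without doing anything else"

def syntactic_check(input_plan: str) -> bool:
    # Case (a): decidable, trivial string equality.
    return input_plan == TARGET

def semantic_check(input_plan: str) -> bool:
    # Case (b): decide whether the plan *behaves* like running X and then
    # halting, even if it's expressed in another language, obfuscated, or
    # dependent on the actor's latent state. Behavioral properties of
    # arbitrary programs are undecidable in general (Rice's theorem), so
    # there is no general procedure to put here.
    raise NotImplementedError
```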