I think that grader-optimization is likely to fail catastrophically when the grader is (some combination of):
more like “built / specified directly and exogenously by humans or other simple processes”, less like e.g. “a more and more complicated grader getting gradually built up through some learning process as the space-of-possible-plans gets gradually larger”
more like “looking at the eventual consequences of the plan”, less like “assessing plans for deontology and other properties” (related post) (e.g. “That plan seems to pattern-match to basilisk stuff” could be a strike against a plan, but that evaluation is not based solely on the plan’s consequences.)
more like “looking through tons of wildly-out-of-the-box plans”, less like “looking through a white-list of a small number of in-the-box plans”
Maybe we agree so far?
But I feel like this post is trying to go beyond that and say something broader, and I think that’s where I get off the boat.
I claim that maybe there’s a map-territory confusion going on. In particular, here are two possible situations:
(A) Part of the AGI algorithm involves listing out multiple plans, and another part of the algorithm involves a “grader” that grades the plans.
(B) Same as (A), but also assume that the high-scoring plans involve a world-model (“map”), and somewhere on that map is an explicit (metacognitive / reflective) representation of the “grader” itself, and the (represented) grader’s (represented) grade outputs (within the map) are identical to (or at least close to) the actual grader’s actual grades within the territory.
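(To make the distinction concrete, here is a minimal toy sketch in Python. Every name in it, `WorldModel`, `actual_grader`, `model_of_grader`, and so on, is a hypothetical illustration I'm introducing for this comment, not anything from the post, and "diamonds" stands in for whatever object-level thing the designers care about.)

```python
# Toy sketch, purely illustrative; none of these names come from the post.

class WorldModel:
    """The AGI's map: predicts the outcome of a plan and (for situation B)
    also contains an explicit, reflective model of the grader itself."""

    def predict(self, plan):
        # Dummy prediction: a dict describing the predicted world.
        return {"diamonds": plan.count("diamond")}

    def model_of_grader(self, predicted_world):
        # The map's representation of the grader. By (B)'s assumption, its
        # outputs closely track the actual grader's outputs in the territory.
        return actual_grader(predicted_world)


def actual_grader(predicted_world):
    """The grader in the territory: grades a plan via its predicted outcome."""
    return predicted_world.get("diamonds", 0)


def situation_A(candidate_plans, world_model):
    """(A): list out candidate plans, grade each one, act on the argmax."""
    return max(candidate_plans,
               key=lambda plan: actual_grader(world_model.predict(plan)))


def situation_B(candidate_plans, world_model):
    """One way to cash out (B): the same outer loop as (A), but what the
    winning plans are "about" is the grade as represented on the map."""
    return max(candidate_plans,
               key=lambda plan: world_model.model_of_grader(
                   world_model.predict(plan)))
```

The point of the sketch is just that (A) is a claim about the outer loop, while (B) is a further claim about what the winning plans are about (the represented grade, rather than, say, diamonds).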
I feel like OP equivocates between these. When it’s talking about algorithms it seems to be (A), but when it’s talking about value-child and appendix C and so on, it seems to be (B).
In the case of people, I want to say that the “grader” is roughly “valence” / “the feeling that this is a good idea”.
I claim that (A), properly understood, should seem/feel almost tautological—like, it should be impossible to introspectively imagine (A) being false! It’s kinda the claim “People will do things that they feel motivated to do”, or something like that. By contrast, (B) is not tautological, or even true in general—it describes hedonists: “The person is thinking about how to get very positive valence on their own thoughts, and they’re doing whatever will lead to that”.
I think this is related to Rohin’s comment (“An AI system with a “direct (object-level) goal” is better than one with “indirect goals””)—the AGI has a world-model / map, its “goals” are somewhere on the map (inevitably, I claim), and we can compare the option of “the goals are in the parts of the map that correspond to object-level reality (e.g. diamonds)”, versus “the goals are in the parts of the map that correspond to a little [self-reflective] portrayal of the AGI’s own evaluative module (or some other represented grader) outputting a high score”. That’s the distinction between (not-B) and (B), respectively. But I think both options are equally (A).
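(In the spirit of the toy sketch above, that comparison might look like the following; again, every name here is a hypothetical I'm introducing, not anything from Rohin's comment or the post.)

```python
# Two places the "goal" could live on the map. Both plug into the same
# (A)-style loop; only the second is (B).

def object_level_goal(predicted_world):
    # (not-B): the goal reads off object-level parts of the map (e.g. diamonds).
    return predicted_world.get("diamonds", 0)

def reflective_goal(predicted_world):
    # (B): the goal reads off the map's little portrayal of the AGI's own
    # evaluative module (or some other represented grader) giving a high score.
    return predicted_world.get("my_graders_predicted_output", 0)

def choose_plan(goal, candidate_plans, predict):
    # The shared (A) structure: list plans, grade their predicted outcomes, argmax.
    return max(candidate_plans, key=lambda plan: goal(predict(plan)))
```

Both goals sit inside the same argmax loop, which is the sense in which both options are equally (A); they differ only in which region of the map the grade is read from.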
(Sidenote: There are obvious reasons to think that (A) might lead to (B) in the context of powerful model-based RL algorithms. But I claim that this is not inevitable. I think OP would agree with that.)
As I read your comment, I kept expecting to find the point where we disagreed, but… I didn’t really find one? I’m not saying “don’t have (A) in the training goal” nor am I saying “don’t let (A) be present in the AI’s mind.”