I think you may have some confusion along the lines I was discussing here:
I claim that maybe there’s a map-territory confusion going on. In particular, here are two possible situations:
(A) Part of the AGI algorithm involves listing out multiple plans, and another part of the algorithm involves a “grader” that grades the plans.
(B) Same as (A), but also assume that the high-scoring plans involve a world-model (“map”), and somewhere on that map is an explicit (metacognitive / reflective) representation of the “grader” itself, and the (represented) grader’s (represented) grade outputs (within the map) are identical to (or at least close to) the actual grader’s actual grades within the territory.
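To make the (A)/(B) distinction concrete, here's a minimal toy sketch (Python, with a made-up scoring rule; every name below is a hypothetical illustration, not a claim about how an actual AGI algorithm would be built):

```python
from dataclasses import dataclass, field


def actual_grader(plan: str) -> float:
    """The grader in the territory: assigns a grade to a plan (toy rule)."""
    return float(len(plan))  # stand-in for any real evaluation procedure


def represented_grader(plan: str) -> float:
    """The grader as represented on a plan's map. In situation (B), its
    (represented) grades closely match the actual grader's actual grades."""
    return actual_grader(plan)  # identical, for simplicity


@dataclass
class Plan:
    description: str
    # The plan's world-model ("map"); empty in the plain (A)-style case.
    world_model: dict = field(default_factory=dict)


def propose_plans() -> list[Plan]:
    """Part 1 of the algorithm: list out multiple candidate plans."""
    # Plans whose maps never mention the grader (compatible with not-B):
    plain = [Plan("fetch the coffee"), Plan("write the report")]
    # A plan whose map contains an explicit, reflective model of the grader
    # (situation (B)):
    reflective = Plan(
        "craft an output that the grader will score highly",
        world_model={"grader_model": represented_grader},
    )
    return plain + [reflective]


def select_plan(plans: list[Plan]) -> Plan:
    """Part 2 of the algorithm: the actual grader grades the plans."""
    return max(plans, key=lambda p: actual_grader(p.description))


if __name__ == "__main__":
    best = select_plan(propose_plans())
    print(f"chosen plan: {best.description!r}")
    print(f"grader represented on the chosen plan's map: {'grader_model' in best.world_model}")
```

In this toy setup, both (A) and (B) share the same outer loop (propose plans, grade them, pick the argmax); the difference is only whether the winning plan's map happens to contain a model of the grader whose grades track the actual grader's grades.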
[I wasn’t sure when I first wrote that comment, but Alex Turner clarified that he was exclusively talking about (B) not (A) when he said “Don’t align agents to evaluations of plans” and such.]
The brain algorithm fits (A) (or so I claim), but that’s compatible with either (B) or (not-B), depending on what happens during training etc.
Thank you! Then I indeed misunderstood Alex Turner's claim, and with my new understanding of it, I basically agree.