I think there’s something like “why are human values so ‘reasonable’, such that [TurnTrout inference alert!] one person can like coffee and another can dislike it, without that meaning they would extrapolate into bitter enemies until the end of Time?”, and the answer seems like it’s going to be that people don’t have one criterion of Perfect Value that is exactly right, over which they argmax. Rather, they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you’re near a diamond, …), such that removing a single subshard does not catastrophically exit the regime of Perfect Value.
I think this is one proto-intuition for why Goodhart arguments seem Wrong to me: they feel like they come from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don’t think I’ve properly communicated my feelings in this comment, but hopefully it’s better than nothing.)
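A minimal toy sketch of the contrast I have in mind (purely illustrative; the plan space, noise levels, and subshard stand-ins below are all made up): one agent argmaxes a single slightly-wrong crisp proxy, another averages the votes of many noisy subshard heuristics, and then we delete a single subshard to see whether the choice changes.

```python
# Toy sketch only: a single slightly-wrong crisp proxy that gets argmax'd,
# versus many weak, noisy "subshard" heuristics whose votes are averaged.
# All quantities here are made up for illustration.
import random

random.seed(0)
DIM, N_PLANS, N_SHARDS, NOISE = 10, 500, 300, 0.7

# Candidate plans: feature vectors the agent could pursue.
plans = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_PLANS)]

# "True" values, unknown to either agent, used only to grade the outcome.
true_w = [random.uniform(0, 1) for _ in range(DIM)]

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Agent A: one crisp criterion = true values plus estimation error, argmax'd exactly.
proxy_w = [wi + random.gauss(0, NOISE) for wi in true_w]
plan_a = max(plans, key=lambda p: score(proxy_w, p))

# Agent B: hundreds of subshards, each an independently noisy heuristic;
# plans are chosen by the average of all subshard votes.
shards = [[wi + random.gauss(0, NOISE) for wi in true_w] for _ in range(N_SHARDS)]

def avg_vote(p, shard_set):
    return sum(score(s, p) for s in shard_set) / len(shard_set)

plan_b = max(plans, key=lambda p: avg_vote(p, shards))

# Goodhart-style gap: the crisp argmax exploits its proxy's errors, so it tends
# to do noticeably worse on the true values than the subshard ensemble does.
print("true value of crisp-proxy argmax:", round(score(true_w, plan_a), 2))
print("true value of subshard choice:  ", round(score(true_w, plan_b), 2))

# Robustness: delete one subshard and re-decide; the chosen plan is almost
# always unchanged, i.e. no catastrophic exit from the good-plan regime.
plan_b_minus_one = max(plans, key=lambda p: avg_vote(p, shards[1:]))
print("same plan after dropping a subshard:", plan_b == plan_b_minus_one)
```

The numbers don’t matter, only the shape: a value made of many overlapping contributors degrades gracefully when one is removed, while a single exactly-argmaxed criterion is only as good as its errors allow.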
My intuition is that in order to go beyond imitation learning and random exploration, we need some sort of “iteration” system (à la IDA), and the cases of such systems that we know of tend to either literally be argmax planners with crisp utility functions, or to run into similar problems.
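To gesture at what I mean by an “iteration” system, here is a hedged sketch of the generic amplify-then-distill loop (the `policy`, `candidate_actions`, `evaluate`, and `distill` arguments are hypothetical stand-ins, not anything from the post); note that the amplification step is exactly where an explicit argmax against a crisp evaluation enters.

```python
# Hedged sketch of the generic amplify-then-distill iteration pattern
# (a la IDA / AlphaZero-style training). All arguments are hypothetical
# stand-ins supplied by the caller.

def iterate(policy, candidate_actions, evaluate, distill, rounds):
    """Repeatedly amplify the current policy with search, then distill it."""
    for _ in range(rounds):
        # Amplification: the current policy plus search, choosing actions by
        # argmaxing a crisp evaluation -- this is where the "argmax planner
        # with a crisp utility function" flavour shows up.
        def amplified(state, policy=policy):
            return max(candidate_actions(policy, state),
                       key=lambda a: evaluate(policy, state, a))
        # Distillation: train a cheaper policy to imitate the amplified one.
        policy = distill(policy, amplified)
    return policy
```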
What about this post?
Well, you’re obviously pretraining using imitation learning, so that part is accounted for.
If I understand your post right, the rest of the policy training is done by policy gradients on human-provided rewards? As I understand it, policy gradient is close to a maximally sample-hungry method, because it does not do any modelling. At one level I would class this as random exploration, but at another level the humans are allowed to provide reinforcement based on methods rather than results, so I suppose this also gives it an element of imitation learning.
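Concretely, by “close to maximally sample-hungry” I mean something like vanilla REINFORCE: sample an action, get a scalar reward from the human rater, nudge the policy logits, and never build a model of the environment or of the rater. A minimal sketch, with a made-up action set and a stand-in rater function:

```python
# Minimal REINFORCE sketch: a softmax policy over a few toy actions, updated
# only from scalar rewards supplied by a stand-in "human rater". Nothing about
# the environment or the rater is modelled, which is why it needs so many samples.
import math
import random

random.seed(0)

ACTIONS = ["explain", "deflect", "make_up_answer"]
theta = [0.0] * len(ACTIONS)  # policy logits

def policy_probs(logits):
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def human_reward(action):
    # Stand-in for a human rater reinforcing methods they approve of.
    return {"explain": 1.0, "deflect": 0.2, "make_up_answer": -1.0}[action]

LR = 0.1
for _ in range(2000):  # thousands of samples for a three-action problem
    probs = policy_probs(theta)
    a = random.choices(range(len(ACTIONS)), weights=probs)[0]
    r = human_reward(ACTIONS[a])
    # REINFORCE update: theta_j += lr * r * d(log pi(a))/d(theta_j)
    #                 = lr * r * (1[j == a] - pi_j) for a softmax policy.
    for j in range(len(theta)):
        theta[j] += LR * r * ((1.0 if j == a else 0.0) - probs[j])

# The policy typically ends up concentrating most of its mass on "explain".
print([round(p, 2) for p in policy_probs(theta)])
```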
So I guess my expectation is that your training method is too sample-inefficient to achieve much beyond human imitation.