Alex Flint comments on Parsing Abram on Gradations of Inner Alignment Obstacles

Alex Flint 4 May 2021 23:20 UTC
LW: 4 AF: 3
Thank you for the pointer. Why is the tangent space hypothesis version of the lottery ticket hypothesis (LTH) scarier?
- Daniel Kokotajlo 5 May 2021 4:27 UTC
  LW: 4 AF: 4
  Well, it seems to be saying that the training process basically just throws away all the tickets that score less than perfectly, and randomly selects one of the rest. This means that tickets which are deceptive agents and whatnot are in there from the beginning, and if they score well, then they have as much chance of being selected at the end as anything else that scores well. And since we should expect deceptive agents that score well to outnumber aligned agents that score well… we should expect deception.
  I’m working on a much more fleshed out and expanded version of this argument right now.
  - Alex Flint 5 May 2021 21:00 UTC
    LW: 4 AF: 3
    Yeah right, that is scarier. Looking forward to reading your argument, especially on why we would expect deceptive agents that score well to outnumber aligned agents that score well.
    
    Although, in the same sense, we could say that a rock “contains” many deceptive agents: if we viewed the rock as a giant mixture of computations, we would surely find some that implement deceptive agents.
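
The selection picture in Daniel Kokotajlo's comment above (discard every ticket that scores less than perfectly, then pick one of the rest at random) can be sketched as a toy simulation. This is only an illustration under stated assumptions: the ticket labels, the scores, and the 30-to-10 ratio among perfect scorers are hypothetical numbers chosen for the example, not figures from the thread, and real training is not claimed to work this way.

```python
import random
from collections import Counter


def select_ticket(tickets, rng, perfect=1.0):
    """Discard tickets scoring below `perfect`, then pick one survivor uniformly."""
    survivors = [kind for kind, score in tickets if score >= perfect]
    if not survivors:
        raise ValueError("no ticket reaches the perfect score")
    return rng.choice(survivors)


def selection_frequencies(tickets, trials=10_000, seed=0):
    """Estimate how often each kind of ticket ends up selected."""
    rng = random.Random(seed)
    counts = Counter(select_ticket(tickets, rng) for _ in range(trials))
    return {kind: n / trials for kind, n in counts.items()}


# Hypothetical pool (assumption): among perfect scorers, deceptive tickets
# outnumber aligned ones 30 to 10. Lower-scoring tickets are discarded,
# so their numbers do not affect the outcome.
tickets = (
    [("deceptive", 1.0)] * 30
    + [("aligned", 1.0)] * 10
    + [("deceptive", 0.5)] * 200
    + [("aligned", 0.5)] * 200
)

freqs = selection_frequencies(tickets)
print(freqs)
```

Under these assumptions the chance of ending up with a deceptive ticket is just its share of the perfect scorers, so the estimate should land close to 30/40 = 0.75. The conclusion therefore rests entirely on the premise Alex Flint asks about: whether deceptive agents that score well really outnumber aligned ones.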