You could imitate human answers, or you could ask a human “Is answer A′ much better than answer A?” Both of these only work for questions that humans can evaluate (in hindsight), and then the point of the scheme is to get an adequate generalization to (some) questions that humans can’t answer.
Ok, so you optimize the circuit both for speed and for small loss on human answers/comparisons, hoping that it generalizes to more questions while not being complex enough to be deceptive. Is that what you mean?
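To make that objective concrete, here is a minimal sketch of what "optimize the circuit both for speed and for small loss on human answers/comparisons" could look like as a training loss. Everything here is an assumption for illustration, not something specified in the discussion: `circuit` stands for any parameterized QA model, the hypothetical `score` and `speed_cost` methods stand in for an answer-scoring head and for whatever complexity/runtime measure a speed prior penalizes, and the weighting is arbitrary.

```python
import torch.nn.functional as F

def imitation_loss(circuit, questions, human_answers):
    """Cross-entropy against human-written answers (questions humans can answer)."""
    logits = circuit(questions)                     # (batch, seq, vocab), assumed interface
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        human_answers.reshape(-1),
    )

def comparison_loss(circuit, questions, answer_a, answer_a_prime):
    """Bradley-Terry-style loss on human judgments that A' is much better than A."""
    score_a = circuit.score(questions, answer_a)            # hypothetical scoring head
    score_a_prime = circuit.score(questions, answer_a_prime)
    # The human preferred A', so push its score above A's.
    return -F.logsigmoid(score_a_prime - score_a).mean()

def total_loss(circuit, batch, speed_weight=1e-3):
    """Small loss on human answers/comparisons, plus a penalty for large/slow circuits."""
    loss = imitation_loss(circuit, batch["questions"], batch["human_answers"])
    loss = loss + comparison_loss(
        circuit, batch["questions"], batch["answer_a"], batch["answer_a_prime"]
    )
    # speed_cost() is a placeholder for a differentiable proxy such as gate count,
    # depth, or FLOPs; the speed prior enters as a penalty on this quantity.
    return loss + speed_weight * circuit.speed_cost()
```

The hope described above is then that minimizing this combined loss yields a circuit that generalizes to questions humans can't evaluate, while the speed penalty keeps it too simple to be deceptive.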
I’m mostly worried about parameter sharing between the human models in the environment and the QA procedure (which leads the QA procedure to generalize the way a human would rather than generalizing correctly). You could call that deception, but I think it’s a somewhat simpler phenomenon.