adamShimi comments on paulfchristiano’s Shortform

adamShimi 1 Jul 2021 16:07 UTC
LW: 2 AF: 1
AF
One aspect of this proposal which I don’t know how to do is evaluating the answers of the question-answerer. That looks to me very related to the deconfusion of universality that we discussed a few months ago, and without an answer to this, I feel like I don’t even know how to run this silly approach.
- paulfchristiano 2 Jul 2021 1:58 UTC
  LW: 4 AF: 3
  AF Parent
  You could imitate human answers, or you could ask a human “Is answer $A'$ much better than answer $A$?” Both of these only work for questions that humans can evaluate (in hindsight), and then the point of the scheme is to get an adequate generalization to (some) questions that humans can’t answer.
  - adamShimi 2 Jul 2021 14:57 UTC
    LW: 2 AF: 1
    AF Parent
    Ok, so you optimize the circuit both for speed and for small loss on human answers/comparisons, hoping that it generalizes to more questions while not being complex enough to be deceptive. Is that what you mean?
    - paulfchristiano 3 Jul 2021 18:29 UTC
      LW: 4 AF: 3
      AF Parent
      I’m mostly worried about parameter sharing between the human models in the environment and the QA procedure (which leads the QA to generalize like a human instead of correctly). You could call that deception but I think it’s a somewhat simpler phenomenon.
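
For concreteness, the objective discussed in this thread (small loss on human answers/comparisons, plus a penalty on speed) could be sketched roughly as follows. This is only an illustrative sketch, not anything specified in the comments: the function names, the logistic form of the comparison loss, the `compute_cost` measure, and the `speed_weight` trade-off are all assumptions made for the example.

```python
import math


def imitation_loss(log_prob_of_human_answer: float) -> float:
    """Negative log-likelihood the QA model assigns to the human's answer."""
    return -log_prob_of_human_answer


def comparison_loss(score_a: float, score_a_prime: float,
                    human_prefers_a_prime: bool) -> float:
    """Logistic loss on a human judgment of whether A' is better than A.

    Assumption: the QA model exposes a scalar score per answer, and the
    human label is binary. Neither detail comes from the thread.
    """
    margin = score_a_prime - score_a
    if not human_prefers_a_prime:
        margin = -margin
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin)
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def total_objective(imitation_losses, comparison_losses,
                    compute_cost: float, speed_weight: float = 0.1) -> float:
    """Loss on human data plus a (hypothetical) penalty on computation time."""
    data_loss = sum(imitation_losses) + sum(comparison_losses)
    return data_loss + speed_weight * compute_cost
```

Note that this objective only ever sees questions humans can evaluate, so nothing in it distinguishes a QA procedure that generalizes correctly from one that generalizes like a human, which is the concern raised in the last comment.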