ESRogs comments on Learning the prior

ESRogs 6 Jul 2020 7:17 UTC
LW: 4 AF: 2
Ah, I see. It sounds like the key thing I was missing was that the strangeness of the prior only matters when you’re testing on a different distribution than you trained on. (And since you can randomly sample from x* when you solicit forecasts from humans, the train and test distributions can be considered the same.)
- Daniel Kokotajlo 6 Jul 2020 12:48 UTC
  LW: 7 AF: 4
  Is that actually true though? Why is that true? Say we are training the model on a dataset of N human answers, and then we are going to deploy it to answer 10N more questions, all from the same big pool of questions. The AI can’t tell whether it is in training or deployment, but it could decide to follow a policy of giving some sort of catastrophic answer with probability 1/(10N) per question, so that it will probably make it through training just fine and then still get to cause catastrophe.
  - paulfchristiano 8 Jul 2020 0:46 UTC
    LW: 6 AF: 3
    That’s right—you still only get a bound on average quality, and you need to do something to cope with failures so rare they never appear in training (here’s a post reviewing my best guesses).
    But before, you weren’t even in the game: it wouldn’t matter how well adversarial training worked, because you didn’t even have the knowledge to tell whether a given behavior was good or bad. You weren’t even getting the right behavior on average.
    (In the OP, I think the claim “the generalization is now coming entirely from human beliefs” is fine; I meant generalization from one distribution to another. “Neural nets are fine” was sweeping these issues under the rug. Note, though, that in the real world the distribution will change from neural net training to deployment; that is just exactly the normal robustness problem. The point of this post is to get it down to only a robustness problem that you could solve with some kind of generalization of adversarial training. The reason to set it up as in the OP was to make the issue clearer.)
  - evhub 6 Jul 2020 19:08 UTC
    LW: 6 AF: 3
    I agree with Daniel. Certainly training on actual iid samples from the deployment distribution helps a lot—as it ensures that your limiting behavior is correct—but in the finite data regime you can still find a deceptive model that defects some percentage of the time.
  - ESRogs 6 Jul 2020 18:25 UTC
    LW: 2 AF: 1
    This is a good question, and I don’t know the answer. My guess is that Paul would say that that is a potential problem, but different from the one being addressed in this post. Not sure though.
    - paulfchristiano 8 Jul 2020 0:57 UTC
      LW: 4 AF: 2
      Yeah, that’s my view.
      - ESRogs 8 Jul 2020 3:07 UTC
        2 points
        Thanks for confirming.
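
For readers who want to check the arithmetic in Daniel Kokotajlo's scenario above, here is a minimal sketch in Python. It is not from the thread. It assumes the model defects independently on each question with probability 1/(10N), and the function name `defection_odds` and its parameters are made up for illustration.

```python
import math


def defection_odds(n_train: int, deploy_multiplier: int = 10):
    """Return (P[no defection during training], P[at least one defection in deployment]).

    Assumption (not stated in the thread): the model defects independently
    on each question with probability 1 / (deploy_multiplier * n_train).
    """
    n_deploy = deploy_multiplier * n_train
    p_defect = 1.0 / n_deploy

    # Training looks clean only if none of the n_train answers is a defection.
    p_clean_training = (1.0 - p_defect) ** n_train

    # Deployment goes wrong if at least one of the n_deploy answers is a defection.
    p_defect_in_deployment = 1.0 - (1.0 - p_defect) ** n_deploy

    return p_clean_training, p_defect_in_deployment


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        clean, bad = defection_odds(n)
        print(f"N={n}: clean training {clean:.3f}, defection in deployment {bad:.3f}")

    # Large-N limits, for comparison.
    print(f"limits: {math.exp(-0.1):.3f}, {1 - math.exp(-1):.3f}")
```

Under these assumptions the model gets through training undetected about 90% of the time (the limit is e^(-0.1)) and defects at least once in deployment about 63% of the time (the limit is 1 - e^(-1)). This is the sense in which iid training data only bounds average quality, as Paul says in his reply.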