I really liked the summarizing differences between distributions paper.
I think I’m excited for broadly the same reasons you are, but to state the case in my own words:
“ML systems learn things that humans don’t know from lots of data” seems like one of the central challenges in alignment.
Summarizing differences between distributions seems like a good model for that problem: it captures the basic dynamics, and the problem appears crisply even for very weak systems.
The solution you are pursuing seems like it could be scaled quite far.
I don’t think that “natural language hypotheses” can cover all or even most of the things that models can learn. But it seems important anyway:
It seems like the best first thing to do, and it seems great to start from there and scale up.
I think that even if natural language hypotheses can’t cover everything models can do, they could still go a very long way toward closing the gap between models and humans, and so greatly expand the regime where models can be safely overseen.
This paper focuses on applications where humans care about the summary per se, but over the long run I’m most interested in learning classifiers that generalize safely to data where we don’t have labels. Another way of putting this is that I’m excited about giving humans natural language hypotheses learned from large amounts of data as a way to extend the reach of process-based oversight.
It seems like it would make sense, and be valuable, for labs interested in alignment to start adopting this kind of tool in practice (with the same kinds of benefits as early adoption of RLHF or early forms of amplification/debate). I think that’s probably possible now or soon, though academic research could also push these tools much further, and it wouldn’t be crazy to wait on that.
In some not-too-distant future it would be really cool if a fine-tuning API could also tell me something like “Here is a natural language hypothesis; predicted human judgments using this hypothesis explain X% of the learned classifier’s performance.” I expect this would already catch a large number of bugs or spurious correlations and add value.
For mostly-subhuman systems I think X% might be very close to 100%. I think it might already often be worthwhile to aim to replace the opaque classifier with a prediction of human judgments given a natural-language hypothesis (e.g. in any case where training data coverage isn’t that great, humans have strong priors, or generalization errors can be very costly).
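(To make the “explain X% of the learned classifier’s performance” idea concrete, here is a minimal sketch of one way such a number could be computed. This is only an illustration under assumptions I’m making up, not anything from the paper or an existing API: a balanced binary task, accuracy as the performance measure, and a hypothetical function name `explained_performance`.)

```python
from typing import Sequence

def explained_performance(
    labels: Sequence[int],            # ground-truth labels (0/1)
    classifier_preds: Sequence[int],  # predictions of the opaque learned classifier
    hypothesis_preds: Sequence[int],  # predicted human judgments given the natural language hypothesis
) -> float:
    """Fraction of the classifier's above-chance accuracy that is matched by
    predictions derived from the natural language hypothesis alone."""
    n = len(labels)

    def acc(preds: Sequence[int]) -> float:
        return sum(int(p == y) for p, y in zip(preds, labels)) / n

    chance = 0.5  # assumes a balanced binary task
    clf_gain = acc(classifier_preds) - chance
    hyp_gain = acc(hypothesis_preds) - chance
    if clf_gain <= 0:
        return 0.0  # classifier is no better than chance; nothing to explain
    return max(0.0, min(1.0, hyp_gain / clf_gain))

# Toy example: the classifier is 90% accurate, the hypothesis-based
# predictions are 80% accurate, so they recover 75% of the above-chance gain.
labels           = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
classifier_preds = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
hypothesis_preds = [1, 0, 1, 1, 0, 0, 1, 1, 0, 0]
print(explained_performance(labels, classifier_preds, hypothesis_preds))  # ~0.75
```

On this reading, X% = 100% would mean that predictions made from the hypothesis alone recover all of the classifier’s above-chance accuracy.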
One caveat is that “summarizing differences between distributions” seems a bit too strong as a claim about what you’re doing. It seems like all you get is some hypothesis that does a good job of telling which of the two distributions a datapoint came from; it need not capture all or even most of the differences between the distributions.
I think that the motivation and approach have a lot of overlap with imitative generalization. But to clarify the relationship: my goal when writing about imitative generalization (or iterated amplification, relaxed adversarial training, etc.) is mostly to start thinking through the limits of those techniques and doing useful theoretical work as far in advance as possible, rather than to plant a flag or claim credit without actually doing the hard work of making things work in practice. I’m definitely happy if any of my writing causes people to work on these topics, but as far as I know that didn’t happen in this case.
Thanks! I pretty much agree with everything you said. This is also largely why I am excited about the work, and I think what you wrote captures it more crisply than I could have.