raters make systematic errors—regular, compactly describable, predictable errors.
If I understand the tentative OpenAI plan correctly (at least the one mentioned in the recent paper; I’m not sure if it’s the Main Plan or just one of several being considered), the idea is to use various yet-to-be-worked-out techniques to summon certain concepts (concepts like “is this sentence true?” and “is this chain-of-thought a straightforward obedient one or a sneaky deceptive one?”) into a reward model, and then use the reward model to train the agent. So the hope is that if the first step works, we no longer have the “systematic errors” problem, and instead have a “perfect” reward system.
How do we get step one to work? That’s the hard part & that’s what the recent paper is trying to make progress on. I think. Part of the hope is that our reward model can be made from a pretrained base model, which hopefully won’t be situationally aware.
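Here is a rough sketch of that two-step shape, just to make the structure concrete. Nothing in it comes from the paper; the class names, the placeholder scoring rule, and the update rule are all made-up stand-ins.

```python
# Toy sketch of the two-step plan: (1) a reward model meant to capture
# concepts like "is this true" / "is this chain-of-thought deceptive",
# (2) an agent trained against that reward model. All names and logic
# are illustrative placeholders, not anything from the paper.

class RewardModel:
    """Step one: ideally built from a pretrained base model and made to
    track the intended concepts."""

    def score(self, response: str) -> float:
        # Placeholder: a real reward model would output a learned scalar.
        return float(len(response) % 7) / 7.0


class Agent:
    """Step two: the policy we actually want to align."""

    def __init__(self) -> None:
        self.knob = 1.0  # stand-in for the agent's parameters

    def respond(self, prompt: str) -> str:
        return prompt + " ... (agent's answer)"  # placeholder generation

    def update(self, reward: float) -> None:
        # Stand-in for an RLHF-style policy update.
        self.knob *= 0.99 if reward > 0.5 else 1.01


def train_agent(agent: Agent, rm: RewardModel, prompts: list[str]) -> None:
    # Once training looks like this, the agent's only supervision is the
    # reward model's scores, so any systematic rater errors have to enter
    # through step one (building the reward model), not here.
    for prompt in prompts:
        reward = rm.score(agent.respond(prompt))
        agent.update(reward)


train_agent(Agent(), RewardModel(), ["Is the sky green?"])
```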
Does the “rater problem” (raters make systematic errors) then simply apply to step one of this plan? I agree that once you have a perfect reward model, you no longer need human raters.
But it seems like the “rater problem” still applies if we’re going to train the reward model using human feedback. Perhaps I’m too anchored to thinking about things in an RLHF context, but it seems like at some point in the process we need to have some way of saying “this is true” or “this chain-of-thought is deceptive” that involves human raters.
Is the idea something like:
Eliezer: Human raters make systematic errors.
OpenAI: Yes, but this is only a problem if we have human raters provide feedback indefinitely. If human raters are expected to provide feedback on 10,000,000 responses under time pressure, then surely they will make systematic errors.
OpenAI: But suppose we could train a reward model on a modest number of responses, without the time pressure. For this dataset of, say, 10,000 responses, we are super careful, we get a ton of people to double-check that everything is accurate, and we are nearly certain that every single label is correct. If we train a reward model on this dataset, and we can get it to generalize properly, then we can get past the “humans make systematic errors” problem.
Or am I totally off, or is the idea different from this, or would the “yet-to-be-worked-out techniques” involve getting the reward model to learn this stuff without ever needing feedback from human raters?
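To make that hypothetical concrete, here is a toy version of the “small, carefully labeled dataset” setup. Everything in it is made up for illustration: synthetic features, a logistic-regression stand-in for the reward model, and an arbitrary “harder” distribution; the real proposal concerns large language models.

```python
# Toy version of the hypothetical above: train a "reward model" on 10,000
# labels assumed to be perfectly correct, then ask whether it generalizes
# to a harder / shifted distribution. Everything here (features, labels,
# the logistic-regression model) is synthetic and purely illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 16
w_true = rng.normal(size=dim)  # the "true" concept we hope to capture

# The carefully curated dataset: labels assumed 100% correct.
X_careful = rng.normal(size=(10_000, dim))
y_careful = (X_careful @ w_true > 0).astype(int)

reward_model = LogisticRegression(max_iter=1000).fit(X_careful, y_careful)

# The open question is whether this generalizes to the harder distribution
# the agent will actually produce; here "harder" just means shifted.
X_hard = rng.normal(loc=0.5, size=(100_000, dim))
y_hard = (X_hard @ w_true > 0).astype(int)
print("accuracy on the harder distribution:", reward_model.score(X_hard, y_hard))
```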
Short answer: The core focus of the “yet to be worked out techniques” is to figure out the “how do we get it to generalize properly” part, not the “how do we be super careful with the labels” part.
Longer answer: We can consider weak-to-strong generalization as actually being two different subproblems:
1. generalizing from correct labels on some easy subset of the distribution (the 10,000 super-careful, definitely-100%-correct labels)
2. generalizing from labels which can be wrong and are more correct on easy problems than hard problems, but we don’t exactly know when the labels are wrong (literally just normal human labels)
The setting in the paper doesn’t quite distinguish between the two, but I personally think the former problem is more interesting and contains the bulk of the difficulty. Namely, most of the difficulty is in understanding when generalization happens/fails and what kinds of generalizations are more natural.
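A toy contrast between the two subproblems, under heavy simplifying assumptions (synthetic “easy” and “hard” distributions, logistic-regression students, and invented weak-labeler error rates):

```python
# Subproblem 1: gold labels, but only on the easy subset.
# Subproblem 2: labels everywhere, but noisier on hard examples, and we
# don't know which ones are wrong. All numbers below are arbitrary.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 16
w_true = rng.normal(size=dim)

def make_data(n, hard):
    # "Hard" examples are just drawn from a shifted distribution here.
    X = rng.normal(loc=1.0 if hard else 0.0, size=(n, dim))
    return X, (X @ w_true > 0).astype(int)

X_easy, y_easy = make_data(10_000, hard=False)
X_hard, y_hard = make_data(10_000, hard=True)
X_test, y_test = make_data(50_000, hard=True)

# Subproblem 1: correct labels on the easy subset only.
student_1 = LogisticRegression(max_iter=1000).fit(X_easy, y_easy)

# Subproblem 2: weak labels on everything (5% flipped on easy, 30% on hard).
def weak(y, err):
    return np.where(rng.random(len(y)) < err, 1 - y, y)

X_all = np.vstack([X_easy, X_hard])
y_weak = np.concatenate([weak(y_easy, 0.05), weak(y_hard, 0.30)])
student_2 = LogisticRegression(max_iter=1000).fit(X_all, y_weak)

print("gold-on-easy student, hard-test accuracy:", student_1.score(X_test, y_test))
print("weak-labels student, hard-test accuracy:", student_2.score(X_test, y_test))
```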
Leo already answered, but yes, the rater problem applies to step one of that plan. And the hope is that progress can be made on solving the rater problem in this setting, because e.g. our reward model isn’t situationally aware.
Why would the reward model not be situationally aware?