Note on something from the superalignment section of Leopold Aschenbrenner’s recent blog posts:
Evaluation is easier than generation. We get some of the way “for free,” because it’s easier for us to evaluate outputs (especially for egregious misbehaviors) than it is to generate them ourselves. For example, it takes me months or years of hard work to write a paper, but only a couple hours to tell if a paper someone has written is any good (though perhaps longer to catch fraud). We’ll have teams of expert humans spend a lot of time evaluating every RLHF example, and they’ll be able to “thumbs down” a lot of misbehavior even if the AI system is somewhat smarter than them. That said, this will only take us so far (GPT-2 or even GPT-3 couldn’t detect nefarious GPT-4 reliably, even though evaluation is easier than generation!)
Disagree about papers. I don’t think it takes merely a couple of hours to tell if a paper is any good. In some cases it does, but in others, entire fields have been led astray for years by bad science. For example: the replication crisis in psychology, where numerous papers spurred tons of follow-up work on effects that apparently don’t exist; a year and dozens of papers later, we still don’t know whether DPO is better than PPO for frontier AI development (though perhaps this is known inside labs, and my guess is some people would argue this question is answered); and IIRC it took something like 4-8 months for the alignment community to decide CCS was bad (a contentious and oversimplifying take), despite many people reading the original paper. Properly vetting a paper in the way you will want to for automated alignment research, especially if you’re excluding fraud from your analysis, is about knowing whether the insights in the paper will be useful in the future; it’s not just checking whether the baseline comparisons use reasonable hyperparameters.
One counterpoint: it might be fine to have some work you mistakenly think is good, as long as it’s not existential-security-critical and you have many research directions being explored in parallel. That is, because you can run tons of your AIs at once, they can explore tons of research directions and do a bunch of the follow-up work needed to see whether an insight is important. There may not be a huge penalty for a slightly poor training signal, as long as it still gets the quality of outputs good enough.
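To make that intuition concrete, here is a toy simulation of my own (the setup and probabilities are invented for illustration; nothing here comes from Leopold’s post or from any lab’s practice): a noisy evaluator screens many parallel research directions, follow-up work is assumed to reveal the true quality of whatever survives screening, and we estimate how often at least one genuinely good direction makes it through.

```python
import random

# Toy model with made-up numbers: each "research direction" is genuinely good
# with probability P_GOOD, and a noisy evaluator misjudges any given direction
# with probability EVAL_ERROR. Follow-up work on shortlisted directions is
# assumed to reveal their true quality.
P_GOOD = 0.05        # fraction of directions that are actually good
EVAL_ERROR = 0.2     # chance the noisy evaluator misjudges a direction
N_DIRECTIONS = 200   # parallel directions the automated researchers explore
N_TRIALS = 2_000

def trial() -> bool:
    """True if at least one genuinely good direction survives noisy screening."""
    for _ in range(N_DIRECTIONS):
        good = random.random() < P_GOOD
        judged_good = good if random.random() > EVAL_ERROR else not good
        if good and judged_good:
            return True
    return False

hits = sum(trial() for _ in range(N_TRIALS))
print(f"P(some good direction survives screening) ≈ {hits / N_TRIALS:.3f}")
```

Under these made-up numbers the chance that no good direction survives is tiny, which is the sense in which a somewhat noisy training signal might not be fatal; the real question, of course, is how noisy “somewhat” gets to be.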
This [how easily can you evaluate a paper] is a tough question to answer — I would expect Leopold’s thoughts here to be dominated by times he has read shitty papers, rightly concluded they were shitty, and patted himself on the back for his paper-critique skills — I know I do this. But I don’t expect that being able to differentiate shitty vs. (okay + good + great) is enough. At a meta level, this post is yet another claim that “evaluation is easier than generation” will be pretty useful for automating alignment — I have grumbled about this before (though I can’t find anything I’ve finished writing up), and this is yet another largely unsubstantiated claim in that direction. There is a big difference between the claims “because evaluation is generally easier than generation, evaluating automated alignment research will be a non-zero amount easier than generating it ourselves” and “the evaluation-generation advantage will be enough to significantly change our ability to automate alignment research, and is thus a meaningful input into believing in the success of an automated alignment plan”; the first is very likely true, but the second maybe not.
On another note, the line “We’ll have teams of expert humans spend a lot of time evaluating every RLHF example” seems absurd. It feels a lot like how people used to say “we will keep the AI in a nice sandboxed environment”, and yet most user-facing AI products now ship with a bunch of tools and such. It sounds like an unrealistic safety dream. It also sounds terribly inefficient — it only works if your model learns very sample-efficiently from few examples, which is a particular bet I’m not confident in. And my god, the opportunity cost of having your $300k engineers label a bunch of complicated data! It looks to me like what labs are doing for self-play (I think my view is based on papers out of Meta and GDM) is using some automated verification, like code passing unit tests, over a ton of examples. If you are going to come around saying they’re going to pivot from ~free automated grading to using top engineers for this, the burden of proof is clearly on you, and the prior isn’t so good.
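For contrast, here is a minimal sketch of the kind of ~free automated grading I have in mind; the function name, test format, and use of a plain subprocess are my own assumptions for illustration, not details taken from any Meta or GDM paper. The training signal is simply whether model-generated code passes hand-supplied unit tests, with no expert labeler in the loop.

```python
import os
import subprocess
import tempfile
import textwrap

def grade_solution(solution_code: str, test_code: str, timeout_s: float = 5.0) -> float:
    """Return 1.0 if the generated solution passes the unit tests, else 0.0.

    Sketch only: a real pipeline would sandbox execution and limit resources,
    but the core signal is just "do the tests pass?".
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path],  # assumes a `python` executable on PATH
            capture_output=True,
            timeout=timeout_s,
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)

# Hypothetical example: a model-generated function plus a tiny hand-written test.
solution = "def add(a, b):\n    return a + b"
tests = textwrap.dedent("""
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
""")
print(grade_solution(solution, tests))  # -> 1.0
```

Once the tests exist, this signal costs roughly nothing per example, which is exactly why the picture of expert labels on every example is hard for me to believe.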
See also the position of the interior.
Hm, can you explain what you mean? My initial reaction is that AI oversight doesn’t actually look a ton like the position of the interior, where defenders must defend against every conceivable attack whereas attackers need only find one successful strategy. A large chunk of why I think these are disanalogous is that getting caught is actually pretty bad for AIs — see here.