Okay, I think it’s pretty clear that the crux between us is basically what I was gesturing at in my first comment, even if there are minor caveats that make it not exactly literally that.
I’m probably not going to engage with perspectives that say all current [alignment work towards building safer future powerful AI systems] is net negative, sorry. In my experience those discussions typically don’t go anywhere useful.
That’s fair. I agree that we’re not likely to resolve much by continuing this discussion. (but thanks for engaging—I do think I understand your position somewhat better now)
What does seem worth considering is adjusting research direction to increase focus on [searching for and better understanding the most important failure modes] - both of debate-like approaches generally, and of any [plan to use such techniques to get useful alignment work done].
I expect that this would lead people to develop clearer, richer models. Presumably this will take months rather than hours, but it seems worth it (whether or not I’m correct—I expect that [the understanding required to clearly demonstrate to me that I’m wrong] would be useful in a bunch of other ways).