That’s fair. I agree that we’re not likely to resolve much by continuing this discussion. (but thanks for engaging—I do think I understand your position somewhat better now)
What does seem worth considering is adjusting research directions to increase focus on [searching for and better understanding the most important failure modes] - both of debate-like approaches generally, and of any [plan to use such techniques to get useful alignment work done].
I expect that this would lead people to develop clearer, richer models. Presumably this will take months rather than hours, but it seems worth it (whether or not I’m correct—I expect that [the understanding required to clearly demonstrate to me that I’m wrong] would be useful in a bunch of other ways).