Buck comments on Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

Buck 5 Jan 2024 12:05 UTC
LW: 10 AF: 7
Another important point on this topic is that I expect it’s impossible to produce weak-to-strong generalization techniques that look good according to meta-level adversarial evaluations, while I expect that some scalable oversight techniques will look good by that standard. So it currently seems to me that scalable-oversight-style techniques are a more reliable response to the problem “your oversight performs worse than you expected, because your AIs are intentionally subverting the oversight techniques whenever they think you won’t be able to detect that they’re doing so”.