Unfortunately, I think that this problem extends up a meta-level as well: AI safety research is extremely difficult to evaluate. There’s extensive debate about which problems and techniques safety researchers should focus on, even extending to debates about whether particular research directions are actively harmful. The object- and meta-level problems are related—if we had an easy-to-evaluate alignment metric, we could check whether various alignment strategies lead to models scoring higher on this metric, and use that as a training signal for alignment research itself.
This makes me wonder: are there proxy metrics that we can use? By “proxy metric”, I mean something that doesn’t necessarily fully align with what we want, but is close to it or often correlated with it. Proxy metrics are gameable, so we can’t really trust their evaluations of powerful algorithmic optimizers. But human researchers are much weaker optimizers, so there might exist proxies that are a good enough guiding signal for us.
One possible proxy signal is “community approval”, operationalized as something like forum comments. I think this is a pretty shoddy signal, not least because community feedback often directly conflicts with itself. Another is evaluations from successful, established researchers, which is more informative but less scalable (and depends on your operationalization of “successful” and “established”).