The most recent thing I’ve seen on the topic is this post from yesterday on debate, which found that debate does basically nothing. In fairness there have also been some nominally-positive studies (which the linked post also mentions), though IMO their setup is more artificial and their effect sizes are not very compelling anyway.
My qualitative impression is that HCH/debate/etc have dropped somewhat in relative excitement as alignment strategies over the past year or so, more so than I expected. People have noticed the unimpressive results to some extent, and also other topics (e.g. mechinterp, SAEs) have gained a lot of excitement. That said, I do still get the impression that there’s a steady stream of newcomers getting interested in them.