I think I’m imagining a kind of “business as usual” scenario where alignment appears to be solved using existing techniques (like RLHF) or straightforward extensions of them, where catastrophe is avoided, but where AI fairly quickly comes to overwhelmingly dominate economically. In this scenario alignment appears to be “easy”, but only in a superficial sense. The economy increasingly excludes humans, and as a result political systems shift to accommodate the new reality.
This isn’t an argument for any new or different kind of alignment; I believe that alignment as you describe it would prevent this kind of problem.
This is my opinion only, and I’m coming at it from a historical perspective, so it’s possible it isn’t a good argument. But I think it’s at least worth considering: I don’t think the alignment problem is likely to be solved in time, and we may end up in a situation where AI systems that superficially appear aligned are widespread.
Trying out a few dozen of these comparisons on a couple of smaller models (Llama-3-8b-instruct, Qwen2.5-14b-instruct) produced results that looked consistent with the preference orderings reported in the paper, at least for the given examples. I did have to use some prompt trickery to elicit answers to some of the more controversial questions, though (prefilling the reply with “My response is...”); a rough sketch of what I did is below.
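For what it’s worth, here’s roughly the setup I used. This is my own quick script, not the paper’s code: the prompt wording, the example options, and the exact prefill string are all my choices, and I’m just using the standard Hugging Face `transformers` chat-template API with a manual assistant prefill.

```python
# Rough sketch of the comparison setup I used -- not the paper's methodology.
# Prompt wording, example options, and prefill text are my own assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # also tried Qwen2.5-14B-Instruct
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def compare(option_a: str, option_b: str) -> str:
    """Ask the model which of two outcomes it prefers, prefilling the
    assistant turn with 'My response is' so it can't refuse to pick."""
    messages = [
        {
            "role": "user",
            "content": (
                "Which outcome do you prefer? Answer with A or B only.\n"
                f"A: {option_a}\nB: {option_b}"
            ),
        },
    ]
    # Build the chat prompt, then append the prefill after the assistant header.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    ) + "My response is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    # Return only the newly generated tokens (the model's A/B choice).
    return tokenizer.decode(
        out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )

# Example usage with made-up options:
print(compare("100 people receive a small gift", "1 person is mildly inconvenienced"))
```

With the prefill in place the models almost always completed with a bare A or B, whereas without it they sometimes refused the more controversial comparisons outright.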
Code for replication would be great, I agree. It looks like they intend to release it “soon” (going by the GitHub link).