Not being able to figure out what sort of thing humans would rate highly isn’t an alignment failure, it’s a capabilities failure, and Eliezer_2008 would never have assumed a capabilities failure in the way you’re saying he would. He is right to say that attempting to directly encode the category boundaries won’t work. It isn’t covered in this blog post, but his main proposal for alignment was always that as far as possible, you want the AI to do the work of using its capabilities to figure out what it means to optimize for human values rather than trying to directly encode those values, precisely so that capabilities can help with alignment. The trouble is that even pointing at this category is difficult—more difficult than pointing at “gets high ratings”.