This is a great comment. I will have to think more about your overall point, but aside from that, you’ve made some really useful distinctions. I’ve been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (i.e., the Dr. Nefarious example is a mesa-optimization problem, but it’s about outer alignment). Or maybe inner alignment just shouldn’t be seen as the compliment of outer alignment! Objective quality vs. search quality is a nice dividing line, but it doesn’t cluster together the problems people have been trying to cluster together.
Haven’t read the full comment thread, but on this sentence: “Or maybe inner alignment just shouldn’t be seen as the compliment of outer alignment!”
Evan actually wrote a post to explain that it isn’t the complement for him (and not the compliment either :p)
Right, but John is disagreeing with Evan’s frame, and John’s argument that such-and-such problems aren’t inner alignment problems is that they are outer alignment problems.