You do mention the biggest issue with taking this as evidence of introspection, “Models only exhibit introspection on simpler tasks”, and yet the application you are going for is clearly very complex tasks where we can’t actually check the model’s work. This flaw seems likely fatal, but who knows at this point? (The fact that GPT-4o and Llama 70B do better than GPT-3.5 is some evidence, but see my later problems with this...)
I addressed this point here. Also see section 7.1.1 in the paper.