Being able to do this step in the most general setting seems to capture the entire difficulty of interpretability: if we could assess whether a model’s outputs faithfully reflect its internal “thinking”, and hence that the reasoning we see is all of its reasoning, that would be a huge jump forward, and perhaps even equivalent to solving something like ELK. Given that ELK is known to be quite difficult, and we currently don’t have solutions for it, I’m uncertain whether this reduction makes the problem easier. By the reduction I mean splitting the alignment of a language model into first verifying that all its visible reasoning is complete, correct and faithful, and then doing other steps (i.e. actively optimising against our measures of correct reasoning). Do you think it’s meaningfully different (i.e. easier) to solve “assess reasoning authenticity” completely than to solve ELK or another hard interpretability problem?