I agree that reading the CoT could be very useful and is a very promising area of research. In fact, I think reading CoTs could be a much more surefire interpretability method than mechanistic interpretability, although the latter is also quite important.
I feel like research showing that CoTs aren’t faithful isn’t meant to say “we should throw out the CoT.” It’s more like “naively, you’d think the CoT is faithful, but look, it sometimes isn’t. We shouldn’t take the CoT at face value, and we should develop methods that ensure that it is faithful.”
Personally, what I want most out of a chain of thought is that its literal, human-interpretable meaning contains almost all the value the LLM gets out of the CoT (vs. immediately writing its answer). Secondarily, it would be nice if the CoT didn’t include a lot of distracting junk that doesn’t really help the LLM (I suspect this was largely solved by o1, since it was trained to generate helpful CoTs).
I don’t actually care much about the LLM explaining why it believes things that it can determine in a single forward pass, such as “French is spoken in Paris.” It wouldn’t be practically useful for the LLM to think these things through explicitly, and these thoughts are likely too simple to be helpful to us.
If we get to the point where LLMs can frequently make huge, accurate logical leaps in a single forward pass that humans can’t follow at all, I’d argue that we should just make our LLMs smaller and focus on improving their explicit CoT reasoning ability, for the sake of maintaining interpretability.