If its true thoughts are transparent and expressed in natural language (see e.g. Measuring Faithfulness in Chain-of-Thought Reasoning)
This seems technically true but a bit of a trap, since it may be easier to get ‘looks like it expresses its thoughts in natural language’ than ‘reliably actually does,’ and specifying the difference may be too subtle for people.