Because a lot of interpretability will be about parsing gigantic internal trains of thought in natural language. This will probably demand sophisticated AI tools to aid it. Some of it will still be about decoding the representations in the LLM giving rise to that natural language. And there will be a lot of theory and experiment about what will cause the internal representation to deviate in meaning from the linguistic output. See this insightful comment by Gwern on pressures for LLMs to use steganography to make codes in their output. I suspect there are other pressures toward convoluted encoding and outright deception that I haven’t thought of. I guess that’s not properly considered interpretability, but it will be closely entwined with interpretability work to test those theories.
Those hurdles for interpretability research exist whether or not someone is using AutoGPT to run the LLM. My question is why you think the interpretability research done so far is less useful, because people are prompting the LLM to act agentically directly instead of {some other thing}.
The interpretability research done so far is still important, and we’ll still need more and better of the same, for the reason you point out. The natural language outputs aren’t a totally trustworthy indicator of the semantics underneath. But they are a big help and a new challenge for interpretability.
Because a lot of interpretability will be about parsing gigantic internal trains of thought in natural language. This will probably demand sophisticated AI tools to aid it. Some of it will still be about decoding the representations in the LLM giving rise to that natural language. And there will be a lot of theory and experiment about what will cause the internal representation to deviate in meaning from the linguistic output. See this insightful comment by Gwern on pressures for LLMs to use steganography to make codes in their output. I suspect there are other pressures toward convoluted encoding and outright deception that I haven’t thought of. I guess that’s not properly considered interpretability, but it will be closely entwined with interpretability work to test those theories.
Those hurdles for interpretability research exist whether or not someone is using AutoGPT to run the LLM. My question is why you think the interpretability research done so far is less useful, because people are prompting the LLM to act agentically directly instead of {some other thing}.
The interpretability research done so far is still important, and we’ll still need more and better of the same, for the reason you point out. The natural language outputs aren’t a totally trustworthy indicator of the semantics underneath. But they are a big help and a new challenge for interpretability.