The same issue applies to chain-of-thought as an interpretability approach: words only trace real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of the AGI’s thoughts are not exposed for direct inspection and can’t be compactly expressed in natural language without abandoning the customary semantics.
I think when most people suggest chain-of-thought tactics, they are imagining an interpretability tool that summarizes an AGI’s considerations in the form of words, AS IF it were a person thinking to themselves, in a way that’s supposed to increase an interpreter’s understanding of what the AGI is about to do. That holds even if the AGI is not literally thinking in terms of words (humans don’t think entirely in words either) and even though the stream is not a complete mathematical description of its behavior. Your concerns are correct but go way too far in implying an AI could not be DESIGNED to produce such a stream-of-thought which would have >0 value in managing some smarter-than-human AIs. The AI box test seems like it would be remarkably easier to pass if I had an analogous tool to run on Eliezer.
Obviously such a stream-of-thought is not an inherent feature of intelligent agents.
Most of the proposals I’ve heard do actually involve getting AI to think in terms of words as its primary internal data structure. But that’s not actually a crux for me. The more important part is this:
Your concerns are correct but go way too far in implying an AI could not be DESIGNED to produce such a stream-of-thought which would have >0 value in managing some smarter-than-human AIs.
>0 value, taken in isolation, is simply not a worthwhile goal to pursue in alignment research. Tons of things provide >0 value in isolation, but do not at all address any of the core subproblems or generalize beyond a specific architecture, and therefore will not cumulatively stack with other work and probably will not even apply to whatever architecture actually ends up being used. Epsilons don’t matter unless they stack.
Yeah, fair.