A brain-like AGI—modeled after our one working example of efficient general intelligence—would naturally have an interpretable inner monologue we could monitor.
This doesn’t have much to do with whether a mind is understandable. Most of my cognition is not found in the verbal transcript of my inner monologue, partly as I’m not that verbal a thinker, but mostly because most of my cognition is in my nonverbal System 1.
There was some related discussion here, to the effect that we could do something to try to make the AGI as verbal a thinker as possible, IIUC. (I endorse that as plausibly a good idea worth thinking about and trying. I don’t see it as sufficient / airtight.)
Though note that “we could do something to try to make the AGI as verbal a thinker as possible” is a far weaker claim than “A brain-like AGI—modeled after our one working example of efficient general intelligence—would naturally have an interpretable inner monologue we could monitor.” The corresponding engineering problem is much harder if we have to do something special to make the AGI think mostly verbally. Also, the existence of verbal-reasoning-heavy humans is not particularly strong evidence that we can make “most” of the load-bearing thought process verbal; it still seems to me like approximately all of the key “hard parts” of cognition happen on a non-verbal level even in the most verbalization-heavy humans.
The verbal monologue just sits near the highest level of abstraction in a multi-resolution compressed encoding, but we are not limited to monitoring only at the lowest bitrate (the highest level of abstraction/compression).
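A minimal sketch (my own illustration, not something from the thread) of what monitoring more than one level of such a hierarchy could look like in practice, assuming a PyTorch-style hierarchical encoder; the layer names and sizes here are made up:

```python
# Sketch: read out a hierarchical encoder at several abstraction levels,
# not only the top-level "verbal" summary. Purely illustrative toy model.
import torch
import torch.nn as nn

class ToyHierarchicalEncoder(nn.Module):
    """Stand-in for a multi-resolution world model: low-level -> mid-level -> 'verbal' summary."""
    def __init__(self):
        super().__init__()
        self.low = nn.Linear(512, 256)    # analogous to early sensory features (high bitrate)
        self.mid = nn.Linear(256, 64)     # intermediate abstractions
        self.verbal = nn.Linear(64, 16)   # compact top-level summary (low bitrate)

    def forward(self, x):
        h1 = torch.relu(self.low(x))
        h2 = torch.relu(self.mid(h1))
        return self.verbal(h2)

model = ToyHierarchicalEncoder()
captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()   # record activations for offline monitoring
    return hook

# Attach monitors at more than one level of the hierarchy, not just the top.
for name in ["low", "mid", "verbal"]:
    getattr(model, name).register_forward_hook(make_hook(name))

_ = model(torch.randn(1, 512))
for name, act in captured.items():
    print(name, tuple(act.shape))   # low: (1, 256), mid: (1, 64), verbal: (1, 16)
```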
There is already significant economic pressure on DL systems towards being ‘verbal’ thinkers: nearly all large-scale image models are now image->text and text->image models, and the corresponding world_model->text and text->world_model design is only natural for robotics and AI approaching AGI.
This has been discussed before. Your example of not being a verbal thinker is not directly relevant because 1) inner monologue need not be strictly verbal, and 2) we need only a few examples of strong human thinkers with verbal inner monologues to show that it isn’t an efficiency disadvantage; so even if your brain type is less monitorable, we are not confined to that design.
I also do not believe your central claim: based on my knowledge of neuroscience, disabling the brain modules responsible for your inner monologue would not only disable your capacity for speech, it would also seriously impede your cognition and render you largely incapable of executing complex long-term plans.
Starting with a brain-like AGI, there are several obvious low-cost routes to dramatically improve automated cognitive inspectability. A key insight is that there are clear levels of abstraction in the brain (as predicted by the need to compress sensory streams for efficient Bayesian prediction), and the inner monologue sits at the top of the abstraction hierarchy, which maximizes information utility per bit. At the bottom of the hierarchy would be something like V1, which would be mostly useless to monitor (minimal value per bit).
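As a toy illustration of that value-per-bit framing (all numbers below are invented for the example, including the rough bits/sec figures), one could rank candidate monitoring levels like this:

```python
# Toy illustration of "information value per bit": choose which levels of the
# abstraction hierarchy to monitor by expected interpretive value per recorded bit.
from dataclasses import dataclass

@dataclass
class Level:
    name: str
    bitrate: float          # rough bits/sec of the activation stream at this level (made up)
    value_per_sec: float    # assumed usefulness of monitoring it (arbitrary units)

    @property
    def value_per_bit(self) -> float:
        return self.value_per_sec / self.bitrate

levels = [
    Level("V1-like features",   bitrate=1e8,  value_per_sec=1.0),  # huge stream, little insight
    Level("object/scene layer", bitrate=1e5,  value_per_sec=5.0),
    Level("inner monologue",    bitrate=40.0, value_per_sec=8.0),  # tiny stream, high insight
]

for lvl in sorted(levels, key=lambda l: l.value_per_bit, reverse=True):
    print(f"{lvl.name:20s} value/bit = {lvl.value_per_bit:.2e}")
```

On made-up numbers like these, the monologue wins by orders of magnitude on value per bit even though the lower levels carry vastly more raw information, which is the point of monitoring near the top of the hierarchy first.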
To the extent that we are using the existence of humans with an inner monologue as evidence for the suitability of this approach to AGI, I think it also serves as evidence for why that approach isn’t really safe.
As an example: if I try, I can control my inner monologue to some extent. I could make my inner monologue about “officially approved actions” very loud, and run a secondary thread of thinking about “non-approved thoughts” in the background: “quieter”, more fragmented, and killed off every time it made it into the inner monologue.
Sure, there will be some traces. (A lot, even, I kinda suck at this :-).) But I think this already brings us outside of the realm of “interpretable by default” and into “a race between your interpretability and my cognitive steganography”.
Note that I, like most humans, didn’t have a particular need to develop this exact type of cognitive steganography. (The “elephant in the brain” type of action-thought inconsistency is not exactly this.) As a result, I think we would be able to get much, much better very quickly if there were an incentive to do it. (Not talking evolutionary scale. More like “give me a weekend to play with this”.)
And while I have no incentive to develop these skills, the same cannot be said for a brain-like AI that we are (trying to) interpret as it thinks.