jacob_cannell comments on How could we know that an AGI system will have good consequences?

jacob_cannell 8 Nov 2022 18:15 UTC
LW: 2 AF: 2
0
AF
This has been discussed before. Your example of not being a verbal thinker is not directly relevant because 1.) inner monologue need not be strictly verbal, 2.) we need only a few examples of strong human thinkers with verbal inner monologues to show that isn’t an efficiency disadvantage—so even if your brain type is less monitorable we are not confined to that design.

I also do not believe your central claim—in that based on my knowledge of neuroscience—disabling the brain modules responsible for your inner monologue will not only disable your capacity for speech, it will also seriously impede your cognition and render you largely incapable of executing complex long term plans.

Starting with a brain-like AGI, there are several obvious low-cost routes to dramatically improve automated cognitive inspectability. A key insight is that there are clear levels of abstraction in the brain (as predicted by the need to compress sensory streams for efficient bayesian prediction) and the inner monologue is at the top of the abstraction hierarchy, which maximizes information utility per bit. At the bottom of the abstraction hierarchy would be something like V1, which would be mostly useless to monitor (minimal value per bit).
- VojtaKovarik 20 Nov 2022 14:44 UTC
  LW: 1 AF: 1
  0
  AF Parent
  To the extent that we are using the existence of humans with inner monologue as evidence for the suitability of this approach to AGI, I think it also serves as evidence for why that approach isn’t really safe.
  
  As an example: If I try, I can control my inner monologue to some extent. For example, I could make my inner monologue about “officially approved actions” very loud. And run a secondary thread of thinking about “non-approved thoughts” in the background—“quieter” and in a more fragmented manner, and killing it off everytime it made it into the inner monologue.
  
  Sure, there will be some traces. (A lot, even, I kinda suck at this :-).) But I think this already brings us outside of the realm of “interpretable by default” and into “a race between your interpretability and my cognitive steganography”.
  
  Note that I, or most humans, didnt have a particular need to develop this exact type of cognitive steganography. (The “elephant in the brain” type of actions-thoughts inconsistency is not exactly this.) As a result, I think we would be able to get much much better very quickly if there was an incentive to do it. (Not talking evolutionary scale. More like “give me a weekend to play with this”.) And while I have no incentive to develop these skills, the same cannot be said for a brain-like AI that we are (trying to) interpret as it thinks.