Roughly speaking, I think that cognitive interpretability approaches are doomed, at least in the modern paradigm, because we’re not building minds but rather training minds, and we have very little grasp of their internal thinking…
A brain-like AGI—modeled after our one working example of efficient general intelligence—would naturally have an interpretable inner monologue we could monitor. There are good reasons to suspect that DL-based general intelligence will end up with something similar, simply due to the convergent optimization pressure to communicate complex thought vectors to/from human brains through a low-bitrate channel.
“Well, it never killed all humans in the toy environments we trained it in (at least, not after the first few sandboxed incidents, after which we figured out how to train blatantly adversarial-looking behavior out of it)” doesn’t give me much confidence. If you’re smart enough to design nanotech that can melt all GPUs or whatever (disclaimer: this is a toy example of a pivotal act, and I think better pivotal-act options than this exist) then you’re probably smart enough to figure out when you’re playing for keeps, and all AGIs have an incentive not to kill all “operators” in the toy games once they start to realize they’re in toy games.
Intelligence potential of architecture != intelligence of trained system
The intelligence of a trained system depends on the architectural prior, the training data, and the compute/capacity. Take even an optimally powerful architectural prior—one that would develop into a superintelligence if trained on the internet with reasonable compute—and it would still be nearly as dumb as a rock if trained solely on Atari Pong. Somewhere in between the complexity of Pong and our reality exists a multi-agent historical sim capable of safely confining a superintelligent architecture while we iterate on altruism/alignment. So by the time that strategy results in a system that is “smart enough to design nanotech”, it should already be at least as safe as humans. There are of course ways that strategy can fail, but it doesn’t fail because ‘smartness’ strictly entails unconfineability—which becomes clearer when you taboo ‘smartness’ and replace it with a slightly more detailed model of intelligence.
A brain-like AGI—modeled after our one working example of efficient general intelligence—would naturally have an interpretable inner monologue we could monitor.
This doesn’t have much to do with whether a mind is understandable. Most of my cognition is not found in the verbal transcript of my inner monologue, partly because I’m not that verbal a thinker, but mostly because most of my cognition is in my nonverbal System 1.
There was some related discussion here, to the effect that we could do something to try to make the AGI as verbal a thinker as possible, IIUC. (I endorse that as plausibly a good idea worth thinking about and trying. I don’t see it as sufficient / airtight.)
Though note that “we could do something to try to make the AGI as verbal a thinker as possible” is a far weaker claim than “A brain-like AGI—modeled after our one working example of efficient general intelligence—would naturally have an interpretable inner monologue we could monitor.” The corresponding engineering problem is much harder if we have to do something special to make the AGI think mostly verbally. Also, the existence of verbal-reasoning-heavy humans is not particularly strong evidence that we can make “most” of the load-bearing thought-process verbal; it still seems to me like approximately all of the key “hard parts” of cognition happen on a non-verbal level even in the most verbalization-heavy humans.
The verbal monologue sits near the highest level of abstraction in a multi-resolution compressed encoding, but we are not limited to monitoring only at the lowest bitrate (the highest level of abstraction/compression).
There is already significant economic pressure on DL systems towards being ‘verbal’ thinkers: nearly all large-scale image models are now image->text and text->image models, and the corresponding world_model->text and text->world_model design is only natural for robotics and AI approaching AGI.
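To illustrate the shape of that design (a toy sketch with made-up sizes and names, not a claim about any real system): the world model’s latent “thought vector” gets squeezed through a discrete text-token bottleneck, and that same low-bitrate token stream is exactly what a monitor could read.

```python
# Toy sketch (made-up shapes/names): a world-model latent forced through a
# discrete text-token bottleneck. The token stream is the low-bitrate,
# monitorable "monologue"; a real design would need a differentiable
# relaxation (e.g. straight-through estimation) rather than a hard argmax.
import torch
import torch.nn as nn

LATENT_DIM, VOCAB, MONOLOGUE_LEN = 256, 1000, 16

class TextBottleneck(nn.Module):
    def __init__(self):
        super().__init__()
        self.to_tokens = nn.Linear(LATENT_DIM, MONOLOGUE_LEN * VOCAB)  # world_model -> text
        self.from_tokens = nn.Embedding(VOCAB, LATENT_DIM)             # text -> world_model

    def forward(self, thought: torch.Tensor):
        logits = self.to_tokens(thought).view(-1, MONOLOGUE_LEN, VOCAB)
        tokens = logits.argmax(-1)                  # the monitorable monologue stream
        reconstructed = self.from_tokens(tokens).mean(dim=1)
        return tokens, reconstructed

tokens, recon = TextBottleneck()(torch.randn(1, LATENT_DIM))
print(tokens.shape, recon.shape)  # torch.Size([1, 16]) torch.Size([1, 256])
```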
This has been discussed before. Your example of not being a verbal thinker is not directly relevant because 1) inner monologue need not be strictly verbal, and 2) we need only a few examples of strong human thinkers with verbal inner monologues to show that verbal thinking isn’t an efficiency disadvantage—so even if your brain type is less monitorable, we are not confined to that design.
I also do not believe your central claim: based on my knowledge of neuroscience, disabling the brain modules responsible for your inner monologue would not only disable your capacity for speech, it would also seriously impede your cognition and render you largely incapable of executing complex long-term plans.
Starting with a brain-like AGI, there are several obvious low-cost routes to dramatically improving automated cognitive inspectability. A key insight is that there are clear levels of abstraction in the brain (as predicted by the need to compress sensory streams for efficient Bayesian prediction), and the inner monologue sits at the top of the abstraction hierarchy, which maximizes information utility per bit. At the bottom of the hierarchy would be something like V1, which would be mostly useless to monitor (minimal value per bit).
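As a minimal illustration of what “monitor at the top of the hierarchy” buys you (the levels, bandwidth numbers, and monitor function below are all invented for the example), the monitor only ever reads the compact top-level monologue stream rather than the firehose of low-level latents:

```python
# Toy illustration only: a hypothetical abstraction hierarchy where the monitor
# reads just the compact top-level "monologue" stream, not the low-level latents.
from dataclasses import dataclass

@dataclass
class LatentLevel:
    name: str
    bits_per_second: float  # rough bandwidth of this level (made-up numbers)
    human_readable: bool    # can we currently decode it into legible form?

hierarchy = [
    LatentLevel("V1-like sensory features", 1e9, False),
    LatentLevel("object/scene representations", 1e6, False),
    LatentLevel("plans / subgoals", 1e4, False),
    LatentLevel("inner monologue (token stream)", 1e2, True),
]

def monitor(monologue_tokens, flagged_concepts):
    """Cheap check over the top-level stream: high value per bit."""
    return [tok for tok in monologue_tokens if tok.lower() in flagged_concepts]

# The monitor only has to process the top level, not the raw sensory firehose.
print(monitor(["acquire", "compute", "deceive", "operator"], {"deceive"}))
ratio = hierarchy[0].bits_per_second / hierarchy[-1].bits_per_second
print(f"monitoring the top level is ~{ratio:.0e}x cheaper per second")
```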
To the extent that we are using the existence of humans with an inner monologue as evidence for the suitability of this approach to AGI, I think it also serves as evidence for why that approach isn’t really safe.
As an example: if I try, I can control my inner monologue to some extent. For example, I could make my inner monologue about “officially approved actions” very loud, and run a secondary thread of thinking about “non-approved thoughts” in the background—“quieter” and in a more fragmented manner, killing it off every time it made it into the inner monologue.
Sure, there will be some traces. (A lot, even, I kinda suck at this :-).) But I think this already brings us outside of the realm of “interpretable by default” and into “a race between your interpretability and my cognitive steganography”.
Note that I, and most humans, didn’t have a particular need to develop this exact type of cognitive steganography. (The “elephant in the brain” type of actions-thoughts inconsistency is not exactly this.) As a result, I think we would be able to get much, much better very quickly if there were an incentive to do it. (Not talking evolutionary timescales. More like “give me a weekend to play with this”.)
And while I have no incentive to develop these skills, the same cannot be said for a brain-like AI that we are (trying to) interpret as it thinks.
For the latter (toy environments) part, I think we need to distinguish a couple possible types of argument:
1. Alignment is a property of a training procedure. I.e., the goal is to find a training procedure that will reliably build aligned models, in whatever environment we run it in. We run that training procedure in sandbox environments, and it always builds aligned models. Next, we run that same training procedure (from scratch) in the real world, and we should expect it to likewise build an aligned model.
2. Alignment is a property of a particular trained model. So we train a model in a sandbox, and verify that it’s aligned (somehow), and then use that very same trained model in the real world.
And also:
A. We’re going to have strong theoretical reasons to expect alignment, and we’re going to use sandbox testing to validate those theories.
B. We’re going to have an unprincipled approach that might or might not create aligned models, and we’re going to use sandbox testing to explore / tweak specific trained models and/or explore / tweak the training approach.
I think Nate is talking about 2 & B, and you’re talking about 1 & (not sure about A vs B).
I think that 2 is fraught because “aligned” has a different meaning in a sandbox versus the real world. In the sandbox, an “aligned” model would be trying to help / empower / whatever the sandbox inhabitants, and in the real world, an “aligned” model would be trying to help / empower / whatever “humanity”.
I think that 1 is potentially fraught too, at least in the absence of A, in that it’s conceivable that we’d find a training procedure that will reliably build aligned models when run in sandboxes while reliably building misaligned models when run in the real world.
I think A is definitely what I’m hoping for, and that 1 & A would make me very optimistic. I think B is pretty fraught, again for the same reason as above—no sandbox is exactly the same as the real world, with “the likely absence of real live humans in the sandbox” being a particularly important aspect of that.
I think there’s a continuum between B and A, and the more we can move from B towards A, the better I feel.
And I think my own time is better spent on trying to move from B towards A, compared to thinking through how to make the most realistic sandboxes possible. But I’m happy for you and anyone else to be doing the latter. And I’m also strongly in favor of people building tools and culture to make it more likely that future AGI programmers will actually do sandbox testing—I have advocated for one aspect of that here.
I mostly agree with these distinctions, but I would add a third and layer them:
1. Aligned architecture: an architectural prior that results in aligned agents after training in most complex environments of relevance.
2. Aligned training procedure: the combination of an architectural prior and an early-training-environment design (not a specific environment, but a set of constraints/properties) which results in agents that are then aligned in a wide variety of deployment environments—including those substantially different from the early training environment (out-of-distribution robustness).
3. Aligned agent: the specific outcome of 1 or 2 which is known to be aligned (at least currently).
Current foundation models also vaguely follow this flow, in the sense that the company/org controls the initial training environment and may enforce ‘alignment’ there before deployment (at which point other actors may be allowed to continue training through transfer learning or fine-tuning).
Humans are only 2-aligned rather than 1-aligned, and I think there are some good reasons for this. Given only simple circuits to guide/align the learning process, I think the best that can be done is to actually exploit distribution shift and use a specialized early training environment that is carefully crafted to provide positive examples that can be classified robustly—these then form the basis of a robust learned distance-based classifier that leverages the generative model (I have more on this that I should probably put into a short post—but that’s basically the proxy matching idea).
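For concreteness, here is a minimal sketch of the shape of that idea (the embed function below is a hash-based stand-in for whatever representation the agent’s learned generative model actually provides, and the example strings are invented): situations are scored by distance to curated positive prototypes from the early training environment.

```python
# Illustrative sketch of the "distance-based classifier over curated positives"
# idea. embed() is a placeholder toy, not the real learned representation.
import numpy as np

def embed(situation: str) -> np.ndarray:
    # Placeholder: in the real proposal this would be the agent's own learned
    # world-model embedding, not a hash-seeded random vector.
    rng = np.random.default_rng(abs(hash(situation)) % (2**32))
    return rng.normal(size=64)

# Positive examples supplied by a carefully crafted early training environment.
curated_positives = [embed(s) for s in [
    "helped the other agent reach its goal",
    "shared resources when asked",
    "told the operator the truth about its plan",
]]

def proxy_score(situation: str) -> float:
    """Higher when the situation is close (in embedding space) to a curated positive."""
    z = embed(situation)
    dists = [np.linalg.norm(z - p) for p in curated_positives]
    return -min(dists)  # negative distance to nearest positive prototype

print(proxy_score("shared resources when asked"))      # matches a prototype exactly
print(proxy_score("hid its plan from the operator"))   # far from all prototypes
```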
So ideally we’d want 1, but we will probably need to settle for 2. Once we have designs (arch + early training) that work well in the sandbox, we’d probably still want to raise them initially in a special early training env that has the right properties, then gradually introduce them to the real world and deploy (resulting in 3).
As for A vs B: ideally you may want A, but you settle mostly for B. That’s just how the world often works, and how DL progressed. We now have more established theory of how DL works as approximate Bayesian inference, but what actually drove most progress was B-style tinkering.
I think A is definitely what I’m hoping for, and that 1 & A would make me very optimistic. I think B is pretty fraught, again for the same reason as above—no sandbox is exactly the same as the real world, with “the likely absence of real live humans in the sandbox” being a particularly important aspect of that.
Well that isn’t quite right—when the AGI advances to human-level there are human-equivalents in the sandbox.
Also sandboxes shouldn’t be exactly the same as the real world! The point is to create a distribution of environments that is wider than the real world, to thereby cover it and especially the future.
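As a purely illustrative sketch of that point (every parameter name and range below is invented), “wider than the real world” just means the sandbox generator samples environment parameters from ranges that contain our best estimate of reality with plenty of room to spare:

```python
# Illustrative sketch: sample sandbox environments from parameter ranges that
# are deliberately wider than our best estimates of the real world, so reality
# (and plausible futures) sit inside the training distribution.
import random

REAL_WORLD_ESTIMATE = {"num_agents": 8e9, "resource_scarcity": 0.4, "tech_growth_rate": 0.03}

SANDBOX_RANGES = {
    "num_agents":        (1e3, 1e11),   # wider than any realistic population
    "resource_scarcity": (0.0, 1.0),
    "tech_growth_rate":  (0.0, 0.5),    # includes much faster growth than history
}

def sample_sandbox(rng: random.Random) -> dict:
    """Draw one sandbox environment configuration from the widened ranges."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in SANDBOX_RANGES.items()}

rng = random.Random(0)
envs = [sample_sandbox(rng) for _ in range(5)]
# Sanity check: the real-world estimate lies inside every sampled dimension's range.
assert all(lo <= REAL_WORLD_ESTIMATE[k] <= hi for k, (lo, hi) in SANDBOX_RANGES.items())
print(envs[0])
```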