Use that understanding and those interpretability tools to instill a target value (e.g., corrigibility, niceness, or libertarianism) in a powerful language model.
… why specialize to language models? That choice seems basically orthogonal to the core ideas of shard theory. Insofar as Team Shard is trying to develop and leverage shard theory, seems like you should aim for a more generally-useful shard based approach, rather than specialize to language models.
… and develop powerful chain-of-thought interpretability tools to examine those trained values in action.
I’m extremely doubtful that chain-of-thought can produce sufficiently powerful interpretability tools at all, even in principle, for reasons similar to Eliezer’s complaint in List of Lethalities #32:
32. Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can’t be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.
The same issue applies to chain-of-thought as an interpretability approach: words only trace real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of the AGI’s thoughts are not exposed for direct inspection and can’t be compactly expressed in natural language without abandoning the customary semantics.
I’m about 80% sold on that. In principle, I can imagine sufficiently good interpretability plus empirical observation could substitute for a mechanistic theory (assuming we train slowly and pause a lot to see what’s going on inside), but the theoretical progress required for those interpretability tools would overlap pretty damn heavily with the theoretical progress required for value formation theory anyway.
We bet that there are a bunch of quantitative relationships here just waiting to be discovered—that there’s a lot of systematic structure in what learned values form given which training variables. To ever get to these quantitative relationships, we’ll need to muck around with language model fine-tuning under different conditions a lot.
That seems like a weird starting point. It should be much easier to understand the relevant pressures in toy models, rather than jumping straight into a full-scale language model where you don’t have any idea what precise things you’re looking for (e.g. how values or abstractions are represented internally). I’m all for empirical feedback loops, but you need some idea of what to look at in order for that feedback loop to actually tell you anything useful.
Once we have this running with smarter language models, though, we’ll be able to observe what environments and training variables induce what off-distribution behaviors in the models.
That doesn’t really seem to address the hard parts of “a mechanistic theory of algorithm formation in trained intelligences”. Behavior just isn’t that informative, you need to look at the insides and figure out how stuff is represented.
Furthermore, once we have chain-of-thought interpretability tools, we’ll be able to look at these learned values as they run and train using that interpretability power.
??? How on earth is a bunch of chain-of-thought text going to allow you to “look at learned values as they run”? However values are represented internally (assuming language models even have value-like internal structures, which is itself dubious at this point), they sure as hell aren’t represented in natural language.
This will involve some amount of tuning the models to be interpretable in the first place, by first getting the models externally monologuing about their decision-making and then ensuring that the decisions outputted by the model are causally downstream of the legible monologue.
None of that actually makes the values interpretable/legible, it just produces a model which spits out a bunch of English kinda related to some values maybe.
It should be much easier to understand the relevant pressures in toy models, rather than jumping straight into a full-scale language model where you don’t have any idea what precise things you’re looking for (e.g. how values or abstractions are represented internally). I’m all for empirical feedback loops, but you need some idea of what to look at in order for that feedback loop to actually tell you anything useful.
We’re currently working with language-model-text-adventures for three overall reasons:
There’s a nasty capabilities confound in working with most RL models. We found that MineRL agents are likely too stupid to be appreciably capable in Minecraft, and so it won’t be clear whether any given action of theirs occurred because of some value behind it or because the model was making an egregious mistake. It wouldn’t be especially clear if any values at all were forming in a model, if the model just crafts the crafting bench … and then spazzes around helplessly.
And if we scale down to very small RL settings, where the model is competent to its toy world, we just don’t expect the degenerate case of learned value formation we’d see there to meaningfully generalize to larger models.
We’re working with language models rather than CoinRun RL agents (which wouldn’t suffer from a capabilities confound) because we’ll get more bits of data out of the language models! If you force the model to rely on a monologue scratchpad to do any of its inter-forward-pass computation, then the model will have to air much of its computation to us in that scratchpad.
…we’re still going to hit all of the latent computation in the model with all the interpretability tools we can, though—one example being causal tracing of the value-laden outputs of the model.
We tentatively expect that i.i.d. training doesn’t select for the most dangerous coherence-properties as strongly as action-dependent training setups do. So we’re especially interested in looking at docile shard formation in language models. Very plausibly, this will be how the AGI is trained, and so we want to know as much as possible about this special case of learned value formation.
Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can’t be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.
Words only trace real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of the AGI’s thoughts are not exposed for direct inspection and can’t be compactly expressed in natural language without abandoning the customary semantics.
This is a serious worry! We’re hoping to get the model to do as much of its computation visibly, in words, as possible. There’s going to be a lot of additional, significant latent computation going on, and we’ll have to interpret that with other interpretability tools. The tighter the serial-computation limits placed on a transformer forward-pass, by running anti-steganography tools and limiting serial depth, the less slack the model has to form a mesa-optimizer inside a single forward pass that can then deceive us.
This is all for naught if we don’t learn something about genuine consequentialists by studying modern language models in this way. But both kinds of systems share an underlying architecture, i.e., a massive neural network, and it’s plausible that what we find about value formation in the most powerful ML systems today will generalize to tomorrow’s ML.
The same issue applies to chain-of-thought as an interpretability approach: words only trace real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of the AGI’s thoughts are not exposed for direct inspection and can’t be compactly expressed in natural language without abandoning the customary semantics.
I think when most people suggest chain-of-thought tactics, they are imagining an interpretability tool that summarizes an AGI’s considerations in the form of words, AS IF it were a person thinking to themselves, in a way that’s supposed to increase an interpreter’s understanding of what the AGI is about to do, even if the AGI is not literally thinking in terms of words or humans don’t think entirely in words or it’s not a complete mathematical description of its behavior. Your concerns are correct but go way too far in implying an AI could not be DESIGNED to produce such a stream-of-thought which would have >0 value in managing some smarter-than-human AIs. The AI box test seems like it would be remarkably easier to pass if I had an analogous tool to run on Eliezer.
Obviously it’s not an inherent feature for intelligent agents.
Most of the proposals I’ve heard do actually involve getting AI to think in terms of words as its primary internal data structure. But that’s not actually a crux for me. The more important part is this:
Your concerns are correct but go way too far in implying an AI could not be DESIGNED to produce such a stream-of-thought which would have >0 value in managing some smarter-than-human AIs.
>0 value, taken in isolation, is simply not a worthwhile goal to pursue in alignment research. Tons of things provide >0 value in isolation, but do not at all address any of the core subproblems or generalize beyond a specific architecture, and therefore will not cumulatively stack with other work and probably will not even apply to whatever architecture actually ends up being key to use. Epsilons don’t matter unless they stack.
The same issue applies to chain-of-thought as an interpretability approach: words only trace real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of the AGI’s thoughts are not exposed for direct inspection and can’t be compactly expressed in natural language without abandoning the customary semantics.
In any reasonably sane AGI project the human overseers will have powerful real-time debugging/introspection into the AGI minds, and the AGIs should grow up in carefully sealed sandbox sims so they aren’t aware of their true nature (to prevent deception etc).
And given that AGI necessitates human language capability, that implies that some chunk of its brain will be devoted to compressing and summarizing the internal state and computations down to a linguistic text stream (because that is what language actually is). So the AGI will likely also have an inner monologue which will naturally be hooked up to powerful database/search/indexing/monitoring tools. We can also go deeper if required and train secondary models to inspect/summarize in arbitrary detail.
And given that AGI necessitates human language capability, that implies that some chunk of its brain will be devoted to compressing and summarizing the internal state and computations down to a linguistic text stream (because that is what language actually is). So the AGI will likely also have an inner monologue which will naturally be hooked up to powerful database/search/indexing/monitoring tools.
That… was wayyyyy too big a jump in reasoning. The ability to use human language externally does not at all imply that the bulk of internal state/computations needs to be summarized to a text stream, nor does it imply that language needs to be produced internally on a continuous basis (as opposed to being produced on the fly, as needed).
If I were going to make a case that AGI will have an internal monologue, I’d say “well humans have an internal monologue, so that’s evidence that an internal monologue is convergent”. But even in humans, people’s internal visualizations, urges and flinches, instinctive valuations, and tons of other stuff are not continuously transcribed into the internal monologue; the monologue captures only a specific narrow slice of what the mind is doing. It’s not a broad compression of all the internal activity. And humans can totally just turn off their inner monologue, and keep going about their day functioning just fine.
The ability to use human language externally does not at all imply that the bulk of internal state/computations needs to be summarized to a text stream, nor does it imply that language needs to be produced internally on a continuous basis (as opposed to being produced on the fly, as needed).
Summarizable rather than summarized, but this is in fact just how the brain works (or most brains) and certainly is a valid path to AGI. So the key question is not “is inner monologue essential?”, which you seem to be arguing against, but rather “does inner monologue have much, if any, performance disadvantage?”. I’d roughly estimate that turning off the language production centers in the brain would save about 10%.
But even in humans, people’s internal visualizations, urges and flinches, instinctive valuations, and tons of other stuff are not continuously transcribed into the internal monologue;
Text/speech is an attention-guided summarization/compression, so naturally it doesn’t capture everything, but that’s the point! It is the highest-level abstract summary, and I mentioned we can additionally train aux models to zoom in on any timeslice of thought, to the extent that turns out to be actually useful relative to the compute/storage/bandwidth costs.
And humans can totally just turn off their inner monologue, and keep going about their day functioning just fine.
What? Evidence for this? I can think of only meditation, but people aren’t doing any important intellectual work while meditating. Regardless, even if some humans can turn off their inner monologue, that would only be relevant if that ability were necessary for higher intelligence (which it near certainly is not).
I turn off my internal monologue intentionally sometimes, it’s not particularly difficult (try it!). Also it does seem to improve performance in some ways—turns out the transcription to language is slow, my thoughts zoom ahead faster when they’re not waiting around for the monologue to catch up. IIRC people in flow-state usually don’t have a monologue going, which would be additional evidence of the performance cost in humans assuming I’m remembering correctly.
Summarizable rather than summarized, but this is in fact just how the brain works (or most brains) and certainly is a valid path to AGI.
Empirically, people seem to have an enormous amount of difficulty expressing many parts of their internal cognition verbally. So no, a lot of it doesn’t even seem to be summarizable, at least given a typical human’s facility with natural language.
Note that inner monologue can be non-verbal—mine often contains abstract pictorial/visual components. For deaf/mute people it likely develops visually.
Empirically, people seem to have an enormous amount of difficulty expressing many parts of their internal cognition verbally. So no, a lot of it doesn’t even seem to be summarizable, at least given a typical human’s facility with natural language.
I guess I may be unusual in this regard then. But naturally the type of brain-like AGI I tend to think of is likely biased towards my own particular cognition and self-understanding thereof—and perhaps this is also true for you.
Regardless: even if 20% of humanity had no inner monologue or 50%, or 99%, it simply wouldn’t matter at all to my larger point: I need only a few examples of high functioning intelligent humans with inner monologues to demonstrate that having an inner monologue is clearly not an efficiency disadvantage! This is a low cost safety feature.
Finally, I’ll conclude with that Helen Keller quote:
“Before my teacher came to me, I did not know that I am. I lived in a world that was a no-world. I cannot hope to describe adequately that unconscious, yet conscious time of nothingness. I did not know that I knew aught, or that I lived or acted or desired. I had neither will nor intellect. I was carried along to objects and acts by a certain blind natural impetus. I had a mind which caused me to feel anger, satisfaction, desire… ”
I need only a few examples of high functioning intelligent humans with inner monologues to demonstrate that having an inner monologue is clearly not an efficiency disadvantage!
I would like to point out that what johnswentworth said about being able to turn off an internal monologue is completely true for me as well. My internal monologue turns itself on and off several (possibly many) times a day when I don’t control it, and it is also quite easy to tell it which way to go on that. I don’t seem to be particularly more or less capable with it on or off, except on a very limited number of tasks. Simple tasks are easier without it, while explicit reasoning and storytelling are easier with it. I think my default is off when I’m not worried (but I do an awful lot of intentional verbal daydreaming and reasoning about how I’m thinking too).
If DeepMind raises their AGI in a carefully sealed sandbox sim, where it doesn’t know that it’s in a sandbox or that humans exist, there’s still a problem that next year Meta or whatever will raise an AGI without a sandbox. Have you thought about what to do about that problem? Is there some significant task that DeepMind can do with its sandboxed AGI (which doesn’t know about humans and the real world) that prevents Meta from making the unsandboxed AGI afterwards, or that prevents this latter AGI from causing harm? If so, what? Or are you imagining that DeepMind eventually lets the AGI out of the sandbox? If so, how does the sandbox help? Sorry if you already answered this in a blog post a decade ago, I haven’t read those in a while. :)
Building something of significance requires design/develop/test/evaluate iterations. A sandbox sim is just what that looks like for safe AGI. Simboxing is not the design for aligned AGI itself, rather it’s the process/setup that allows teams to build/test/evaluate without risking the world.
The tech infra for sandbox sims is similar to advanced game tech anyway and can be shared. Numerous AGI teams can compete safely on alignment benchmarks in simboxes while closely guarding their design secrets/code, running their AGI securely, etc.
Imagine an alternate Earth-like world that was so fully inhabited that the only two options to test nuclear weapons were 1.) blowing up inhabited cities or perhaps the entire world, or 2.) running big computer sims. Only one of these options is sane.
So with that background in mind—if DM develops aligned AGI first, then hopefully Meta no longer matters, as DM’s capability gap will quickly expand.
The hard part is not training AGI once you have the seed (as we know that only requires about 1e23 sparse flops and probably less, which is roughly equivalent to perhaps 1e26 dense flops—so only 3 OOM beyond GPT3). And a single training run gives you unlimited potential clones of the same AGI mind.
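As a rough sanity check on the arithmetic above, here is a minimal Python sketch. The 1e23 and 1e26 figures are the ones quoted in this comment; the GPT-3 training compute of roughly 3.14e23 FLOPs is an outside assumption (the figure reported in the GPT-3 paper), and the helper name is made up for illustration.

```python
import math

# Figures quoted in the comment above.
agi_sparse_flops = 1e23   # claimed sparse-FLOP budget for one AGI training run
agi_dense_flops = 1e26    # claimed dense-FLOP equivalent

# Assumption, not from the comment: GPT-3 (175B) training compute
# as reported in the GPT-3 paper, about 3.14e23 FLOPs.
gpt3_train_flops = 3.14e23


def orders_of_magnitude(a: float, b: float) -> float:
    """Return how many orders of magnitude `a` is above `b`."""
    return math.log10(a / b)


# Sparse-to-dense conversion factor implied by the two quoted figures.
sparse_to_dense = orders_of_magnitude(agi_dense_flops, agi_sparse_flops)

# Gap between the dense estimate and the assumed GPT-3 training compute.
gap_vs_gpt3 = orders_of_magnitude(agi_dense_flops, gpt3_train_flops)

print(f"sparse -> dense: {sparse_to_dense:.2f} OOM")
print(f"dense estimate vs GPT-3: {gap_vs_gpt3:.2f} OOM")
```

Under that assumed GPT-3 figure, the gap comes out at about 2.5 orders of magnitude, so “3 OOM” reads as a round-up rather than an exact number.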
The hard part is finding the right seed, which will take some number of experimental cycles training (hopefully small) AGI populations. Carmack’s recent estimate that the seed will be only a few tens of thousands of lines of code is a bit lower than my estimate of 100k lines of code, but not by much.
To clarify: by ‘seed’ I mean all the arch design and model code. So take everything you need to easily reproduce GPT3/GATO whatever, including all the underlying CUDA code (mostly written by NVIDIA), but not the trained weights. Anyone who has that and a copy of the text internet on a few SSDs can then train their own GPT3 for only a few million $. AGI will be no different: most of the expense is on all the R&D buildup, not an individual final training run. (Compare the total R&D expenditure of DeepMind, ~$500M/yr, vs the few million for any single training run.)
I’m very strongly in favor of pre-deployment sandbox tests.
I don’t think that sandbox tests are sufficient for safety—I remain very concerned about a scenario where we run the “seed” code in a sandbox environment, and we wind up with an AGI that behaves well, but then we run the same “seed” code with access to the real world, and we wind up with an AGI that behaves dangerously.
Simboxes are not sufficient for safety in the same sense that the ImageNet ‘simbox’ (data and evaluation code) was not sufficient for human-level visual object classification, or the Atari simulator setup was not sufficient for human-level Atari performance: you still need the design which actually does the right thing.
The problem you specifically refer to is the OOD problem which in this case basically amounts to creating a range of sim scenarios that sufficiently cover the near future multiverse. In some sense this is actually easier than many hard science sims because intelligence and alignment are so universal they transcend the physics details (intelligence and alignment are universal across an enormous swath of possible universes even with very different physics). That is very very different than say nuclear weapons sims where nailing the physics precisely can matter a great deal.
The seed codes for AGI built on simple universal principles such as self-supervised learning, learning to recognize agency in the world, maximization of other agents’ empowerment, etc.
The tests then should cover a large variety of scenarios ranging the gamut from simple tests of empathy and altruism all the way up to larger scale complete simulations of future AGI takeover: worlds where numerous agents compete and cooperate to survive in a harsh world, eventually one/few gains a decisive strategic advantage (absolute power), and faces some ultimate altruism decision like whether to sacrifice themselves in exchange for the resurrection and absolute empowerment of all the other agents. For various reasons magic is the most likely useful safe proxy for technology.
I have a half-written longer post on this, but if you are having trouble imagining this range of tests, think of all of the output of the future narrow-AI empowered film/game industries and all of that creativity unleashed on this problem.
Do you put actual humans into the simbox? If no, then isn’t that a pretty big OOD problem? Or if yes, how do you do that safely?
I think I’m skeptical that “learning to recognize agency in the world” and “maximization of other agents’ empowerment” actually exist in the form of “simple universal principles”. For example, when I see a simple animatronic robot, it gives me a visceral impression of agency, but it’s a false impression. Well, hmm, I guess that depends on your definitions. Well anyway, I’ll just say that if an AGI were maximizing the “empowerment” of any simple animatronic robots that it sees, I would declare that this AGI was doing the wrong thing.
It’s fine if you want to just finish your longer post instead of replying here. Either way, looking forward to that! :)
Humans in the simbox—perhaps in the early stages, but not required once it’s running (although human observers have a later role). But that’s mostly tangential.
One of the key ideas here, and perhaps a point of divergence from many other approaches, is that we want agents to robustly learn and optimize for other agents’ values: across a wide variety of agents, situations, and agent value distributions. The idea is to handle OOD by generalizing beyond specific human values. Then once we perfect these architectures and training regimes and are satisfied with their alignment evaluations we can deploy them safely in the real world where they will learn and optimize for our values (safe relative to deploying new humans).
I do have a rough sketch of the essence of the mechanism I think the brain is using for value learning and altruism, and I actually found one of your articles to link to that is related.
I suspect you’d agree that self-supervised prediction is a simple, powerful, and universal learning idea—strongly theoretically justified as in Solomonoff/Bayes and AIXI, etc, and clearly also a key brain mechanism. Generalized empowerment or self-improvement is similar—strongly theoretically justified, and also clearly a key brain mechanism. The former guides learning of the predictive world model, the latter guides learning of the action/planning system. Both are also optimal in a certain sense.
Humans’ tendency to anthropomorphize, empathize with, and act altruistically towards various animals and even hypothetical non-humans is best explained as a side effect of a very general (arguably overly general!) alignment mechanism.
Yay, a proposal to rant about!
Ye fair
In any reasonably sane AGI project the human overseers will have powerful real-time debugging/introspection into the AGI minds, and the AGI should grow up in carefully sealed sandbox sims so they aren’t aware of their true nature (to prevent deception, etc.).
And given that AGI necessitates human language capability, that implies that some chunk of its brain will be devoted to compressing and summarizing the internal state and computations down to a linguistic text stream (because that is what language actually is). So the AGI will likely also have an inner monologue which will naturally be hooked up to powerful database/search/indexing/monitoring tools. We can also go deeper if required and train secondary models to inspect/summarize in arbitrary detail.
That… was wayyyyy too big a jump in reasoning. The ability to use human language externally does not at all imply that the bulk of internal state/computations needs to be summarized to a text stream, nor does it imply that language needs to be produced internally on a continuous basis (as opposed to being produced on the fly, as needed).
If I were going to make a case that AGI will have an internal monologue, I’d say “well humans have an internal monologue, so that’s evidence that an internal monologue is convergent”. But even in humans, people’s internal visualizations, urges and flinches, instinctive valuations, and tons of other stuff are not continuously transcribed into the internal monologue; the monologue captures only a specific narrow slice of what the mind is doing. It’s not a broad compression of all the internal activity. And humans can totally just turn off their inner monologue, and keep going about their day functioning just fine.
Summarizable rather than summarized, but this is in fact just how the brain works (or most brains) and certainly is a valid path to AGI. So the key question is not “is inner monologue essential?” (which you seem to be arguing against) but rather “does inner monologue have much of a performance disadvantage?”. I’d roughly estimate that turning off the language production centers in the brain would save about 10%.
Text/speech is an attention-guided summarization/compression, so naturally it doesn’t capture everything, but that’s the point! It is the highest-level abstract summary, and I mentioned we can additionally train aux models to zoom in on any timeslice of thought, to the extent that turns out to be actually useful relative to the compute/storage/bandwidth costs.
What? Evidence for this? I can think of only meditation, but people aren’t doing any important intellectual work while meditating. Regardless, even if some humans can turn off their inner monologue, that would only be relevant if that ability were necessary for higher intelligence (which it almost certainly is not).
I turn off my internal monologue intentionally sometimes, it’s not particularly difficult (try it!). Also it does seem to improve performance in some ways—turns out the transcription to language is slow, my thoughts zoom ahead faster when they’re not waiting around for the monologue to catch up. IIRC people in flow-state usually don’t have a monologue going, which would be additional evidence of the performance cost in humans assuming I’m remembering correctly.
Empirically, people seem to have an enormous amount of difficulty expressing many parts of their internal cognition verbally. So no, a lot of it doesn’t even seem to be summarizable, at least given a typical human’s facility with natural language.
Note that inner monologue can be non-verbal—mine often contains abstract pictorial/visual components. For deaf/mute people it likely develops visually.
I guess I may be unusual in this regard then. But naturally the type of brain-like AGI I tend to think of is likely biased towards my own particular cognition and self-understanding thereof—and perhaps this is also true for you.
Regardless: even if 20% of humanity had no inner monologue or 50%, or 99%, it simply wouldn’t matter at all to my larger point: I need only a few examples of high functioning intelligent humans with inner monologues to demonstrate that having an inner monologue is clearly not an efficiency disadvantage! This is a low cost safety feature.
Finally, I’ll conclude with that Helen Keller quote:
“Before my teacher came to me, I did not know that I am. I lived in a world that was a no-world. I cannot hope to describe adequately that unconscious, yet conscious time of nothingness. I did not know that I knew aught, or that I lived or acted or desired. I had neither will nor intellect. I was carried along to objects and acts by a certain blind natural impetus. I had a mind which caused me to feel anger, satisfaction, desire… ”
That argument I buy.
I would like to point out that what johnswentworth said about being able to turn off an internal monologue is completely true for me as well. My internal monologue turns itself on and off several (possibly many) times a day when I don’t control it, and it is also quite easy to tell it which way to go on that. I don’t seem to be particularly more or less capable with it on or off, except on a very limited number of tasks. Simple tasks are easier without it, while explicit reasoning and storytelling are easier with it. I think my default is off when I’m not worried (but I do an awful lot of intentional verbal daydreaming and reasoning about how I’m thinking too).
If DeepMind raises their AGI in a carefully sealed sandbox sim, where it doesn’t know that it’s in a sandbox or that humans exist, there’s still a problem that next year Meta or whatever will raise an AGI without a sandbox. Have you thought about what to do about that problem? Is there some significant task that DeepMind can do with its sandboxed AGI (which doesn’t know about humans and the real world) that prevents Meta from making the unsandboxed AGI afterwards, or that prevents this latter AGI from causing harm? If so, what? Or are you imagining that DeepMind eventually lets the AGI out of the sandbox? If so, how does the sandbox help? Sorry if you already answered this in a blog post a decade ago, I haven’t read those in a while. :)
Building something of significance requires design/develop/test/evaluate iterations. A sandbox sim is just what that looks like for safe AGI. Simboxing is not the design for aligned AGI itself, rather it’s the process/setup that allows teams to build/test/evaluate without risking the world.
The tech infra for sandbox sims is similar to advanced game tech anyway and can be shared. Numerous AGI teams can compete safely on alignment benchmarks in simboxes while closely guarding their design secrets/code, running their AGI securely, etc.
Imagine an alternate Earth-like world that was so fully inhabited that the only two options for testing nuclear weapons were 1.) blowing up inhabited cities or perhaps the entire world, or 2.) running big computer sims. Only one of these options is sane.
So with that background in mind—if DM develops aligned AGI first, then hopefully Meta no longer matters, as DM’s capability gap will quickly expand.
The hard part is not training AGI once you have the seed (as we know that only requires about 1e23 sparse flops and probably less, which is roughly equivalent to perhaps 1e26 dense flops—so only 3 OOM beyond GPT3). And a single training run gives you unlimited potential clones of the same AGI mind.
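The orders-of-magnitude comparison above can be checked with a few lines of arithmetic. This is only a rough sketch: the 1e23 and 1e26 figures are the estimates quoted in the comment, and the GPT-3 training compute of about 3.14e23 FLOPs is an assumption taken from the commonly cited public estimate, not something stated in this thread.

```python
import math

# Estimates quoted in the comment above (the commenter's figures, not established facts).
AGI_SPARSE_FLOPS = 1e23
AGI_DENSE_FLOPS = 1e26

# Assumption: commonly cited public estimate of GPT-3's total training compute.
GPT3_TRAIN_FLOPS = 3.14e23


def orders_of_magnitude(a: float, b: float) -> float:
    """Return how many powers of ten separate a from b."""
    return math.log10(a / b)


# Gap implied between the sparse and dense figures.
sparsity_gap = orders_of_magnitude(AGI_DENSE_FLOPS, AGI_SPARSE_FLOPS)

# Gap between the dense figure and the assumed GPT-3 training compute.
gpt3_gap = orders_of_magnitude(AGI_DENSE_FLOPS, GPT3_TRAIN_FLOPS)

print(f"dense vs sparse estimate: {sparsity_gap:.1f} OOM")
print(f"dense estimate vs GPT-3:  {gpt3_gap:.1f} OOM")
```

Under that assumed GPT-3 figure the dense estimate lands about 2.5 orders of magnitude above GPT-3, which rounds to the "3 OOM" in the comment; a different baseline would shift the gap accordingly.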
The hard part is finding the right seed, which will take some number of experimental cycles training (hopefully small) AGI populations. Carmack’s recent estimate that the seed will be only a few tens of thousands of lines of code is a bit lower than my estimate of 100k lines of code, but not by much.
To clarify: by ‘seed’ I mean all the arch design and model code. So take everything you need to easily reproduce GPT3/GATO/whatever, including all the underlying CUDA code (mostly written by NVIDIA), but not the trained weights. Anyone who has that and a copy of the text internet on a few SSDs can then train their own GPT3 for only a few million $. AGI will be no different: most of the expense is in all the R&D buildup, not an individual final training run. (Compare DeepMind’s total R&D expenditure of ~$500M/yr with the few million $ that any single training run costs.)
Gotcha.
I’m very strongly in favor of pre-deployment sandbox tests.
I don’t think that sandbox tests are sufficient for safety—I remain very concerned about a scenario where we run the “seed” code in a sandbox environment, and we wind up with an AGI that behaves well, but then we run the same “seed” code with access to the real world, and we wind up with an AGI that behaves dangerously.
Simboxes are not sufficient for safety in the same sense that the ImageNet ‘simbox’ (data and evaluation code) was not sufficient for human-level visual object classification or the Atari simulator setup was not sufficient for human-level Atari performance: you still need the design which actually does the right thing.
The problem you specifically refer to is the OOD problem which in this case basically amounts to creating a range of sim scenarios that sufficiently cover the near future multiverse. In some sense this is actually easier than many hard science sims because intelligence and alignment are so universal they transcend the physics details (intelligence and alignment are universal across an enormous swath of possible universes even with very different physics). That is very very different than say nuclear weapons sims where nailing the physics precisely can matter a great deal.
The seed codes for an AGI built on simple universal principles such as self-supervised learning, learning to recognize agency in the world, maximization of other agents’ empowerment, etc.
The tests then should cover a large variety of scenarios running the gamut from simple tests of empathy and altruism all the way up to larger-scale complete simulations of future AGI takeover: worlds where numerous agents compete and cooperate to survive in a harsh world, eventually one or a few gain a decisive strategic advantage (absolute power), and face some ultimate altruism decision like whether to sacrifice themselves in exchange for the resurrection and absolute empowerment of all the other agents. For various reasons magic is the most likely useful safe proxy for technology.
I have a half-written longer post on this, but if you are having trouble imagining this range of tests, think of all of the output of the future narrow-AI-empowered film/game industries and all of that creativity unleashed on this problem.
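To make "empowerment" slightly more concrete, here is a minimal sketch that assumes the commenter means the usual information-theoretic notion: the channel capacity between an agent's action sequences and its resulting states. For deterministic dynamics that reduces to the log of the number of distinct states reachable within a horizon. The grid world, the action set, and every name below are hypothetical illustrations, not anything specified in this thread.

```python
import math
from itertools import product

# Hypothetical toy world: a 5x5 grid with five deterministic actions.
SIZE = 5
ACTIONS = {
    "up": (0, -1),
    "down": (0, 1),
    "left": (-1, 0),
    "right": (1, 0),
    "stay": (0, 0),
}


def step(state, action, walls=frozenset()):
    """Move one cell; bumping into the boundary or a wall leaves the state unchanged."""
    x, y = state
    dx, dy = ACTIONS[action]
    nxt = (x + dx, y + dy)
    in_bounds = 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE
    if not in_bounds or nxt in walls:
        return state
    return nxt


def empowerment(state, horizon, walls=frozenset()):
    """n-step empowerment in bits, valid for deterministic dynamics only:
    log2 of the number of distinct end states over all action sequences."""
    reachable = set()
    for plan in product(ACTIONS, repeat=horizon):
        s = state
        for action in plan:
            s = step(s, action, walls)
        reachable.add(s)
    return math.log2(len(reachable))


center = empowerment((2, 2), horizon=2)
corner = empowerment((0, 0), horizon=2)
print(f"center: {center:.2f} bits, corner: {corner:.2f} bits")
```

An agent in the open center has more reachable futures than one stuck in a corner, so its empowerment is higher. On this reading, "maximizing other agents' empowerment" would mean preferring actions that raise this quantity for another agent, which presupposes having already identified which things in the world count as agents.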
Do you put actual humans into the simbox? If no, then isn’t that a pretty big OOD problem? Or if yes, how do you do that safely?
I think I’m skeptical that “learning to recognize agency in the world” and “maximization of other agents’ empowerment” actually exist in the form of “simple universal principles”. For example, when I see a simple animatronic robot, it gives me a visceral impression of agency, but it’s a false impression. Well, hmm, I guess that depends on your definitions. Well anyway, I’ll just say that if an AGI were maximizing the “empowerment” of any simple animatronic robots that it sees, I would declare that this AGI was doing the wrong thing.
It’s fine if you want to just finish your longer post instead of replying here. Either way, looking forward to that! :)
Humans in the simbox—perhaps in the early stages, but not required once it’s running (although human observers have a later role). But that’s mostly tangential.
One of the key ideas here (and perhaps one that diverges from many other approaches) is that we want agents to robustly learn and optimize for other agents’ values: across a wide variety of agents, situations, and agent value distributions. The idea is to handle OOD by generalizing beyond specific human values. Then once we perfect these architectures and training regimes and are satisfied with their alignment evaluations we can deploy them safely in the real world where they will learn and optimize for our values (safe relative to deploying new humans).
I do have a rough sketch of the essence of the mechanism I think the brain is using for value learning and altruism, and I actually found one of your articles to link to that is related.
I suspect you’d agree that self-supervised prediction is a simple, powerful, and universal learning idea—strongly theoretically justified as in Solomonoff/Bayes and AIXI, etc, and clearly also a key brain mechanism. Generalized empowerment or self-improvement is similar—strongly theoretically justified, and also clearly a key brain mechanism. The former guides learning of the predictive world model, the latter guides learning of the action/planning system. Both are also optimal in a certain sense.
Humans’ tendency to anthropomorphize, empathize with, and act altruistically towards various animals and even hypothetical non-humans is best explained as a side effect of a very general (arguably overly general!) alignment mechanism.