On the one hand, sure. I think LLMs are basically safe. As long as you keep the current training setup, you can scale them up 1000x and they’re not gonna grow agency or end the world.
LLMs are simulators. They are normally trained to simulate humans (and fictional characters, and groups of humans cooperating to write something), though DeepMind has trained them to instead simulate weather patterns. Humans are not well aligned to other humans: Joseph Stalin was not well aligned to the citizenry of Russia, and as you correctly note, a very smart manipulative human can be a very dangerous thing.

LLM base models do not generally simulate the same human each time; they simulate a context-dependent distribution of human behaviors. However, as RLHF-instruct-trained LLMs show, they can be prompted and/or fine-tuned to mostly simulate rather similar humans (normally helpful/honest/harmless assistants, or at least something that human raters score highly as such). LLMs also don’t simulate humans with IQs > ~180, since those are outside their training distribution. However, once we get a sufficiently large LLM that has the capacity to do that well, there is going to be a huge financial incentive to figure out how to get it to extrapolate outside its training distribution and consistently simulate very smart humans with IQ 200+, and it’s fairly obvious how one might do this. At that point, you have something whose behavior is consistent enough to count as “sort of a single agent/homunculus”, capable enough to be very dangerous unless well aligned, and smart enough that telling the difference between real alignment and deceptive alignment is likely to be hard, at least just from observing behavior.
IMO there are two main challenges to aligning ASI:

1) figuring out how to align a simulated superintelligent human-like mind, given that you have direct access to their neural net, can filter their experiences, can read their translucent thoughts, and can do extensive training on them (while remembering that they are human-like in their trained behavior, but not in the underlying architecture, just as DeepMind’s weather simulations are NOT a detailed 3D model of the atmosphere);

2) thinking very carefully about how you build your ASI, to ensure that you don’t accidentally build something weirder, more alien, or harder to align than a simulated human-like mind.

I agree with the article that failing 2) is a plausible failure mode if you’re not being careful, but I don’t think 1) is trivial either, though I do think it might be tractable.
The LLM training loop shapes the ML models to be approximate simulators of the target distribution, yes. “Approximate” is the key word here.
I don’t think the LLM training loop, even scaled very far, is going to produce a model that’s actually generally intelligent, i.e. one that has inferred the algorithms that implement human general intelligence and has looped them into its own cognition. So no matter how you try to get it to simulate a genius-level human, it’s not going to produce genius-level human performance. Not in the ways that matter.
Particularly clever CoT-style setups may be able to do that, which I acknowledge in the post by saying that slightly-tweaked scaffolded LLMs may not be as safe as just LLMs. But I also expect that sort of setup to be prohibitively compute-expensive, such that we’ll get to AGI by architectural advances before we have enough compute to make them work. I’m not strongly confident on this point, however.
Oh, you don’t need to convince me of that.

On pure LLM-simulated humans, I’m not sure either way. I wouldn’t be astonished if a sufficiently large LLM trained on a sufficiently large amount of data could actually simulate IQ ~100–120 humans well enough that having a large supply of fast, cheap, promptable simulations was Transformative AI. But I also wouldn’t be astonished if we found that this was primarily good for an approximation of human System 1 thinking, and that to do a good job of simulating human System 2 thinking over significant periods it was either necessary, or at least a lot cheaper, to supply the needed cognitive abilities via scaffolding (that rather depends on how future LLMs act at very long context lengths, and on whether we can fix a few of their architecturally-induced blind spots, which I’m optimistic about but is unproven). And I completely agree that the alignment properties of a base-model LLM, an RLHF-trained LLM, a scaffolded LLM, and other yet-to-be-invented variants are not automatically the same, and we do need people working on them to think about this quite carefully. I’m just not convinced that even the base model is safe, if it can become an AGI by simulating a very smart human when sufficiently large and sufficiently prompted.
While scaffolding provides additional complexities to alignment, it also provides additional avenues for alignment: now their thoughts are translucent and we can audit and edit their long-term memories.
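(A minimal sketch of the kind of thing I mean by auditable, editable long-term memory in a scaffold; this is illustrative only, and the names `MemoryStore`, `audit_log`, etc. are hypothetical rather than any existing framework:)

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemoryEntry:
    """One long-term memory the scaffolded agent has written down."""
    text: str
    created_at: str

@dataclass
class MemoryStore:
    """Hypothetical scaffold memory: every write and edit is visible to overseers."""
    entries: list[MemoryEntry] = field(default_factory=list)
    audit_log: list[str] = field(default_factory=list)

    def write(self, text: str) -> int:
        self.entries.append(MemoryEntry(text, datetime.now(timezone.utc).isoformat()))
        self.audit_log.append(f"WRITE: {text!r}")
        return len(self.entries) - 1

    def edit(self, index: int, new_text: str, reason: str) -> None:
        # Overseer-side correction: the change and its reason are recorded, not hidden.
        self.audit_log.append(f"EDIT {index}: {self.entries[index].text!r} -> {new_text!r} ({reason})")
        self.entries[index].text = new_text

# Unlike weights or activations, this part of the agent's 'mind' is plain text
# that humans (or a monitor model) can read, search, and correct.
store = MemoryStore()
i = store.write("Plan: summarise the three most relevant alignment papers.")
store.edit(i, "Plan: summarise the three most relevant alignment papers (done).", reason="status update")
print(store.audit_log)
```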
I had noticed you weren’t making that mistake; but I have seen other people on Less Wrong somehow assume that humans must be aligned to other humans (I assume because they understand human values?). Sadly, that’s just not the case: if it were, we wouldn’t need locks or law enforcement, and we would already have UBI. So I thought it was worth including those steps in my argument, for other readers who might benefit from me belaboring the point.
to do a good job of simulating human System 2 thinking over significant periods it was either necessary, or at least a lot cheaper, to supply the needed cognitive abilities via scaffolding
I agree that sufficiently clever scaffolding could likely supply this. But:
- I expect that figuring out what this scaffolding is will be a hard scientific challenge, such that by default, on the current paradigm, we’ll get to AGI by atheoretic tinkering with architectures rather than by figuring out how intelligence actually works and manually implementing that. (Hint: clearly it’s not as simple as the most blatantly obvious AutoGPT setup.)
- If we get there by figuring out the scaffolding, that’d actually be a step towards a more alignable AGI, in the sense of us getting some idea of how to aim its cognition. Nowhere near sufficient for alignment and robust aimability, but a step in the right direction.
All valid points. (Though people are starting to get quite good results out of agentic scaffolds, for short chains of thought, so it’s not that hard, and the primary issue seems to be that existing LLMs just aren’t consistent enough in their behavior to be able to keep it going for long.)
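(For concreteness, the ‘blatantly obvious’ loop in question is roughly the sketch below. This is purely illustrative: `call_llm` and `run_tool` are hypothetical stand-ins, and the practical difficulty is exactly that the model’s behavior has to stay coherent as the history grows over many steps.)

```python
# A deliberately minimal AutoGPT-style agent loop (illustrative sketch, not a real framework).

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for querying the underlying model.
    return "THOUGHT: nothing left to do. ACTION: finish"

def run_tool(action: str) -> str:
    # Hypothetical stand-in for dispatching to search / code execution / etc.
    return f"(observation for {action!r})"

def agent_loop(goal: str, max_steps: int = 10) -> list[str]:
    history = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        # Each step re-prompts the model with the whole trajectory so far;
        # staying consistent over hundreds of such steps is the hard part.
        response = call_llm("\n".join(history) + "\nWhat do you do next?")
        history.append(response)
        if "ACTION: finish" in response:
            break
        history.append(run_tool(response))
    return history

print(agent_loop("Summarise the current state of the scaffolding discussion"))
```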
On your second bullet: personally I want to build a scaffolding suitable for an AGI-that-is-a-STEM-researcher in which the long-term approximate-Bayesian reasoning on theses is something like explicit mathematical symbol manipulation and/or programmed calculation and/or tool-AI (so a blend of LLM with AIXI-like GOFAI) — since I think we could then safely point it at Value Learning or AI-assisted Alignment and get a system with a basin of attraction converging from partial alignment to increasingly-accurate alignment (that’s basically my current SuperAlignment plan). But then, for a sufficiently large transformer model, in-context learning is already approximately Bayesian, so we’d be duplicating an existing mechanism, much as RAG duplicates long-term memory when the LLM already has in-context memory. I’m wondering if we could get an LLM sufficiently well-calibrated that we could just use its logits (on a carefully selected token) as a currency of exchange with the long-term approximate-Bayesian calculation: “I have weighed all the evidence and it has shifted my confidence in the thesis…” [now compare logits of ‘up’ vs ‘down’, or do a trained linear probe calibrated in logits, or something].
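(A minimal sketch of the logits-as-confidence idea, with made-up numbers and a hypothetical `token_logits` helper standing in for however you actually read next-token logits out of the model; the ‘up’/‘down’ convention is just my illustrative choice:)

```python
import math

def token_logits(prompt: str, candidates: list[str]) -> dict[str, float]:
    # Hypothetical stand-in for "ask the model for the logits of these next tokens".
    return dict(zip(candidates, [1.3, 0.2]))  # fake numbers for illustration

def confidence_shift_bits(prompt: str) -> float:
    # The prompt ends with something like:
    #   "I have weighed all the evidence and it has shifted my confidence in the thesis"
    # and we read off whether the model wants to continue with 'up' or 'down'.
    logits = token_logits(prompt, ["up", "down"])
    # For two candidate tokens, the difference of their logits is the log-odds
    # between them (in nats); divide by ln 2 to get bits of evidence.
    return (logits["up"] - logits["down"]) / math.log(2)

def update_thesis(prior_prob: float, shift_bits: float) -> float:
    # Treat the (hopefully calibrated!) shift as evidence in an explicit Bayesian update.
    prior_logodds = math.log2(prior_prob / (1 - prior_prob))
    posterior_logodds = prior_logodds + shift_bits
    return 1 / (1 + 2 ** -posterior_logodds)

shift = confidence_shift_bits("...evidence summary... my confidence in the thesis has shifted")
print(update_thesis(0.5, shift))  # thesis probability after this piece of evidence
```

Whether this works at all of course hinges on the calibration question above.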
Generative and predictive models can be substantially different: there are finite generative models such that the optimal predictive model is infinite. See this paper for more.

An LLM can be strongly super-human in its ability to predict the next token (that some distribution over humans with IQ < 100 would write) even if it was trained only on the written outputs of humans with IQ < 100.
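(To give a toy numerical illustration of that generative-vs-predictive gap, this is my own sketch rather than anything taken from the paper: a two-state hidden Markov generator whose optimal predictor has to track a Bayesian belief that never settles into finitely many values.)

```python
import numpy as np

# Two-state hidden Markov generator (a variant of the classic "nonunifilar" examples):
#   state A: emit 0 and stay in A (p = 1/2), or emit 1 and move to B (p = 1/2)
#   state B: emit 1 and stay in B (p = 1/2), or emit 1 and move back to A (p = 1/2)
# T[s][i, j] = P(emit symbol s and move from state i to state j)
T = {
    0: np.array([[0.5, 0.0],
                 [0.0, 0.0]]),
    1: np.array([[0.0, 0.5],
                 [0.5, 0.5]]),
}

def belief_after(ones_seen: int) -> np.ndarray:
    """Belief over the hidden state after a 0 (which pins us to A) and then `ones_seen` 1s."""
    b = np.array([1.0, 0.0])
    for _ in range(ones_seen):
        b = b @ T[1]
        b = b / b.sum()  # renormalise into a probability distribution
    return b

def p_next_zero(ones_seen: int) -> float:
    return float((belief_after(ones_seen) @ T[0]).sum())

# Every run-length of 1s gives a *different* optimal prediction, so even though the
# generator has only two states, an optimal predictor needs unboundedly many belief states.
for k in range(1, 8):
    print(f"after a 0 and then {k} ones: P(next = 0) = {p_next_zero(k):.4f}")
```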
More generally, the cognitive architecture of an LLM is very different from that of a person, and it is a mistake IMO to believe we can use our knowledge of human behavior to reason about an LLM.
If you doubt that transformer models are simulators, why was DeepMind so successful in using them for predicting weather patterns? Why have they been so successful for many other sequence prediction tasks? I suggest you read up on some of the posts under Simulator Theory, which explain this better and at more length than I can in this comment thread.
On them being superhuman at predicting tokens — yes, absolutely. What’s your point? The capabilities of the agents simulated are capped by the computational capacity of the simulator, but not vice-versa. If you take the architecture and computational power needed to run GPT-10 and use it to train a base model only on (enough) text from humans with IQ < 80, then the result will do an amazingly, superhumanly accurate job of simulating the token-generation behavior of humans with IQ < 80.
The cognitive architecture of an LLM is very different from that of a person, and it is a mistake IMO to believe we can use our knowledge of human behavior to reason about an LLM.
If you want to reason about a transformer model, you should be using learning theory, SLT (singular learning theory), compression, and so forth. However, what those tell us is basically that (within the limits of their capacity and training data) transformers run good simulations. So if you train them to simulate humans, then (to the extent that the simulation is accurate) human psychology applies, and thus things like EmotionPrompts work. So LLM-simulated humans make human-like mistakes when they’re being correctly simulated, plus also very un-human-like (to us, dumb-looking) mistakes when the simulation is inaccurate.
So our knowledge of human behavior is useful, but I agree not sufficient, for reasoning about an LLM running a simulation of a human.