On pure LLM-simulated humans, I’m not sure either way. I wouldn’t be astonished if a sufficiently large LLM trained on a sufficiently large amount of data could actually simulate IQ ~100–120 humans well enough that having a large supply of fast, cheap, promptable simulations was Transformative AI. But I also wouldn’t be astonished if we found that this was primarily good for an approximation of human System 1 thinking, and that, to do a good job of simulating human System 2 thinking over significant periods, it was either necessary, or at least a lot cheaper, to supply the needed cognitive abilities via scaffolding (it rather depends on how future LLMs act at very long context lengths, and on whether we can fix a few of their architecturally-induced blind spots, which I’m optimistic about but is unproven). And I completely agree that the alignment properties of a base model LLM, an RLHF-trained LLM, a scaffolded LLM, and other yet-to-be-invented variants are not automatically the same, and we do need people working on them to think about this quite carefully. I’m just not convinced that even the base model is safe, if it can become an AGI by simulating a very smart human when sufficiently large and sufficiently prompted.
While scaffolding provides additional complexities to alignment, it also provides additional avenues for alignment: now their thoughts are translucent and we can audit and edit their long-term memories.
I had noticed you weren’t making that mistake; but I have seen other people on Less Wrong somehow assume that humans must be aligned to other humans (I assume because they understand human values?). Sadly that’s just not the case: if it were, we wouldn’t need locks or law enforcement, and we’d already have UBI. So I thought it was worth including those steps in my argument, for other readers who might benefit from me belaboring the point.
to do a good job of simulating human System 2 thinking over significant periods, it was either necessary, or at least a lot cheaper, to supply the needed cognitive abilities via scaffolding
I agree that sufficiently clever scaffolding could likely supply this. But:
I expect that figuring out what this scaffolding should be is a hard scientific challenge, such that by default, on the current paradigm, we’ll get to AGI by atheoretic tinkering with architectures rather than by figuring out how intelligence actually works and manually implementing that. (Hint: it’s clearly not as simple as the most blatantly obvious AutoGPT setup.)
If we get there by figuring out the scaffolding, that’d actually be a step towards a more alignable AGI, in the sense of us getting some idea of how to aim its cognition. Nowhere near sufficient for alignment and robust aimability, but a step in the right direction.
All valid points. (Though people are starting to get quite good results out of agentic scaffolds for short chains of thought, so it’s not that hard; the primary issue seems to be that existing LLMs just aren’t consistent enough in their behavior to keep it going for long.)
On your second bullet: personally I want to build scaffolding suitable for an AGI-that-is-a-STEM-researcher in which the long-term approximate-Bayesian reasoning on theses is done by something like explicit mathematical symbol manipulation and/or programmed calculation and/or tool-AI (so a blend of LLM with AIXI-like GOFAI) — since I think we could then safely point it at Value Learning or AI-assisted Alignment and get a system with a basin of attraction converging from partial alignment to increasingly-accurate alignment (that’s basically my current SuperAlignment plan). But then, for a sufficiently large transformer model, in-context learning is already approximately Bayesian, so we’d be duplicating an existing mechanism, much as RAG duplicates long-term memory when the LLM already has in-context memory. I’m wondering if we could get an LLM sufficiently well-calibrated that we could just use its logits (on a carefully selected token) as a currency of exchange into the long-term approximate-Bayesian calculation: “I have weighed all the evidence and it has shifted my confidence in the thesis… [now compare the logits of ‘up’ vs. ‘down’, or use a trained linear probe calibrated in logits, or something].”
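As a very rough sketch of that logit-as-evidence idea: treat the difference between the ‘up’ and ‘down’ token logits as a (calibration-scaled) log-odds shift, and apply it to the running log-odds of the thesis. All numeric values and the calibration scale below are hypothetical placeholders; a real system would need an actual calibration step (e.g. a fitted linear probe or temperature scaling) before the logits could be trusted as log-odds.

```python
import math

def bayes_update(prior: float, log_odds_shift: float) -> float:
    """Shift the prior's log-odds by the evidence's log-odds shift
    and return the posterior probability of the thesis."""
    prior_log_odds = math.log(prior / (1.0 - prior))
    posterior_log_odds = prior_log_odds + log_odds_shift
    return 1.0 / (1.0 + math.exp(-posterior_log_odds))

def logit_evidence(logit_up: float, logit_down: float, scale: float = 1.0) -> float:
    """Interpret (logit_up - logit_down) at the chosen token position as a
    log-odds shift in favor of the thesis; 'scale' stands in for whatever
    calibration the model actually needs."""
    return scale * (logit_up - logit_down)

# Hypothetical logits read off the model at the 'up'/'down' token position.
shift = logit_evidence(logit_up=1.2, logit_down=0.4)
posterior = bayes_update(prior=0.5, log_odds_shift=shift)
```

The point of keeping the update in explicit log-odds form is that the long-term Bayesian bookkeeping then lives outside the LLM, where it can be audited, while the LLM only has to report one well-calibrated comparison per piece of evidence.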
Generative and predictive models can be substantially different.
there are finite generative models such that the optimal predictive model is infinite.
See this paper for more.