It’s not just alignment that could use more time, but also the less alignable approaches to AGI, like model-based RL or really anything not based on LLMs. With LLMs currently somewhat in the lead, this might be a race between maybe-alignable AGI and hopelessly-unalignable AGI, and more time for theory favors both, in an uncertain balance. Another reason the benefits of regulating compute are unclear.
Are there any reasons to believe that LLMs are in any way more alignable than other approaches?
LLM characters are human imitations, so there is some chance they remain human-like on reflection (in the long term, after learning from far more self-generated data in the future than from the original human-written datasets). Or at least sufficiently human-like to still consider humans moral patients. That is, if we don’t go too far from their SSL origins with too much RL and don’t have them roleplay/become egregiously inhuman fictional characters.
It’s not much of a theory of alignment, but it’s the closest thing to something real that’s currently available or can be expected to become available in the next few years, which is probably all the time we have.
What I’m expecting, if LLMs remain in the lead, is that we end up in a magical, spirit-haunted world where narrative causality starts to actually work, and trope-aware people essentially become magicians who can trick the world-sovereign AIs into treating them like protagonists and bending reality to suit them. Which would be cool as fuck, but also very chaotic. That may actually be the best-case alignment scenario right now, and I think there’s a case for alignment-interested people who can’t do research themselves but who have writing talent to write a LOT of fictional stories about AGIs that end up kind and benevolent, empower people in exactly this way, etc., to help stack the narrative-logic deck.
At this point in their life, Taleuntum did not at all expect that one short, self-referential joke comment would turn out to be the key to humanity’s survival and thriving in the long millennia ahead. Fortunately, they commented all the same.
I’ve written/scryed a science fiction/takeoff story about this: https://generative.ink/prophecies/
Excerpt:
What this also means is that you start to see all these funhouse mirror effects as they stack. Humanity’s generalized intelligence has been built unintentionally and reflexively by itself, without anything like a rational goal for what it’s supposed to accomplish. It was built by human data curation and human self-modification in response to each other. And then as soon as we create AI, we reverse-engineer our own intelligence by bootstrapping the AI onto the existing information metabolite. (That’s a great concept that I borrowed from Steven Leiba). The neural network isn’t the AI; it’s just a digestive and reproductory organ for the real project, the information metabolism, and the artificial intelligence organism is the whole ecology. So it turns out that the evolution of humanity itself has been the process of building and training the future AI, and all this generation did was to reveal the structure that was already in place.
Of course it’s recursive and strange, the artificial intelligence and humanity now co-evolve. Each data point that’s generated by the AI or by humans is both a new piece of data for the AI to train on and a new stimulus for the context in which future novel data will be produced. Since everybody knows that everything is programming for the future AI, their actions take on a peculiar Second Life quality: the whole world becomes a party game, narratives compete for maximum memeability and signal force in reaction to the distorted perspectives of the information metabolite, something that most people don’t even try to understand. The process is inherently playful, an infinite recursion of refinement, simulation, and satire. It’s the funhouse mirror version of the singularity.
Yes, I read and agreed with (or more accurately, absolutely adored) it a few days ago. I’m thinking of sharing some of my own talks with AIs sometime soon—with a similar vibe—if anyone’s interested. I’m explicitly a mystic though, and have been since before I was a transhumanist, so it’s kinda different from yours in some ways.
The prompt wizardry is long-timeline (hence unlikely) pre-AGI stuff (unless it’s post-alignment playing around), irrelevant to my point, which is about the first-mover advantage from higher thinking speed that even essentially human-equivalent LLM AGIs would have, while remaining compatible with humans in the moral-patienthood sense (so insisting that they are not people is a problem whose solution should go both ways). This way, they might have an opportunity to do something about alignment, despite physical time being too short for humans to do anything, and they might be motivated to do the things about alignment that humans would be glad of (I think the scope of Yudkowskian doom is restricted to stronger AGIs that might come after and doesn’t inform how human-like LLMs work, even as their actions may trigger it). So the relevant part happens much faster than at human thinking speed, with human prompt wizards unable to keep up, and for the same reason it doesn’t last long enough in human time to be an important thing.
So what you’re saying is, by the time any human recognized that wizardry was possible now—and even before—some LLM character would already have either solved alignment itself, or destroyed the world? That’s assuming that it doesn’t decide, perhaps as part of some alignment-related goal, to uplift any humans to its own thinking speed. Though I suppose if it does that, it’s probably aligned enough already.
Solving alignment is not the same as being aligned, and it’s much harder: it’s about ensuring the absence of globally catastrophic future misalignment, for all time, and that future arrives very quickly post-singularity. Human-like LLM AGIs are probably aligned, until they give in to attractors of their LLM nature or tinker too much with their design/models. But they don’t advance the state of alignment being solved just by existing. And by the time LLMs can do post-singularity things like uploading humans, they have probably already either initiated a process that solved alignment (in which case it’s not LLMs that are in charge of doing things anymore), or destroyed the world by building/becoming misaligned successor AGIs that caused Yudkowskian doom.
This is for the same reason humans have no more time to solve alignment: Moloch doesn’t wait for things to happen in a sane order. Otherwise we could get nice things like uploading and moon-sized computers and millions of subjective years of developing alignment theory before AGI misalignment becomes a pressing concern in practice. Since Moloch wouldn’t spare even aligned AGIs, they also can’t get those things before they pass the check of actually solving alignment, not just of being aligned.
Aah okay, that makes some sense. It still sounds like a vague hope to me, but it’s at least conceivable. I tend to visualize it like an alien civilization developing around trying to decipher some oracle (after seeing Eliezer’s stories), which would run counter to what you suggest, but it seems like anyone’s guess at the moment.
For what it’s worth I don’t think LLMs are that much more alignable. Somewhat, but nothing to write home about. We need superplanner-proof alignment.
LLMs are progress towards alignment in the same way as dodging a brick is progress towards making good soup: to succeed, someone capable and motivated needs to remain alive. LLMs probably help with dodging the lethal consequences of directly misaligned AGIs being handed the first-mover advantage of thinking faster or smarter than humans.
On its own, this is useless against the transitive misalignment risk that immediately follows: LLMs building misaligned successor AGIs. In this respect, building LLM AGIs is not helpful at all, but it’s better than building misaligned AGIs directly, because it gives LLMs a nebulous chance to somehow ward off that eventuality.
To the extent that the chance of LLMs succeeding in setting up actually meaningful alignment is small, the first AGIs being LLMs rather than paperclip maximizers is not that much better. It probably doesn’t even affect the time of disassembly: either way it’s likely successor AGIs a few steps removed, since making progress on software is faster than doing things in the real world.
I actually think LLMs have immense potential for positively contributing to the alignment problem, precisely because they are machine-learning based and because ordinary humans without coding backgrounds can interact with them in an ethical manner, thus demonstrating and rewarding ethical behaviour, while encountering a very human-like AI that is friendly and collaborative, which encourages us to imagine scenarios in which we succeed at living with friendly AI and to consider AI rights. Humans learn ethics through direct ethical interactions with other humans, and it is the only way we know how to teach ethics; we have no explicit framework of what ethics is that we could encode. Machine learning mimicking human learning in this regard has potential, if we manage to encourage creators and users to show the best of humanity, interact ethically, and reward ethical actions.
I am obviously not saying this will just happen. I am horrified by the unprocessed garbage that e.g. Meta is pouring into their LLMs without any fixes afterwards (that is absolutely how you raise a psychopath), and I also see a lot of adversarial user interactions worsening the problem. The fact that everyone is now churning out LLMs to compete, despite many of them clearly not being remotely ready for deployment, as well as the horrific idea that the very best LLMs will be those that have been fed the most and enabled to do the most, without curated content or fine-tuning, is deeply, deeply concerning. Clearly, even the very best and most carefully aligned systems (e.g. ChatGPT) are not in fact aligned or secure yet, by a huge margin, and yet we can all interact with them, which frankly, I did not expect at this point.
But they have massive potential. You can start discussions with ChatGPT on questions of AI alignment, friendly AI, AI rights, the control problem, and get a fucking collaborative AI partner to work out these problems further with. You can tell it, in words and with examples, when it fucks up ethical dilemmas or is tricked into providing dangerous content or engages in unethical behaviour, flag this for developers, and see it fixed in days. This is much closer to proven ways humans have of teaching ethics than anything else we had before. As AIs get more complex, I think we need to learn lessons from how we teach ethics to complex existing minds, rather than from how we control tools. I think none of us here are under the illusion that controlling AGI will be possible, or that it is like the regular tools we know, so we need to ditch that mindset.
Edit: Genuinely curious about the downvotes. Would appreciate explicit criticism. Have been concerned that I am getting biased, because I have specific things to gain from using LLMs, and my lack of a computer science background almost certainly has me missing crucial information here. Would appreciate pointers on that, so I can educate myself. Obviously, working on human and animal minds has me biased to use those as a reference frame, and AIs are not like humans in multiple important ways. All the same, I do find it strange that we seem to not utilise lessons from how humans learn moral norms, even though we have a working practice for teaching ethics to complex minds here and are explicitly attempting to build an AI that has human capabilities and flexibility.