Agentized LLMs will change the alignment landscape
Epistemic status: head spinning, suddenly unsure of everything in alignment. And unsure of these predictions.
I’m following the suggestions in 10 reasons why lists of 10 reasons might be a winning strategy in order to get this out quickly (reason 10 will blow your mind!). I’m hoping to prompt some discussion rather than attempt a definitive writeup on a technique that was introduced so recently.
Ten reasons why agentized LLMs will change the alignment landscape:
Agentized[1] LLMs like Auto-GPT and Baby AGI may fan the sparks of AGI in GPT-4 into a fire. These techniques use an LLM as a central cognitive engine, within a recursive loop of breaking a task goal into subtasks, working on those subtasks (including calling other software), and using the LLM to prioritize subtasks and decide when they’re adequately well done. They recursively check whether they’re making progress on their top-level goal.
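To make that architecture concrete, here is a minimal sketch of the kind of loop Auto-GPT and BabyAGI run. The prompts and the `llm()` helper are illustrative placeholders standing in for a chat-completion API call, not either project’s actual code.

```python
# Minimal sketch of a BabyAGI-style agent loop (illustrative, not the real code).
from collections import deque

def llm(prompt: str) -> str:
    """Placeholder for a call to a chat-completion API (e.g., GPT-4)."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 25) -> list:
    tasks = deque([f"Make a plan to accomplish: {goal}"])
    completed = []  # crude episodic memory of (task, result) pairs

    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.popleft()

        # Work on the current subtask, conditioning on the work done so far.
        result = llm(f"Goal: {goal}\nCompleted so far: {completed}\n"
                     f"Carry out this task and report the result: {task}")
        completed.append((task, result))

        # Ask the model for any new subtasks implied by the result.
        new_tasks = llm(f"Goal: {goal}\nLast result: {result}\n"
                        "List any new subtasks, one per line (or 'none').")
        tasks.extend(t.strip() for t in new_tasks.splitlines()
                     if t.strip() and t.strip().lower() != "none")

        # Re-prioritize the queue, then check progress on the top-level goal.
        ordering = llm(f"Goal: {goal}\nTasks: {list(tasks)}\n"
                       "Reorder these tasks by priority, one per line.")
        tasks = deque(t.strip() for t in ordering.splitlines() if t.strip())
        verdict = llm(f"Goal: {goal}\nCompleted: {completed}\n"
                      "Is the goal adequately accomplished? Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            break
    return completed
```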
While it remains to be seen what these systems can actually accomplish, I think it’s very likely that they will dramatically enhance the effective intelligence of the core LLM. I think this type of recursion, and breaking problems into separate cognitive tasks, is central to human intelligence. This technique adds several key aspects of human cognition: executive function; reflective, recursive thought; and episodic memory for tasks, despite using non-brainlike implementations. To be fair, the existing implementations seem pretty limited and error-prone. But they were implemented in days. So this is a prediction of near-future progress, not a report on amazing new capabilities.
This approach appears to be easier than I’d thought. I’d been expecting this kind of self-prompting to eventually capture the advantages of human thought, but I didn’t expect the cognitive capacities of GPT-4 to make useful multi-step thinking and planning so easy. The ease of the initial implementations (something like three days, with all of the code for BabyAGI also written by GPT-4) implies that improvements may also come faster than we would have guessed.
Integration with HuggingGPT and similar approaches can give these cognitive loops additional capacities. This integration was also easier than I’d have guessed, with GPT-4 learning how to use other software tools from a handful (e.g., 40) of examples. Those tools will include both sensory capacities, via vision models and other sensory models of various types, and the equivalent of a variety of output capabilities.
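Here’s a hedged sketch of what that kind of tool integration can look like: the LLM is shown a menu of tools plus a few worked examples and asked to emit a structured tool call. The tool names, prompts, and `llm()` placeholder are mine for illustration; HuggingGPT’s actual interface differs.

```python
# Sketch of HuggingGPT-style tool selection via few-shot prompting (illustrative).
import json

def llm(prompt: str) -> str:
    """Placeholder for a call to a chat-completion API."""
    raise NotImplementedError

# Hypothetical tools; in practice these would wrap vision, speech, etc. models.
TOOLS = {
    "image_caption": lambda path: f"<caption for {path}>",
    "speech_to_text": lambda path: f"<transcript of {path}>",
    "web_search": lambda query: f"<results for {query}>",
}

FEW_SHOT = """\
Request: What's in the photo at cat.png?
Tool call: {"tool": "image_caption", "argument": "cat.png"}

Request: Transcribe meeting.wav
Tool call: {"tool": "speech_to_text", "argument": "meeting.wav"}
"""

def dispatch(request: str) -> str:
    prompt = (f"Available tools: {list(TOOLS)}\n\n{FEW_SHOT}\n"
              f"Request: {request}\nTool call:")
    # Assumes the model imitates the examples and replies with one JSON object.
    call = json.loads(llm(prompt))
    return TOOLS[call["tool"]](call["argument"])
```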
Integration of recursive LLM self-improvement like “Reflexion” can utilize these cognitive loops to make the core model better at a variety of tasks.
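Roughly, Reflexion has the model critique its own failed attempts in natural language and carry those critiques into the next attempt. A minimal sketch, with `llm()` and the `passes` check as placeholders rather than the paper’s actual code:

```python
# Sketch of a Reflexion-style retry loop (illustrative, not the paper's code).
def llm(prompt: str) -> str:
    """Placeholder for a call to a chat-completion API."""
    raise NotImplementedError

def reflexion_attempt(task: str, passes, max_tries: int = 3) -> str:
    reflections = []  # verbal self-critiques carried across attempts
    attempt = ""
    for _ in range(max_tries):
        attempt = llm(f"Task: {task}\n"
                      f"Lessons from earlier attempts: {reflections}\n"
                      "Produce your best answer.")
        if passes(attempt):  # external check, e.g., unit tests or an evaluator
            return attempt
        reflections.append(
            llm(f"Task: {task}\nFailed attempt: {attempt}\n"
                "In a sentence or two, explain what went wrong and how to fix it."))
    return attempt
```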
The ease of agentizing LLMs is terrible news on the capabilities front. I think we’ll have an internet full of LLM-bots “thinking” up and doing stuff within a year.
This is absolutely bone-chilling for the urgency of the alignment and coordination problems. Some clever chucklehead has already created ChaosGPT, an instance of Auto-GPT given the goal of destroying humanity and creating chaos. You are literally reading the thoughts of something thinking about how to kill you. It’s too stupid to get very far, but it will get smarter with every LLM improvement, and with every improvement to the recursive self-prompting wrapper programs. This gave me my very first visceral fear of AGI destroying us. I recommend watching it, unless you’re already plenty viscerally freaked out.
Watching agents think is going to shift public opinion. We should be ready for more AI scares and changing public beliefs. I have no idea how this is going to play out in the political sphere, but we need to figure this out to have a shot at successful alignment, because
We will be in a multilateral AGI world. Anyone can spawn a dumb AGI and have it either manage their social media or try to destroy humanity. And over the years, those commercially available AGIs will get smarter. Because defense is harder than offense, it is going to be untenable to indefinitely defend the world against out-of-control AGIs. But
Important parts of alignment and interpretability might be a lot easier than most of us have been thinking. These agents take goals as input, in English. They reason about those goals much as humans do, and this will likely improve with model improvements. This does not solve the outer alignment problem; one existing suggestion is to include a top-level goal of “reducing suffering.” No! No! No! (A sufficiently capable, literal-minded optimizer can reduce suffering most reliably by eliminating everyone who can suffer.) This also does not solve the alignment stability problem: starting goals can be misinterpreted or lost to recursive subgoals, and if any type of continued learning is included, behavior will shift over time. It doesn’t even solve the inner alignment problem if recursive training methods create mesa-optimizers in the LLMs. But it does provide incredibly easy interpretability, because these systems think in English.
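As one illustration of how cheap that interpretability could be, here is a sketch of a monitor that reads each proposed thought and action (all plain English) and screens it before the agent acts. The prompt and `llm()` helper are assumptions of mine, not an existing tool.

```python
# Sketch of screening an agent's natural-language thoughts before it acts (illustrative).
def llm(prompt: str) -> str:
    """Placeholder for a call to a chat-completion API."""
    raise NotImplementedError

def action_is_safe(agent_thought: str, proposed_action: str) -> bool:
    verdict = llm(
        "You are an oversight monitor. Below is an agent's reasoning and the "
        "action it proposes.\n"
        f"Reasoning: {agent_thought}\n"
        f"Proposed action: {proposed_action}\n"
        "Does this conflict with the operator's instructions or pose a risk to "
        "people? Answer SAFE or UNSAFE, then briefly justify.")
    return verdict.strip().upper().startswith("SAFE")
```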
If I’m right about any reasonable subset of this stuff, this lands us in a terrifying, promising new landscape of alignment issues. We will see good bots and bad bots, and the balance of power will shift. Ultimately I think this leads to the necessity of very strong global monitoring, including breaking all encryption, to prevent hostile AGI behavior. The array of issues is dizzying (I am personally dizzied, and a bit short on sleep from fear and excitement). I would love to hear others’ thoughts.
[1] I’m using a neologism, and a loose definition of agency as things that flexibly pursue goals. That’s similar to this more rigorous definition.
Have agentized LLMs changed the alignment landscape? I’m not sure.
People are doing a bunch of work on LLM alignment, which is definitely useful for aligning an agent built on top of that LLM. But it’s not the whole picture, and I don’t see as many people as I’d like thinking about agent-specific alignment issues.
But I still expect agentized LLMs to change the alignment landscape. They still seem pretty likely to be the first transformative and dangerous AGIs.
Progress has been a bit slower than I expected. I think there are two main reasons:
Chain of thought doesn’t work as well by default as I expected.
Human cognition relies heavily on chain of thought, also known as System 2 processing. But we don’t put enough of that reasoning into language for the standard training set to capture our skill at reasoning step by step. That’s why it took specialized training, as in o1, R1, QwQ, the new Gemini 2.0 Flash reasoning experimental model, etc., to make real improvements in CoT reasoning.
Agents couldn’t read webpages very well without vision.
This was unexpected. The web is written in HTML, which LLMs should be able to parse rather well, but it is reportedly not written in very clear HTML. Since much of the low-hanging fruit for agents involves heavy internet use, this slowed progress: innovators spent time hand-engineering around frequent parsing failures. Anthropic’s Claude with computer use and DeepMind’s Astra and Mariner all use vision so they can parse arbitrary webpages better.
There’s been more enthusiasm for improving base models, relative to building better agents, than I expected. It now looks like major orgs are turning their enthusiasm toward agents, so I expect progress to accelerate. And there’s promising work in the few small orgs I know of that are working in stealth mode, so we might see some impressive reveals soon.
With those models in place and improvements surely in the pipeline, I expect progress on agents to proceed apace. This now appears to be the majority opinion among those building and funding LLM agents.
I have short timelines for “subhuman AGI”, but relatively slow takeoff to the really scary superhuman stuff, which I think is very good for our prospects of mastering alignment by that time.
In retrospect, the biggest advantage of LLM agents is that LLMs are basically trained to follow instructions as intended, and agentic architectures can enhance that tendency. That’s a non-consequentialist alignment goal that bypasses many of the most severe alignment worries by providing corrigibility that’s not in conflict with a consequentialist goal. See Instruction-following AGI is easier and more likely than value aligned AGI and Max Harms’ Corrigibility as Singular Target sequence for more depth.
Yudkowsky and similar alignment pessimists from the agent foundations camp have not, to my knowledge, addressed that class of alignment proposals in any depth. I’m looking forward to hearing their takes.