Have agentized LLMs changed the alignment landscape? I’m not sure.
People are doing a bunch of work on LLM alignment, which is definitely useful for aligning an agent built on top of that LLM. But it’s not the whole picture, and I don’t see as many people as I’d like thinking about agent-specific alignment issues.
But I still expect agentized LLMs to change the alignment landscape. They still seem pretty likely to be the first transformative and dangerous AGIs.
Progress has been a bit slower than I expected. I think there are three main reasons:
Chain of thought doesn’t work as well by default as I expected.
Human cognition relies heavily on chain of thought, also known as System 2 processing. But we don't put enough of it into language for the standard training set to capture our skill at step-by-step reasoning. That's why it took specialized training, as in o1, R1, QwQ, the new Gemini 2.0 Flash Thinking Experimental, etc., to improve CoT reasoning.
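For concreteness, here's a minimal sketch of the difference between direct and chain-of-thought prompting. The `call_llm` helper and the prompt wording are hypothetical placeholders, not any particular provider's API:

```python
# Minimal sketch of direct vs. chain-of-thought prompting.
# `call_llm` is a hypothetical stand-in for any chat-style LLM API.

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real call to your LLM provider.
    return f"[model reply to: {prompt[:40]}...]"

question = (
    "A bat and a ball cost $1.10 together. The bat costs $1.00 "
    "more than the ball. How much does the ball cost?"
)

# Direct prompting: base models often pattern-match to the intuitive
# wrong answer ($0.10 instead of $0.05).
direct_answer = call_llm(question)

# Chain-of-thought prompting: ask for intermediate steps first.
# This only works as well as the training data supports it -- hence
# the specialized training behind o1-style reasoning models.
cot_answer = call_llm(
    question + "\nThink step by step, then give the final answer."
)

print(direct_answer)
print(cot_answer)
```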
Agents couldn't read webpages very well without vision.
This was unexpected. The web is written in HTML, which LLMs should be able to parse rather well, but in practice it is reportedly not written in very clear HTML. Combined with the fact that the low-hanging fruit for agents involves lots of internet use, this slowed progress, as developers spent time hand-engineering around frequent parsing failures. Anthropic's Claude with computer use and DeepMind's Astra and Mariner all use vision so they can handle arbitrary webpages.
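To see why "just parse the HTML" is harder than it sounds, here's a small sketch using only Python's standard library. The page snippet is invented, but it's representative of the class-name soup agents actually encounter:

```python
# Sketch: extracting the actionable element from realistic "div soup".
# The snippet is invented but representative; real pages are far noisier.
from html.parser import HTMLParser

MESSY_PAGE = """
<div class="x7f_2 q9">
  <div><div><span class="btn-txt">Add to cart</span></div></div>
  <div style="display:none">Add to cart</div>  <!-- hidden duplicate -->
  <div class="x7f_2">Sponsored</div>
</div>
"""

class TextDump(HTMLParser):
    """Naively collect all text, the way a text-only agent might."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

parser = TextDump()
parser.feed(MESSY_PAGE)
print(parser.chunks)
# ['Add to cart', 'Add to cart', 'Sponsored'] -- nothing in the markup
# says which one is the real button, which is part of why vision over
# rendered pages works better than parsing raw HTML.
```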
There was more enthusiasm for improving base models, relative to building better agents, than I expected. Major orgs now look to be turning their attention toward agents, so I expect progress to accelerate. And there's promising work at the few small orgs I know of working in stealth mode, so we might see some impressive reveals soon.
With those models in place and improvements surely in the pipeline, I expect progress on agents to proceed apace. This now appears to be the majority opinion among those building and funding LLM agents.
I have short timelines to "subhuman AGI" but a relatively slow takeoff to the really scary superhuman stuff, which I think is very good for our prospects of mastering alignment in time.
In retrospect, the biggest advantage of LLM agents is that LLMs are basically trained to follow instructions as intended, and agentic architectures can enhance that tendency. That's a non-consequentialist alignment goal that bypasses many of the most severe alignment worries by providing corrigibility that's not in conflict with a consequentialist goal. See Instruction-following AGI is easier and more likely than value aligned AGI and Max Harms' Corrigibility as Singular Target sequence for more depth.
Yudkowsky and similar alignment pessimists from the agent foundations school have not, to my knowledge, addressed that class of alignment proposals in any depth. I'm looking forward to hearing their takes.