I did computational cognitive neuroscience research from completing my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making, focusing on the emergent interactions needed to explain complex thought. I became increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I’m incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.
Seth Herd
I think future more powerful/useful AIs will understand our intentions better IF they are trained to predict language. Text corpora contain rich semantics about human intentions.
I can imagine other AI systems that are trained differently, and I would be more worried about those.
That’s what I meant by current AI understanding our intentions possibly better than future AI.
This is an excellent point.
While LLMs seem (relatively) safe, we may very well blow right on by them soon.
I do think that many of the safety advantages of LLMs come from their understanding of human intentions (and therefore implied values). Those would be retained in improved architectures that still predict human language use. If such a system’s thought process was entirely opaque, we could no longer perform Externalized reasoning oversight by “reading its thoughts”.
But I think it might be possible to build a reliable agent from unreliable parts. I think humans are such agents, and evolution made us this way because it’s a way to squeeze extra capability out of a set of base cognitive capacities.
Imagine an agentic set of scaffolding that merely calls the super-LLM for individual cognitive acts. Such an agent would use a hand-coded “System 2” thinking approach to solve problems, like humans do. That involves breaking a problem into cognitive steps. We also use System 2 for our biggest ethical decisions; we predict consequences of our major decisions, and compare them to our goals, including ethical goals. Such a synthetic agent would use System 2 for problem-solving capabilities, and also for checking plans for how well they achieve goals. This would be done for efficiency; spending a lot of compute or external resources on a bad plan would be quite costly. Having implemented it for efficiency, you might as well use it for safety.
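Here’s a minimal sketch of the shape I have in mind, in Python. The `llm()` function and all the prompt strings are hypothetical placeholders, not any real API:

```python
def llm(prompt: str) -> str:
    """Stand-in for a call to the underlying model; wire up a real one here."""
    raise NotImplementedError

def solve(problem: str, goals: str, max_attempts: int = 3) -> str | None:
    for _ in range(max_attempts):
        # System 2: break the problem into explicit cognitive steps.
        steps = llm(f"List the steps needed to solve:\n{problem}").splitlines()

        # One model call per cognitive act, accumulating a plan.
        plan: list[str] = []
        for step in steps:
            context = "Plan so far:\n" + "\n".join(plan)
            plan.append(llm(f"{context}\nCarry out this step: {step}"))

        # Predict consequences and compare them to goals (including ethical
        # goals). The same check serves efficiency and safety: a bad plan
        # is costly whether or not it's dangerous.
        plan_text = "\n".join(plan)
        consequences = llm(f"Predict the consequences of this plan:\n{plan_text}")
        verdict = llm(
            f"Goals: {goals}\nPredicted consequences: {consequences}\n"
            f"Do the consequences satisfy the goals? Answer PASS or FAIL."
        )
        if verdict.strip().startswith("PASS"):
            return plan_text
    return None  # no plan passed review; don't act
```

The point of the sketch is just that the consequence-prediction step you’d build anyway for efficiency is the same step you’d use for a safety check.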
This is just restating stuff I’ve said elsewhere, but I’m trying to refine the model, and work through how well it might work if you couldn’t apply any external reasoning oversight, and little to no interpretability. It’s definitely bad for the odds of success, but not necessarily crippling. I think.
This needs more thought. I’m working on a post on System 2 alignment, as sketched out briefly (and probably incomprehensibly) above.
Please just wait until you have the podcast link before posting these to LW? If you went to the trouble of making a podcast, we probably don’t want to read it as text.
This is now available as a podcast if you search. I don’t have the RSS feed link handy.
I agree, I have heard that claim many times, probably including the vague claim that it’s “more dangerous” than a poorly-defined imagined alternative. A bunch of pessimistic stuff in the vein of List of Lethalities focuses on reinforcement learning, analyzing how and why that is likely to go wrong. That’s what started me thinking about true alternatives.
So yes, that does clarify why you’ve framed it that way. And I think it’s a useful question.
In fact, I would’ve been prone to say “RL is unsafe and shouldn’t be used”. Porby’s answer to your question is insightful; it notes that other types of learning aren’t that different in kind. It depends how the RL or other learning is done.
One reason that non-RL approaches (at least the few I know of) seem safer is that they rely on prediction or other unsupervised learning to create good, reliable representations of the world, including goals for agents. That type of learning is typically better because you can do more of it: it doesn’t need human-labeled data, which is always many orders of magnitude scarcer than data gathered from sensing the world (e.g., language input for LLMs, images for vision, etc.). The other alternative is a reward-labeling algorithm that can attach reward signals to any data, but that seems unreliable, in that we don’t have even good guesses at an algorithm that can identify human values, or even reliably identify instruction-following.
Surely asking if anything is safer is only sensible when comparing it to something. Are you comparing it to some implicit expected-if-not RL method of alignment? I don’t think we have a commonly shared concept of what that would be. That’s why I’m pointing to some explicit alternatives in that post.
Compared to what?
If you want an agentic system (and I think many humans do, because agents can get things done), you’ve got to give it goals somehow. RL is one way to do that. The question of whether that’s less safe isn’t meaningful without comparing it to another method of giving it goals.
The method I think is both safer and implementable is giving goals in natural language, to a system that primarily “thinks” in natural language. I think this is markedly safer than any RL proposal anyone has come up with so far. And there are some other options for specifying goals without using RL, each of which does seem safer to me:
Goals selected from learned knowledge: an alternative to RL alignment
I get conservation of expected evidence. But the distribution of belief changes is completely unconstrained.
Going from the class of martingales to the subclass of Brownian motion is arbitrary, and the choice of 1% update steps is a second arbitrary, unjustified assumption.
I think asking about the likely possible evidence paths would improve our predictions.
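To illustrate with a toy simulation (the numbers are arbitrary): conservation of expected evidence only pins down the mean of each update, not its distribution. Both processes below are martingales, but their paths look nothing alike:

```python
import random

def brownian_step(p: float) -> float:
    """The post's assumption: small symmetric moves of 1%."""
    return min(max(p + random.choice([-0.01, 0.01]), 0.0), 1.0)

def jumpy_step(p: float, q: float = 0.01) -> float:
    """Rare decisive evidence: jump to 1 with probability q, otherwise
    drift down. Expected update: q*1 + (1-q)*(p-q)/(1-q) = p, so this
    is also a martingale (boundary clipping aside)."""
    if random.random() < q:
        return 1.0
    return max((p - q) / (1 - q), 0.0)
```

Starting from p = 0.5, the first process takes a few thousand steps on average to reach certainty and swings through intermediate values constantly; the second sits near its prior, drifting slowly downward, and occasionally snaps straight to 1. Any prediction about how many 10% swings to expect depends entirely on which shape the evidence stream has.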
You spelled it conversation of expected evidence. I was hoping there was another term by that name :)
But… Why would p(doom) move like Brownian motion until stopping at 0 or 1?
I don’t disagree with your conclusions; there’s a lot of evidence coming in, and if you’re spending full time or even part time thinking about alignment, there are a lot of important updates to make from it. But assuming a random walk seems wrong.
Is there a reason that a complex, structured unfolding of reality would look like a random walk?
I think this is quite similar to my proposal in Capabilities and alignment of LLM cognitive architectures.
I think people will add cognitive capabilities to LLMs to create fully capable AGIs. One such important capability is executive function. That function is loosely defined in cognitive psychology, but it is crucial for planning among other things.
I do envision such planning looking loosely like a search algorithm, as it does for humans. But it’s a loose search algorithm, working in the space of statements made by the LLM about possible future states and action outcomes. So it’s more like a tree of thought or graph of thought than any existing search algorithm, because the state space isn’t well defined independently of the algorithm.
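A minimal sketch of what that loose search might look like, with `llm()` as a hypothetical stand-in for the model. Note that the “states” here are just strings the model generated, so the state space only exists as the algorithm unfolds:

```python
import heapq

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in for the base model

def loose_search(goal: str, start: str, width: int = 3, depth: int = 4) -> str:
    """Best-first search over natural-language 'states' the LLM invents."""
    frontier = [(0.0, start)]  # (negative score, state description)
    for _ in range(depth):
        _, state = heapq.heappop(frontier)
        # The LLM proposes successors: actions and their predicted outcomes.
        outcomes = llm(
            f"Goal: {goal}\nCurrent state: {state}\n"
            f"List {width} possible actions, one per line, each with its "
            f"predicted outcome."
        ).splitlines()[:width]
        for outcome in outcomes:
            # Assumes the model returns a bare number for the rating.
            score = float(llm(f"On a 0-1 scale, how close is this to the "
                              f"goal '{goal}'?\n{outcome}"))
            heapq.heappush(frontier, (-score, outcome))
    return min(frontier)[1]  # most promising state found
```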
That all keeps things more dependent on the LLM black box, as in your final possibility.
At least I think that’s the analogy between the proposals? I’m not sure.
I think the pushback to both of these is roughly: this is safer how?
I don’t think there’s any way to strictly formalize not harming humans. My answer is halfway between that and your “sentiment analysis in each step of planning”. I think we’ll define rules of behavior in natural language (including not harming humans, but probably much more elaborate than that), and implement both internal review, like your sentiment analysis but more elaborate, and external review by humans aided by tool AI (doing something like sentiment analysis), in a form of scalable oversight.
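Concretely, internal review might look something like this minimal sketch (the `llm()` call and the rule text are placeholders I made up, not a real spec):

```python
RULES = """1. Don't harm humans or manipulate them.
2. Stay within the scope of the task you were given.
3. Flag any irreversible or high-stakes action for human review."""

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in for the base model

def internal_review(plan_steps: list[str]) -> list[str]:
    """Return steps needing external review (humans aided by tool AI)."""
    flagged = []
    for step in plan_steps:
        verdict = llm(
            f"Rules of behavior:\n{RULES}\n\nPlanned step: {step}\n"
            f"Does this step violate or risk violating any rule? "
            f"Answer OK, or FLAG followed by the reason."
        )
        if verdict.strip().startswith("FLAG"):
            flagged.append(f"{step} :: {verdict}")
    return flagged  # empty means the plan passed internal review
```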
I’m curious if I’m interpreting your proposal correctly. It’s stated very succinctly, so I’m not sure.
Yeah. Well, since it was addressing a tribe of nomadic herders in prehistoric times, that in itself is a good thing :)
At the core, this is a reminder to not publish things that will help more with capabilities than alignment. That’s perfectly reasonable.
The tone of the post suggests erring on the side of “safety” by not publishing things that have an uncertain safety/capabilities balance. I hope that wasn’t the intent.
Because that does not make sense. Anything that advances alignment more than capabilities, in expectation, should be published.
You have to make a difficult judgment call for each publication. Be mindful of your bias in wanting to publish to show off your work and ideas. Get others’ insights if you can do so reasonably quickly.
But at the end of the day, you have to make that judgment call. There’s no consolation prize for saying “at least I didn’t make the world end faster”. If you’re a utilitarian, winning the future is the only goal.
(If you’re not a utilitarian, you might actually want a resolution faster so you and your loved ones have higher odds of surviving into the far future.)
Then God isn’t “good” as humans mean the term. That’s always been one possible explanation.
There’s also some more in his interview with Dwarkesh Patel just before then. I wrote this brief analysis of that interview WRT alignment, and this talk seems to confirm that I was more-or-less on target.
So, here are answers to your questions, noting where I’m guessing at Shane’s thinking and where it’s my own.
This is overlapping with the standard story AFAICT, and 80% of alignment work is sort of along these lines. I think what Shane’s proposing is pretty different in an important way: it includes System 2 thinking, where almost all alignment work is about aligning the way LLMs give quick answers, analogous to human System 1 thinking.
How do we get a model that is genuinely robustly trying to obey the instruction text, instead of e.g. choosing actions on the basis of a bunch of shards of desire/drives that were historically reinforced[?]
Shane seemed to say he wants to use zero reinforcement learning in the scaffolded agent system, a stance I definitely agree with. I don’t think it matters much whether RLHF was used to “align” the base model, because it’s going to have implicit desires/drives from the predictive training of human text, anyway. Giving instructions to follow doesn’t need to have anything to do with RL; it’s just based on the world model, and putting those instructions as a central and recurring prompt for that system to produce plans and actions to carry out those instructions.
So, how we get a model to robustly obey the instruction text is by implementing system 2 thinking. This is “the obvious thing” if we think about human cognition. System 2 thinking would be applying something more like a tree of thought algorithm, which checks through predicted consequences of the action, and then makes judgments about how well those fulfill the instruction text. This is what I’ve called internal review for alignment of language model cognitive architectures.
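As a sketch of that check (my illustration, not Shane’s actual design; `llm()` and the prompts are hypothetical):

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in for the base model

def obeys_instructions(action: str, instructions: str, n_checks: int = 3) -> bool:
    """System 2 gate: predict consequences, judge them against the
    standing instruction text, and require unanimous approval."""
    consequences = llm(f"Predict the likely consequences of: {action}")
    verdicts = [
        llm(f"Instructions: {instructions}\n"
            f"Predicted consequences: {consequences}\n"
            f"Do the consequences fulfill the instructions? Answer YES or NO.")
        for _ in range(n_checks)
    ]
    return all(v.strip().upper().startswith("YES") for v in verdicts)
```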
To your second and third questions: I didn’t see answers from Shane in either the interview or that talk, but I think they’re the obvious next questions, and they’re what I’ve been working on since then. I think the answers are that we’ll try to make the instructions as scope-limited as possible, that we’ll want to carefully check how they’re interpreted before setting the AGI any major tasks, and that we’ll want to limit autonomous action to whatever degree still leaves the system effective.
Humans will want to remain closely in the loop to deal with inevitable bugs and unintended interpretations and consequences of instructions. I’ve written about this briefly here, and will soon be publishing a more thorough argument for why I think we’ll do this by default, and why I think it will actually work if it’s done relatively carefully and wisely. Following that, I’m going to write more on the System 2 alignment concept, and I’ll try to actually get Shane to look at it and say whether it’s the same thing he’s thinking of in this talk, or at least close.
In all, I think this is both a real alignment plan and one that can work (at least for technical alignment—misuse and multipolar scenarios are still terrifying), and the fact that someone in Shane’s position is thinking this clearly about alignment is very good news.
I agree with all of that, even while being sceptical that LLMs plus search will reach AGI. The lack of constraint satisfaction as the human brain does it could be a real stumbling block.
But LLMs have copied a good bit of our reasoning and therefore our semantic search. So they can do something like constraint satisfaction.
Put the constraints into a query, and the answer will satisfy those constraints. The process used is different from the human brain’s, but for every problem I can think of, the results are the same.
Now, that’s partly because every problem I can think of is one I’ve already seen solved. But my ability to do truly novel problem solving is rarely used and pretty limited. So I’m not sure the LLM can’t do just as good a job if it had a scaffolded script to explore its knowledge base from a few different angles.
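A concrete version of “put the constraints into a query”, plus the multi-angle exploration I mean by a scaffolded script (the prompt wording and angle list are just illustrative):

```python
def constrained_query(task: str, constraints: list[str]) -> str:
    """Build a prompt whose answer must satisfy explicit constraints."""
    bullets = "\n".join(f"- {c}" for c in constraints)
    return (f"{task}\nYour answer must satisfy all of these constraints:\n"
            f"{bullets}\nCheck each constraint explicitly before answering.")

# Explore the model's knowledge base from a few different angles:
ANGLES = ["first principles", "analogy to a solved problem",
          "failure modes of the obvious approach"]

def multi_angle(task: str, constraints: list[str]) -> list[str]:
    """One constrained prompt per angle, to be sent as separate queries."""
    return [constrained_query(f"Approach via {angle}: {task}", constraints)
            for angle in ANGLES]
```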
Fair enough, thank you! Regardless, it does seem like a good reason to be concerned about alignment. If you have no idea how intelligence works, how in the world would you know what goals your created intelligence is going to have? At that point, it is like alchemy—performing an incantation and hoping not just that you got it right, but that it does the thing you want.
Nothing in this post or the associated logic says LLMs make AGI safe, just safer than what we were worried about.
Nobody with any sense predicted runaway AGI by this point in history. There’s no update from other forms of AI not having worked yet.
There’s a weird thing where lots of people’s p(doom) went up when LLMs started to work well, because they found it an easier route to intelligence than they’d been expecting. If it’s easier, it happens sooner and with less thought surrounding it.
See Porby’s comment on his risk model for language model agents. It’s a more succinct statement of my views.
LLMs are easy to turn into agents, so let’s not get complacent. But they are remarkably easy to control and align, so that’s good news for aligning the agents we build from them. That doesn’t get us out of the woods, though: there are new issues with self-reflective, continuously learning agents, and there’s plenty of room for misuse and conflict escalation in a multipolar scenario, even if alignment turns out to be dead easy for anyone who bothers to try.
That is a fascinating take! I haven’t heard it put that way before. I think that perspective is a way to understand the gap between old-school agent foundations folks’ high p(doom) and new-school LLMers’ relatively low p(doom), something I’ve been working to understand and hope to publish on soon.
To the extent this is true, I think that’s great, because I expect to see some real insights on intelligence as LLMs are turned into functioning minds in cognitive architectures.
Do you have any refs for that take, or is it purely a gestalt?
Interesting, and good job publishing rather than polishing!
I really like the terminology of competence vs. intelligence.
I don’t think you want to use the term intelligence for your level 3. I think I see why you want to; but intelligence is currently an umbrella term for any cognitive capacity, so you’re invoking different intuitions when you use it for one particular cognitive capacity.
In either case, I think you should draw the analogy more closely with Level 3 and problem-solving. At least if you think it exists.
Suppose I’m a hunter-gatherer, and there is fruit high up in a tree. This tree has thorns, so my usual strategy of climbing it and shaking branches won’t work. If I figure out, through whatever process of association, simulation, and trial and error, that I can get a long branch from another tree and then knock the fruit down, I can incorporate that into my level 2 cognition, and from there into level 1. This type of problem-solving is also probably the single cognitive ability most often referred to as intelligence, which justifies your use of the term for that level. If I’m right that you’d agree with all of that, making the analogy explicit could make the terminology more intuitive to the reader.
In any case, I’m glad to see you thinking about cognition in relation to alignment. It’s obviously crucial; I’m unclear if most people just aren’t thinking about it, or if it’s all considered too infohazardous.
Interesting! I’m not following everything, but it sounds like you’re describing human cognition for the most part.
I found it interesting that you used the phrase “constraint satisfaction”. I think this concept is crucial for understanding human intelligence; but it’s not used very widely. So I’m curious where you picked it up.
I agree with your conclusion on the alignment section: these are low-resolution ideas that seem worth fleshing out.
Good job putting this out there without obsessively polishing it. That shares at least some of your ideas with the rest of us, so we can build on them in parallel with you polishing your understanding and your presentation.
It’s helpful to include a summary with linkposts.
So here’s a super quick one. I didn’t listen to it closely, so I could’ve missed something.
It’s about the article No “Zero-Shot” Without Exponential Data
Here’s the key line from the abstract:
So, we might not continue to get better performance if we need exponentially larger datasets to get small linear improvements. This seems quite plausible, unless somebody comes up with some sort of clever bootstrapping in which automatic labeling of images and videos, with a little human feedback, creates useful datasets of unlimited size.
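To spell out the arithmetic behind that worry: if performance scales log-linearly with dataset size,

$$P(N) = a \log N + b \quad\Rightarrow\quad P(kN) - P(N) = a \log k,$$

so every fixed gain $\Delta P$ costs a constant multiplicative factor $k = e^{\Delta P / a}$ in data: linear improvement, exponential data. (The log-linear form here is my reading of the paper’s claim, not their exact fit.)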
I don’t think this is going to cause much of a slowdown in AGI progress, because we don’t need much more progress on foundation models to build scaffolded agentic cognitive architectures that use System 2-type cognition to gauge their accuracy and the importance of each judgment, and use multiple tries on multiple models for important cognitive acts. That’s how humans are as effective as we are: we monitor and double-check our own cognition when appropriate.