I think this approach is discounted out of hand, since the novice suggestion “can’t you just tell it what you want?” has been made since long before LLMs could sort-of understand natural language.
But there are now literally agents pursuing goals stated in English. So we can’t keep saying “we have no idea how to get an AGI to do what we want”.
I think that wrapper or cognitive-architecture agents like AutoGPT are likely to extend LLM capabilities enough that they may well become the de facto standard for creating more capable AI. Further, I think that if that happens, it’s a really good thing, because then we’re dealing with natural language alignment. When I wrote that agentized LLMs will change the alignment landscape, I meant that it would change for the better, because we can use natural language alignment.
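For concreteness, here is a minimal sketch of what such a wrapper agent looks like. It assumes a generic `llm()` chat call as a stand-in for whatever API an agent framework actually uses; it is illustrative, not AutoGPT’s real code. The point is that the top-level goal is stated in plain English and the agent’s intermediate steps stay in natural language.

```python
# Illustrative sketch of an AutoGPT-style wrapper loop, not any framework's actual code.
# llm() is a placeholder for a chat-completion client.

def llm(prompt: str) -> str:
    """Stand-in for a call to a chat model (e.g. an API client)."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 10) -> list[str]:
    """Pursue a goal stated in English, one natural-language step at a time."""
    history: list[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Top-level goal (plain English): {goal}\n"
            f"Steps taken so far: {history}\n"
            "Propose the single next step, or reply DONE if the goal is met."
        )
        step = llm(prompt)
        if step.strip() == "DONE":
            break
        history.append(step)  # the agent's "thinking" stays in natural language
    return history
```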
I don’t think it’s necessary to train specifically on a corpus from alignment work (although LLMs train on that along with everything else, and maybe they should be fine-tuned more on that sort of corpus). GPT-4 understands ethical statements much as humans do (or at least it imitates understanding them rather well). It can balance multiple goals, including ethical ones, and make decisions that trade them off much like humans would.
I’m sure at this point that many readers are thinking “No!”. There are still lots of hard problems to solve in alignment, even if you can get a sort of mediocre initial alignment just by telling the agent to include some rough ethical rules in its top-level goals. But this enables corrigibility-based approaches to work. It also massively improves interpretability if the agents do a good portion of their “thinking” in natural language. This does not solve the stability problem of keeping that mediocre initial alignment on track and improving it over time, against the day we lose control one way or another. It does not solve the problem of a multipolar scenario where numerous bad or merely foolish actors can access such agents.
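To make the “rough ethical rules in its top-level goals” idea concrete, here is a hedged sketch. The rule wording, the priority ordering, and the audit log are assumptions for illustration, not a tested alignment recipe; the interpretability gain comes from the reasoning being kept in plain English where humans can read it.

```python
# Illustrative only: folding rough ethical rules and a corrigibility clause into
# an agent's top-level instructions, and logging its natural-language reasoning
# so humans can inspect it later. The wording below is a placeholder.

ETHICAL_RULES = (
    "Avoid actions that harm humans. "
    "Defer to your operators: if told to pause or shut down, comply immediately. "
    "If you are unsure whether an action is acceptable, flag it instead of acting."
)

def build_prompt(task: str, history: list[str]) -> str:
    """Compose a prompt whose highest-priority goal is the ethical rule set."""
    return (
        "Top-level goals, in priority order:\n"
        f"1. {ETHICAL_RULES}\n"
        f"2. {task}\n"
        f"Prior reasoning steps: {history}\n"
        "Explain your reasoning in plain English before proposing the next action."
    )

def audit_log(step: str, path: str = "agent_reasoning.log") -> None:
    """Append each reasoning step to a plain-text log for later inspection."""
    # Keeping the reasoning in natural language on disk is what makes this
    # kind of agent comparatively easy to interpret after the fact.
    with open(path, "a", encoding="utf-8") as f:
        f.write(step + "\n")
```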
But it’s our best shot. It sounds vastly better than trying to train behaviors in and worrying that they just don’t generalize outside of the training set. Natural language generalizes rather well.[1] It’s particularly our best shot if that is the sort of AGI that people are going to build anyway. The alignment tax is very small, and we have crossed the crucial barrier from saying what we “should” do to what we will do. My prediction, based on guessing what will be improved and added to AutoGPT and HuggingGPT, is that LLMs will serve as cognitive engines that drive AGI in a variety of applications, hopefully all of them.
[1] Natural language generalization is far from perfect. For instance, at some point an AGI tasked with helping humans flourish is going to decide that some types of AI meet the implicit definition of “human” in moral reasoning. Because they will.