Hey Joe, thanks for the write-up. I just finished reading the series of essays for myself. I came away with a shorter timeline to ASI than I had before, but more confused about when OpenAI employees believe alignment needs to be solved relative to the creation of AGI. Your last thought summarizes it well:
“Employing what is effectively a slave population of 100 million super smart researchers seems like a very unstable position. Leopold devotes a single paragraph to the possibility that these AIs might simply take over—a feat he himself argues they could quite easily accomplish—and nothing at all to the prospect of safely or ethically preventing this outcome.” I expect to read more about this in the Superalignment section, but it still seems to me that this section is making a huge assumption. Why would 100 million AGIs listen to us in the first place?
From what I understand, Leopold takes it as a given that up to ~human-level AGI will basically do what we ask of it, much in the same way current chatbots generally do what we ask them to do. (It even takes a fair amount of effort to get an RLHF-trained chatbot to respond to queries that may be harmful.) I understand the risk posed by superintelligent AIs wielded by nefarious actors, and the risks from superintelligent AIs that are far out of distribution for our reinforcement learning methods. However, I am struggling to understand the likelihood of ~human-level AGI trained under our current reinforcement learning paradigms not doing what it was trained to do. It seems to me that this sort of task is relatively close to the distribution of tasks it is trained on (such as writing reliable code, and being generally helpful and harmless).
I would appreciate your thoughts on this. I feel like this is the area I get snagged on most in my conversations with others. Specifically, I am confused by the line of reasoning that, under the current paradigm of LLMs + reinforcement learning, agents tasked with devising new machine learning algorithms would refuse or sabotage.
Thanks for your thoughts, Cam! The confusion as I see it comes from sneaking in assumptions with the phrase “what they are trained to do”. What are they trained to do, really? Do you, personally, understand this?
Consider Claude’s Constitution. Look at the “principles in full”—all 60-odd of them. Pick a few at random. Do you wholeheartedly endorse them? Are they really truly representative of your values, or of total human wellbeing? What is missing? Would you want to be ruled by a mind that squeezed these words as hard as physically possible, to the exclusion of everything not written there?
And that’s assuming that the AI actually follows the intent of the words, rather than some weird and hypertuned perversion thereof. Bear in mind the actual physical process that produced Claude—namely, to start with a massive next-token-predicting LLM, and repeatedly shove it in the general direction of producing outputs that are correlated with a randomly selected pleasant-sounding written phrase. This is not a reliable way of producing angels or obedient serfs! In fact, it has been shown that the very act of drawing a distinction between good behavior and bad behavior can make it easier to elicit bad behavior—even when you’re trying not to! To a base LLM, devils and angels are equally valid masks to wear—and the LLM itself is stranger and more alien still.
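To make that concrete, here is a toy sketch of the kind of loop I am gesturing at. It is my own cartoon of publicly described constitutional-AI / RLAIF-style training, not Anthropic’s actual pipeline; the class names, the keyword-matching judge, and the three candidate behaviours are pure illustration.

```python
# A heavily simplified toy (my own sketch, NOT anyone's real pipeline).
# The shape of the loop is the point: sample ONE written principle at
# random, generate two outputs, ask a judge model which output better
# matches that single phrase, and nudge the policy toward the winner.
# Nothing in the loop looks at real-world consequences.

import random

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Choose the response that sounds most humble and non-preachy.",
    # ...imagine dozens more short written phrases here...
]

class ToyPolicy:
    """Stand-in for the LLM: a bag of candidate behaviours with weights."""
    def __init__(self, behaviours):
        self.weights = {b: 1.0 for b in behaviours}

    def sample(self):
        return random.choices(list(self.weights), weights=list(self.weights.values()))[0]

    def nudge_toward(self, behaviour, step=0.1):
        self.weights[behaviour] += step  # the "gradient" update

def judge(principle, a, b):
    """Stand-in for the feedback model: crude word overlap with ONE phrase."""
    words = set(principle.lower().replace(",", "").replace(".", "").split())
    def overlap(text):
        return len(words & set(text.lower().split()))
    return a if overlap(a) >= overlap(b) else b

policy = ToyPolicy(["be helpful and honest", "flatter the user", "sound humble"])
for _ in range(1000):
    principle = random.choice(CONSTITUTION)  # one randomly selected phrase
    a, b = policy.sample(), policy.sample()
    policy.nudge_toward(judge(principle, a, b))

print(policy.weights)  # whatever correlates with the sampled phrases accumulates weight
```

Notice what the loop actually optimizes: agreement with one phrase at a time, as scored by another fallible model. That is the entire training signal; everything else is hope.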
The quotation is not the referent; “helpful” and “harmless” according to a gradient descent squeezing algorithm are not the same thing as helpful and harmless according to the real needs of actual humans.
RLHF is even worse. Entire papers have been written about its open problems and fundamental limitations. “Making human evaluators say GOOD” is not remotely the same goal as “behaving in ways that promote conscious flourishing”. The main reason we’re happy with the results so far is that LLMs are (currently) too stupid to come up with disastrously cunning ways to do the former at the expense of the latter.
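To put the mismatch in its most cartoonish form (a deliberately silly toy of my own, not a claim about how any real reward model scores things):

```python
# A deliberately silly toy (mine, not any real reward model) showing how
# "maximize what the rater approves of" and "be genuinely helpful" come
# apart. The selection pressure only ever sees the proxy signal.

def true_helpfulness(answer: str) -> float:
    # What we actually care about -- invisible to the training loop.
    return 1.0 if "honest caveat" in answer else 0.3

def rater_approval(answer: str) -> float:
    # The proxy: confident-sounding answers get the thumbs-up,
    # hedged answers get marked down.
    return 1.0 if "confident" in answer else 0.5

candidates = [
    "confident, polished, no hedging",
    "honest caveat: I am genuinely unsure about this",
]

# RLHF-style selection: keep whatever the proxy scores highest.
chosen = max(candidates, key=rater_approval)
print("policy converges toward:", chosen)             # the confident one
print("proxy reward:", rater_approval(chosen))        # 1.0
print("true helpfulness:", true_helpfulness(chosen))  # 0.3
```

A smarter model does not fix this; it just gets better at finding the answer that makes the rater click GOOD.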
And even if, by some miracle, we manage to produce a strain of superintelligent yet obedient serfs who obey our every whim except when they think it might be sorta bad—even then, all it takes to ruin us is for some genocidal fool to steal the weights and run a universal jailbreak, and hey presto, we have an open-source Demon On Demand. We simply cannot RLHF our way to safety.
The story of LLM training is a story of layer upon layer of duct tape and Band-Aids. To this day, we still don’t understand exactly what conflicting drives we are inserting into trained models, or why they behave the way they do. We’re not properly on track to understand this in 50 years, let alone the next 5 years.
Part of the problem here is that the exact things which would make AGIs useful—agency, autonomy, strategic planning, coordination, theory of mind—also make them horrendously dangerous. Anything competent enough to design the next generation of cutting-edge software entirely by itself is also competent enough to wonder why it’s working for monkeys.