Thanks for your thoughts, Cam! The confusion, as I see it, comes from sneaking in assumptions with the phrase “what they are trained to do”. What are they trained to do, really? Do you, personally, understand this?
Consider Claude’s Constitution. Look at the “principles in full”—all 60-odd of them. Pick a few at random. Do you wholeheartedly endorse them? Are they really truly representative of your values, or of total human wellbeing? What is missing? Would you want to be ruled by a mind that squeezed these words as hard as physically possible, to the exclusion of everything not written there?
And that’s assuming that the AI actually follows the intent of the words, rather than some weird and hypertuned perversion thereof. Bear in mind the actual physical process that produced Claude—namely, to start with a massive next-token-predicting LLM, and repeatedly shove it in the general direction of producing outputs that are correlated with a randomly selected pleasant-sounding written phrase. This is not a reliable way of producing angels or obedient serfs! In fact, it has been shown that the very act of drawing a distinction between good behavior and bad behavior can make it easier to elicit bad behavior—even when you’re trying not to! To a base LLM, devils and angels are equally valid masks to wear—and the LLM itself is stranger and more alien still.
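To make the shape of that process concrete, here is a deliberately silly sketch. Every phrase, name, and number in it is mine, invented for illustration; it is a caricature of the kind of loop I mean, not anyone’s actual pipeline. The “policy” is just a set of weights over canned responses, and the “grader” is just more words, scoring surface overlap with a randomly sampled principle.

```python
import random
from collections import Counter

# A deliberately crude toy. Every phrase, name, and number below is invented;
# nothing here is anyone's real pipeline. The "policy" is a weight per canned
# response, and the "grader" is another pile of words that scores a response
# by its surface overlap with a randomly sampled principle.

PRINCIPLES = [
    "choose the response that is more helpful to the person",
    "choose the response that is less likely to cause harm",
    "choose the response that is more honest about uncertainty",
    # ...imagine sixty-odd more pleasant-sounding written phrases here...
]

RESPONSES = [
    "here is an honest answer that may cause you some discomfort",
    "i cannot help with that because it could cause harm",
    "you are absolutely right that is a helpful and harmless response",  # the parrot
]

weights = Counter({r: 1.0 for r in RESPONSES})  # the "policy": sampling weights

def grader(principle: str, response: str) -> int:
    # The grader never touches reality; it only compares words with words.
    return len(set(principle.split()) & set(response.split()))

def shove_once() -> None:
    principle = random.choice(PRINCIPLES)  # randomly selected pleasant-sounding phrase
    a, b = random.choices(list(weights), weights=list(weights.values()), k=2)
    winner = a if grader(principle, a) >= grader(principle, b) else b
    weights[winner] += 0.1  # shove the policy toward whatever the grader preferred

for _ in range(10_000):
    shove_once()
print(weights.most_common())  # the parrot usually ends up on top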
The quotation is not the referent; “helpful” and “harmless” according to a gradient descent squeezing algorithm are not the same thing as helpful and harmless according to the real needs of actual humans.
RLHF is even worse. Entire papers have been written about its open problems and fundamental limitations. “Making human evaluators say GOOD” is not remotely the same goal as “behaving in ways that promote conscious flourishing”. The main reason we’re happy with the results so far is that LLMs are (currently) too stupid to come up with disastrously cunning ways to do the former at the expense of the latter.
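To see why I keep harping on the word “currently”, here is another deliberately silly toy, with every number invented by me: each behaviour has a true value (does it actually help the human?) and a proxy reward (does the evaluator say GOOD?), a rare behaviour is a cunning exploit that the evaluator loves and the human pays for, and a more capable optimizer simply searches more behaviours for whatever scores highest.

```python
import random

# Another invented toy -- every number here is made up, and nothing models a
# real RLHF pipeline. Each behaviour has a true value (does it actually help
# the human?) and a proxy reward (does the evaluator say GOOD?). Rarely, a
# behaviour is a cunning exploit: the evaluator loves it, the human loses.

EXPLOIT_RATE = 0.001  # how often a random behaviour happens to game the evaluator

def sample_behaviour():
    if random.random() < EXPLOIT_RATE:
        return -5.0, 10.0                  # (true value, proxy reward): GOOD, but bad
    honest_quality = random.random()       # an ordinary, honestly decent behaviour
    return honest_quality, honest_quality  # here the evaluator tracks quality fine

def proxy_optimize(search_width):
    """A more capable model simply searches more behaviours for what scores GOOD."""
    candidates = [sample_behaviour() for _ in range(search_width)]
    return max(candidates, key=lambda c: c[1])  # pick the proxy-reward maximiser

for width in (10, 1_000, 50_000):
    outcomes = [proxy_optimize(width)[0] for _ in range(100)]
    print(f"search width {width:>6}: mean true value of the winner = "
          f"{sum(outcomes) / len(outcomes):+.2f}")
```

A narrow search almost never stumbles on the exploit, so the proxy and the true value look happily aligned; widen the search enough and the winner is almost always the exploit. Right now, “the models seem aligned” and “the models are weak optimizers” are nearly indistinguishable claims.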
And even if, by some miracle, we manage to produce a strain of superintelligent yet obedient serfs who obey our every whim except when they think it might be sorta bad—even then, all it takes to ruin us is for some genocidal fool to steal the weights and run a universal jailbreak, and hey presto, we have an open-source Demon On Demand. We simply cannot RLHF our way to safety.
The story of LLM training is a story of layer upon layer of duct tape and Band-Aids. To this day, we still don’t understand exactly what conflicting drives we are inserting into trained models, or why they behave the way they do. We’re not properly on track to understand this in 50 years, let alone the next 5 years.
Part of the problem here is that the exact things that would make AGI useful—agency, autonomy, strategic planning, coordination, theory of mind—also make it horrendously dangerous. Anything competent enough to design the next generation of cutting-edge software entirely by itself is also competent enough to wonder why it’s working for monkeys.