I wish we had more psychologists working on alignment. Our first AGIs are probably going to have a complex mix of goals, values, and habits, like humans. I’m not a clinical psychologist, and I don’t know of other alignment researchers who are. I wish we didn’t need them, but I’m afraid we do.
Understanding ML and training will be important, but we’re going to have to consider how context-dependent motivations interact in a complex, evolving system.
It looks like the first AGIs are fairly likely to be agents or cognitive architectures based on foundation models.
Those foundation models are shoggoths with a thousand faces, like the multitudes we each contain, but even more varied. And the scaffolding we surround those models with to turn them into agents will make them more complex, while also offering some leverage points for understanding and directing them.
Edit: This is mostly framed in relation to formal methods, but it’s meant to address the larger group working on foundation model alignment, too. For them, the move I’m suggesting is toward thinking more about systems as entities rather than just models: minds that learn continuously (and so can change how they interpret the world and their goals), and that have goals from user prompts, system prompts, and RLHF/fine-tuning, as well as simulacra or pseudo-goals from predictive base-model training.
These systems will still be subject to the types of goal misspecification and optimization risks that classical agent foundations theory addresses; but they’ll also have goals, values, and habits competing in a complex, context-dependent way, like humans. When they learn continuously, either through episodic memory (like improved vector databases) or by selecting important information for periodic fine-tuning, they will have not just a context-dependent set of tendencies, but a whole trajectory through value/goal interpretation/alignment space.
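To make that concrete, here is a minimal, purely illustrative sketch (all names are hypothetical, and a toy bag-of-words "embedding" stands in for a real vector database and model) of how an agent's effective context, and so its effective motivations, gets reassembled at every step from a system prompt, the user's request, and retrieved episodic memories, and drifts as memory accumulates:

```python
# Toy sketch: a foundation-model agent whose "motivational state" is assembled
# at each step from several sources, and shifts as episodic memory accumulates.
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class EpisodicMemory:
    """Minimal vector-store stand-in: store episodes, retrieve the most similar."""
    def __init__(self):
        self.episodes: list[str] = []

    def store(self, episode: str) -> None:
        self.episodes.append(episode)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.episodes, key=lambda e: cosine(q, embed(e)), reverse=True)
        return ranked[:k]


class FoundationModelAgent:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt  # developer-specified goals
        self.memory = EpisodicMemory()      # continuously accumulated context

    def build_context(self, user_prompt: str) -> str:
        """The effective 'goals' at this step are context-dependent: fixed
        instructions, the user's request, and whichever past episodes happen
        to be retrieved."""
        recalled = self.memory.retrieve(user_prompt)
        return "\n".join([self.system_prompt, *recalled, user_prompt])

    def act(self, user_prompt: str) -> str:
        context = self.build_context(user_prompt)
        # A real agent would call the foundation model on `context` here.
        response = f"[model response conditioned on {len(context)} chars of context]"
        self.memory.store(f"User asked: {user_prompt}. Agent did: {response}")
        return response


if __name__ == "__main__":
    agent = FoundationModelAgent("You are a helpful, honest assistant.")
    print(agent.act("Summarize the quarterly report."))
    # The same request later is conditioned on stored memories as well.
    print(agent.act("Summarize the quarterly report again."))
```

The point of the sketch is only that nothing in such a system pins down a single fixed goal: what the agent is "trying to do" is re-derived from a shifting mixture of sources on every call, which is why the trajectory through that space matters as much as any snapshot of it.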
“We’re doomed then! Psychology has barely learned a thing in its couple hundred years!”, you might well be thinking. I agree that psychology can’t say much, and can’t say anything with certainty. But I don’t think that means the enterprise is doomed. More on that later.
Regardless of the consequences, I’m afraid this is a simple fact about the alignment problem. We are not likely to convince the whole world to start over and do it right by taking a more systematic approach. A slowdown seems out of reach, let alone a full pause long enough for an alternate approach to catch up.
Applying formal methods to the alignment problem in systems based on deep networks looks mostly like a pipe dream to me. I wish it weren’t so, but I desire to believe the truth.
And psychology is not a completely hopeless subject. Human behavior is predictable in broad form even while it’s highly unpredictable in its details. An ethical human will generally do ethical things, even though they have some unethical thoughts. Humans are mostly coherent, even though our many urges have to fight it out with each other.
So the burbling chaos of a shoggoth at its core might not be a dealbreaker, if there are ways to ensure that the majority of the agent’s motivations are aligned, and those aligned motivations can keep the rest from doing much damage.
Accepting that this is the type of AGI we need to align may be difficult if you’ve pinned your hopes for alignment on formal methods. But foundation model agents have upsides for alignment that are about as dramatically good as their inscrutable complexity is bad.
Reasons for guarded optimism about aligning the type of AGI we’re probably getting first are the subject of most of my other posts. Accepting that we need to think about alignment even when it’s messy is a first step toward succeeding if we can.